results: The results show that the proposed method can accurately detect lung cancer regions in CT images.
Abstract
Lung cancer is one of the most prevalent diseases in the world and causes many deaths. Detecting lung cancer at an early stage is therefore essential, and modeling and simulating intelligent medical systems can help specialists accurately determine and diagnose the disease. This paper contributes a new lung cancer detection model for CT images that uses machine learning methods. The model has three steps: noise reduction (pre-processing), segmentation (middle processing), and optimizing the segmentation to detect the exact area of the nodules. The article uses several filters for noise reduction and then uses an Independent Recurrent Neural Network (IndRNN), a deep learning method, for segmentation, optimized and tuned by a Genetic Algorithm. The results show that the proposed method can detect the exact area of nodules in CT images.
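To make the GA-tuning step concrete, here is a minimal, self-contained sketch of a genetic search over segmentation hyperparameters. The fitness function is a toy surrogate standing in for an actual IndRNN training run, and the hyperparameter names and ranges are illustrative assumptions, not taken from the paper:

```python
import math
import random

def train_and_evaluate(lr, hidden_units, dropout):
    # Stand-in for training the IndRNN segmenter and returning a
    # validation Dice score; replaced by a toy surrogate so the
    # sketch runs end to end.
    return -abs(math.log10(lr) + 3) - abs(dropout - 0.2) + hidden_units / 1000

def random_individual():
    return {"lr": 10 ** random.uniform(-4, -2),
            "hidden_units": random.choice([64, 128, 256]),
            "dropout": random.uniform(0.0, 0.5)}

def crossover(a, b):
    # Uniform crossover: each gene comes from either parent.
    return {k: random.choice([a[k], b[k]]) for k in a}

def mutate(ind, rate=0.2):
    out = dict(ind)
    if random.random() < rate:
        out["lr"] *= 10 ** random.uniform(-0.3, 0.3)
    if random.random() < rate:
        out["dropout"] = min(0.5, max(0.0, out["dropout"] + random.uniform(-0.1, 0.1)))
    return out

def genetic_search(pop_size=10, generations=5):
    population = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=lambda i: train_and_evaluate(**i), reverse=True)
        elite = ranked[: pop_size // 2]                # selection
        offspring = [mutate(crossover(random.choice(elite), random.choice(elite)))
                     for _ in range(pop_size - len(elite))]
        population = elite + offspring
    return max(population, key=lambda i: train_and_evaluate(**i))

print(genetic_search())
```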
results: The study achieved aberration-free full-color imaging and a resolution comparable to that of the ground-truth images.
Abstract
Recent advances in metasurface lenses (metalenses) have shown great potential for opening a new era in compact imaging, photography, light detection and ranging (LiDAR), and virtual reality/augmented reality (VR/AR) applications. However, the fundamental trade-off between broadband focusing efficiency and operating bandwidth limits the performance of broadband metalenses, resulting in chromatic aberration, angular aberration, and a relatively low efficiency. In this study, a deep-learning-based image restoration framework is proposed to overcome these limitations and realize end-to-end metalens imaging, thereby achieving aberration-free full-color imaging for mass-produced metalenses with a 10-mm diameter. Neural network-assisted metalens imaging achieved a high resolution comparable to that of the ground truth image.
results: By analyzing the measurement error sources of each component in the SCADA and PMU measurement chains and the reasons the measurement errors exhibit non-zero-mean, non-Gaussian, and time-varying statistical characteristics, this document provides information about the measurement chain models.
Abstract
In this document, the supervisory control and data acquisition (SCADA) and phasor measurement unit (PMU) measurement chain modeling will be studied, where the measurement error sources of each component in the SCADA and PMU measurement chains and the reasons leading to measurement errors exhibiting non-zero-mean, non-Gaussian, and time-varying statistical characteristics are summarized and analyzed. This document provides a few equations, figures, and discussions about the details of the SCADA and PMU measurement error chain modeling, which are intended to facilitate the understanding of how the measurement errors are designed for each component in the SCADA and PMU measurement chains. The measurement chain models described here are also used for synthesizing measurement errors with realistic characteristics in simulation cases to test the developed algorithms or methodologies.
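As an illustration of the last point, the sketch below synthesizes a measurement-error sequence with the three properties the document emphasizes: non-zero mean, a non-Gaussian distribution, and time-varying statistics. The mixture weights, variances, and bias drift are arbitrary placeholder values, not parameters from the document:

```python
import numpy as np

def synthesize_errors(n_samples, rng=None):
    """Generate a measurement-error sequence that is non-zero-mean,
    non-Gaussian (two-component Gaussian mixture), and time-varying
    (slowly drifting bias). All parameter values are illustrative."""
    rng = np.random.default_rng(rng)
    t = np.arange(n_samples)
    bias = 0.5 + 0.2 * np.sin(2 * np.pi * t / n_samples)   # drifting mean
    # Mixture: 90% narrow component, 10% heavy outlier component.
    narrow = rng.random(n_samples) < 0.9
    noise = np.where(narrow,
                     rng.normal(0.0, 0.1, n_samples),
                     rng.normal(0.0, 1.0, n_samples))
    return bias + noise

errors = synthesize_errors(10_000, rng=0)
print(f"mean={errors.mean():.3f}, std={errors.std():.3f}")
```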
Low-complexity Linear Multicast Beamforming for Cache-aided MIMO Communications
results: We evaluate the proposed scheme by defining the symmetric rate, i.e., the sum rate of the partially overlapping streams received per user. The non-convex symmetric rate maximization problem is solved using alternating optimization and successive convex approximation (SCA). In addition, a fast iterative Lagrangian-based algorithm is developed, reducing the computational burden compared with previous designs. Extensive simulations demonstrate the effectiveness of the proposal.
Abstract
A practical and scalable multicast beamformer design in multi-input multi-output (MIMO) coded caching (CC) systems is introduced in this paper. The proposed approach allows multicast transmission to multiple groups with partially overlapping user sets using receiver dimensions to distinguish between different group-specific streams. Additionally, it provides flexibility in accommodating various parameter configurations of the MIMO-CC setup and overcomes practical limitations, such as the requirement to use successive interference cancellation (SIC) at the receiver, while achieving the same degrees-of-freedom (DoF). To evaluate the proposed scheme, we define the symmetric rate as the sum rate of the partially overlapping streams received per user, comprising a linear multistream multicast transmission vector and the linear minimum mean square error (LMMSE) receiver. The resulting non-convex symmetric rate maximization problem is solved using alternating optimization and successive convex approximation (SCA). Moreover, a fast iterative Lagrangian-based algorithm is developed, significantly reducing the computational overhead compared to previous designs. The effectiveness of our proposed method is demonstrated by extensive simulations.
The optimization, design and performance of the FBCM23 ASIC for the upgraded CMS beam monitoring system
results: Through optimization of the design, the overall architecture, and the detailed implementation, a fast and precise beam condition monitoring system is realized.
Abstract
We present the development of the FBCM23 ASIC designed for the Phase-II upgrade of the Fast Beam Condition Monitoring (FBCM) system built at the CMS experiment which will replace the present luminometer based on the BCM1F ASIC [1]. The FBCM system should provide reliable luminosity measurement with 1 ns time resolution enabling the detection of beam-induced background. The FBCM23 ASIC comprises 6 channels of the fast front-end amplifier working in transimpedance configuration followed by CR-RC$^3$ shaper and leading edge discriminator. The paper will show the optimization of the design, overall architecture, and the detailed implementation in a CMOS 65 nm process as well as preliminary electrical performance.
Indefinite causal order for quantum phase estimation with Pauli noise
for: This paper explores the recent scheme of switched quantum channels with indefinite causal order applied to the task of quantum phase estimation, especially in the presence of Pauli noise.
methods: The paper uses switched quantum channels with indefinite causal order to investigate the impact of Pauli noise on the quantum phase estimation task.
results: The study finds that Pauli noise can give rise to nonstandard capabilities not available with standard quantum phase estimation. These capabilities have properties specific to the Pauli noises, while others are shared with depolarizing noise and thermal noise.
Abstract
This letter further explores the recent scheme of switched quantum channels with indefinite causal order applied to the reference metrological task of quantum phase estimation in the presence of noise. We especially extend the explorations, previously reported with depolarizing noise and thermal noise, to the class of Pauli noises, important to the qubit and not previously addressed. Nonstandard capabilities, not accessible with standard quantum phase estimation, are exhibited and analyzed, with significant properties that are specific to the Pauli noises, while other properties are found in common with the depolarizing noise or the thermal noise. The results show that the presence and the type of quantum noise are both crucial to the determination of the nonstandard capabilities from the switched channel with indefinite causal order, with a constructive action of noise reminiscent of stochastic resonance phenomena. The study contributes to a more comprehensive and systematic characterization of the roles and specificities of quantum noise in the operation of the novel devices of switched quantum channels with indefinite causal order.
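For readers who want to experiment numerically, the following is a minimal sketch of a switched qubit channel with indefinite causal order: two copies of a Pauli-noisy phase gate combined through the standard quantum-switch construction with the control prepared in |+>. The probe state, noise probabilities, and phase are arbitrary choices, and the sketch follows the textbook switch Kraus operators rather than the letter's own protocol:

```python
import numpy as np

I2 = np.eye(2)
X = np.array([[0, 1], [1, 0]], complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.diag([1, -1]).astype(complex)

def switched_channel(rho, phi, p):
    """Quantum switch of two copies of (Pauli noise after phase gate U_phi).
    p = (p0, px, py, pz) are the Pauli noise probabilities."""
    U = np.diag([1, np.exp(1j * phi)])
    kraus = [np.sqrt(pk) * P @ U for pk, P in zip(p, (I2, X, Y, Z))]
    c0 = np.array([[1, 0], [0, 0]], complex)   # |0><0| on the control
    c1 = np.array([[0, 0], [0, 1]], complex)   # |1><1| on the control
    plus = np.full((2, 2), 0.5, complex)       # control prepared in |+><+|
    rho_in = np.kron(rho, plus)
    rho_out = np.zeros((4, 4), complex)
    for Ki in kraus:
        for Kj in kraus:
            # Switch Kraus: order Ki.Kj if control is |0>, Kj.Ki if |1>.
            W = np.kron(Ki @ Kj, c0) + np.kron(Kj @ Ki, c1)
            rho_out += W @ rho_in @ W.conj().T
    return rho_out

# Probe in |+>, bit-flip noise, phase phi = 0.7 rad (all illustrative).
probe = np.full((2, 2), 0.5, complex)
rho_out = switched_channel(probe, 0.7, (0.6, 0.4, 0.0, 0.0))
# Probability of finding the control in |+> after the switch:
proj_plus = np.kron(np.eye(2), np.full((2, 2), 0.5))
print(np.real(np.trace(proj_plus @ rho_out)))
```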
On Age of Information and Energy-Transfer in a STAR-RIS-assisted System
paper_authors: Mohammad Reza Kavianinia, Mohammad Mehdi Setoode, Mohammad Javad Emadi
for: The main goal of this paper is to improve the performance of energy-limited devices and time-sensitive applications in wireless sensor networks, both charging the devices and providing status updates to time-sensitive information users.
methods: The paper uses a multi-antenna base station (BS) assisted by a simultaneously-transmitting-and-reflecting reconfigurable intelligent surface (STAR-RIS), controlling status-update performance through joint transmit beamforming at the BS and amplitude-phase optimization at the STAR-RIS for the information users.
results: By jointly optimizing the beamforming and scheduling, the average sum age of information of the time-sensitive information users is reduced while the minimum energy requirements of the energy-harvesting devices are satisfied. Additionally, two different energy-splitting and mode-switching policies at the STAR-RIS are studied.
Abstract
Battery-limited devices and time-sensitive applications are considered key players in the forthcoming wireless sensor network. Therefore, the main goal of the network is two-fold: charge battery-limited devices, and provide status updates to users where information freshness matters. In this paper, a multi-antenna base station (BS), assisted by a simultaneously-transmitting-and-reflecting reconfigurable intelligent surface (STAR-RIS), transmits power to energy-harvesting devices while controlling status-update performance at information users by analyzing the age of information (AoI) metric. We derive a scheduling policy at the BS and analyze joint transmit beamforming at the BS and amplitude-phase optimization at the STAR-RIS to reduce the average sum-AoI for the time-sensitive information users while satisfying the minimum required energy at the energy-harvesting users. Moreover, two different energy-splitting and mode-switching policies at the STAR-RIS are studied. Then, using an alternating optimization algorithm, the optimization problem is studied, and the non-convexity of the problem is tackled with the successive convex approximation technique. Through numerical results, the AoI metric and energy-harvesting requirements of the network are analyzed versus different parameters, such as the number of antennas at the BS, the size of the STAR-RIS, and the transmitted power, to highlight how the two-fold performance of the system can be improved by utilizing the STAR-RIS compared to the conventional RIS structure.
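To make the AoI metric concrete, the sketch below simulates the discrete-time age recursion for a handful of users under a simple round-robin schedule; a user's age resets on a successful update and grows by one otherwise. This is a textbook toy model, not the paper's beamforming optimization:

```python
import numpy as np

def average_sum_aoi(n_users=4, horizon=10_000, p_success=0.8, seed=0):
    """Discrete-time AoI under round-robin scheduling: one user is
    served per slot; on success its age resets to 1, otherwise all
    ages grow by 1. Returns the time-averaged sum AoI."""
    rng = np.random.default_rng(seed)
    age = np.ones(n_users)
    total = 0.0
    for t in range(horizon):
        served = t % n_users                 # round-robin schedule
        success = rng.random() < p_success   # update delivered?
        age += 1
        if success:
            age[served] = 1
        total += age.sum()
    return total / horizon

print(average_sum_aoi())
```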
Empowering the 6G Cellular Architecture with Open RAN
results: The study shows that adopting the Open RAN architecture can bring 6G networks many benefits, including greater scalability, better availability, higher energy efficiency, and lower cost. The Open RAN architecture can also support many new use cases, such as support for autonomous devices and cost-effective expansion into previously underserved regions.
Abstract
Innovation and standardization in 5G have brought advancements to every facet of the cellular architecture. This ranges from the introduction of new frequency bands and signaling technologies for the radio access network (RAN), to a core network underpinned by micro-services and network function virtualization (NFV). However, like any emerging technology, the pace of real-world deployments does not instantly match the pace of innovation. To address this discrepancy, one of the key aspects under continuous development is the RAN with the aim of making it more open, adaptive, functional, and easy to manage. In this paper, we highlight the transformative potential of embracing novel cellular architectures by transitioning from conventional systems to the progressive principles of Open RAN. This promises to make 6G networks more agile, cost-effective, energy-efficient, and resilient. It opens up a plethora of novel use cases, ranging from ubiquitous support for autonomous devices to cost-effective expansions in regions previously underserved. The principles of Open RAN encompass: (i) a disaggregated architecture with modular and standardized interfaces; (ii) cloudification, programmability and orchestration; and (iii) AI-enabled data-centric closed-loop control and automation. We first discuss the transformative role Open RAN principles have played in the 5G era. Then, we adopt a system-level approach and describe how these Open RAN principles will support 6G RAN and architecture innovation. We qualitatively discuss potential performance gains that Open RAN principles yield for specific 6G use cases. For each principle, we outline the steps that research, development and standardization communities ought to take to make Open RAN principles central to next-generation cellular network designs.
MAINS: A Magnetic Field Aided Inertial Navigation System for Indoor Positioning
results: Compared with a stand-alone INS, MAINS shows a remarkable two-orders-of-magnitude reduction in position error, and compared with the state-of-the-art magnetic-field-aided navigation method, it shows slightly improved horizontal position accuracy.
Abstract
A Magnetic field Aided Inertial Navigation System (MAINS) for indoor navigation is proposed in this paper. MAINS leverages an array of magnetometers to measure spatial variations in the magnetic field, which are then used to estimate the displacement and orientation changes of the system, thereby aiding the inertial navigation system (INS). Experiments show that MAINS significantly outperforms the stand-alone INS, demonstrating a remarkable two orders of magnitude reduction in position error. Furthermore, when compared to the state-of-the-art magnetic-field-aided navigation approach, the proposed method exhibits slightly improved horizontal position accuracy. On the other hand, it has noticeably larger vertical error on datasets with large magnetic field variations. However, one of the main advantages of MAINS compared to the state-of-the-art is that it enables flexible sensor configurations. The experimental results show that the position error after 2 minutes of navigation in most cases is less than 3 meters when using an array of 30 magnetometers. Thus, the proposed navigation solution has the potential to solve one of the key challenges faced with current magnetic-field simultaneous localization and mapping (SLAM) solutions: the very limited allowable length of the exploration phase during which unvisited areas are mapped.
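The following is a much-simplified illustration of the core idea (not the MAINS algorithm itself): fit a locally linear field model to the array readings, then invert the field change for the displacement. It assumes a rigid array, purely translational motion between epochs, and a well-conditioned field gradient:

```python
import numpy as np

def fit_linear_field(positions, fields):
    """Least-squares fit B(r) ~ b0 + J r from magnetometer-array data.
    positions: (N, 3) sensor positions; fields: (N, 3) field readings."""
    A = np.hstack([np.ones((len(positions), 1)), positions])  # [1, r]
    coeffs, *_ = np.linalg.lstsq(A, fields, rcond=None)       # (4, 3)
    return coeffs[0], coeffs[1:].T        # b0 (3,), Jacobian J (3, 3)

def displacement_from_field_change(positions, fields_t0, fields_t1):
    """Solve J * dp = dB for the small displacement dp between epochs,
    using the field change averaged over the array. Assumes the local
    field gradient J is well-conditioned and the motion is small."""
    b0, J = fit_linear_field(positions, fields_t0)
    r_mean = positions.mean(axis=0)
    dB = fields_t1.mean(axis=0) - (b0 + J @ r_mean)
    return np.linalg.solve(J, dB)
```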
Metasurface Sensing Approach to DOA Estimation of Coherent Signals
results: Compared with conventional DOA estimation methods, the method has two main advantages: first, only a single receiving channel is needed, avoiding channel mismatch errors; second, it can process multiple coherent signals. Performance curves under different conditions are analyzed and compared with existing methods, and simulation results demonstrate the effectiveness of the method.
Abstract
A DOA estimation method for coherent signals based on a periodically coded metasurface is proposed. After periodic coding, the DOA information of the incident signals in the time domain is represented as amplitude and phase information at different frequency points in the frequency domain. A finite-time Fourier transform (FTFT) is performed on the received signal, appropriate frequency points are selected to reconstruct the frequency-domain snapshot, and a pattern smoothing (PS) technique is then applied to perform DOA estimation. Compared with conventional DOA estimation methods, the proposed method has two main advantages: only a single receiving channel is needed, which avoids channel mismatch errors, and it can process multiple coherent signals. The performance curves of the proposed method are analyzed under different conditions and compared with existing methods. Simulation results show the effectiveness of the proposed method.
results: On experimental data recorded in both anechoic and reverberant environments, the proposed method outperforms state-of-the-art approaches and provides high-quality head-orientation estimates even for uncalibrated, irregular microphone setups.
Abstract
Determining the head orientation of a talker is not only beneficial for various speech signal processing applications, such as source localization or speech enhancement, but also facilitates intuitive voice control and interaction with smart environments or modern car assistants. Most approaches for head orientation estimation are based on visual cues. However, this requires camera systems which often are not available. We present an approach which purely uses audio signals captured with only a few distributed microphones around the talker. Specifically, we propose a novel method that directly incorporates measured or modeled speech radiation patterns to infer the talker's orientation during active speech periods based on a cosine similarity measure. Moreover, an automatic gain adjustment technique is proposed for uncalibrated, irregular microphone setups, such as ad-hoc sensor networks. In experiments with signals recorded in both anechoic and reverberant environments, the proposed method outperforms state-of-the-art approaches, using either measured or modeled speech radiation patterns.
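A minimal sketch of the matching step follows: compare measured per-microphone speech levels against a modeled radiation pattern for candidate orientations and keep the orientation with the highest cosine similarity. The cardioid-like pattern and the geometry are stand-ins, not the paper's measured radiation patterns:

```python
import numpy as np

def estimate_orientation(mic_angles, measured_levels, candidates=None):
    """mic_angles: angles (rad) of the microphones as seen from the
    talker; measured_levels: per-microphone speech magnitudes.
    A cardioid-like pattern is used as a stand-in radiation model."""
    if candidates is None:
        candidates = np.deg2rad(np.arange(0, 360, 5))
    best, best_sim = None, -np.inf
    m = measured_levels / np.linalg.norm(measured_levels)
    for theta in candidates:
        # Modeled level at each mic if the talker faces direction theta.
        pattern = 0.5 * (1 + np.cos(mic_angles - theta))
        p = pattern / np.linalg.norm(pattern)
        sim = float(m @ p)                 # cosine similarity
        if sim > best_sim:
            best, best_sim = theta, sim
    return best
```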
A text-dependent speaker verification application framework based on Chinese numerical string corpus
paper_authors: Litong Zheng, Feng Hong, Weijie Xu
for: This paper addresses text-dependent speaker verification (TD-SV) in short-speech scenarios and proposes a solution based on multi-scale pooling methods.
methods: The paper uses a Transformer-based text embedding network and a sliding-window attentive statistics pooling method, and fuses the text embedding with the speaker embedding in the back end.
results: The paper achieves equal error rate (EER) improvements of 49.2% on Hi-Mia and 75.0% on SHAL, demonstrating the effectiveness of the method for TD-SV.
Abstract
Research indicates that text-dependent speaker verification (TD-SV) often outperforms text-independent verification (TI-SV) in short speech scenarios. However, collecting large-scale fixed-text speech data is challenging, and as speech length increases, factors like sentence rhythm and pauses affect TD-SV's sensitivity to the text sequence. Based on these factors, we propose the hypothesis that strategies such as more fine-grained pooling methods on time scales and decoupled representations of the speaker embedding and the text embedding are more suitable for TD-SV. We have introduced an end-to-end TD-SV system based on a dataset comprising longer Chinese numerical string texts. It contains a text embedding network, a speaker embedding network, and back-end fusion. First, we recorded a dataset consisting of long Chinese numerical text named SHAL, which is publicly available on the Open-SLR website. We addressed the issue of dataset scarcity by augmenting it using Tacotron2 and HiFi-GAN. Next, we introduced a dual representation of speech with text embedding and speaker embedding. In the text embedding network, we employed an enhanced Transformer and introduced a triple loss that includes text classification loss, CTC loss, and decoder loss. For the speaker embedding network, we enhanced a sliding window attentive statistics pooling (SWASP), combined with attentive statistics pooling (ASP) to create a multi-scale pooling method. Finally, we fused text embedding and speaker embedding. Our pooling methods achieved an equal error rate (EER) performance improvement of 49.2% on Hi-Mia and 75.0% on SHAL, respectively.
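For reference, here is the standard attentive statistics pooling (ASP) layer that the proposed sliding-window variant builds on, sketched in PyTorch with illustrative dimensions:

```python
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    """Attention-weighted mean and standard deviation over time frames,
    concatenated into a single utterance-level vector."""
    def __init__(self, feat_dim, attn_dim=128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Conv1d(feat_dim, attn_dim, 1), nn.Tanh(),
            nn.Conv1d(attn_dim, feat_dim, 1))

    def forward(self, x):                 # x: (batch, feat_dim, frames)
        w = torch.softmax(self.attention(x), dim=2)   # frame weights
        mean = (x * w).sum(dim=2)
        var = (x * x * w).sum(dim=2) - mean ** 2
        std = torch.sqrt(var.clamp(min=1e-8))
        return torch.cat([mean, std], dim=1)          # (batch, 2*feat_dim)

pooled = AttentiveStatsPooling(256)(torch.randn(4, 256, 200))
print(pooled.shape)   # torch.Size([4, 512])
```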
Multimodal Speech Emotion Recognition Using Modality-specific Self-Supervised Frameworks
results: Trained on the publicly available IEMOCAP dataset, the model achieves an overall accuracy of 77.58% across four emotions, outperforming state-of-the-art approaches.
Abstract
Emotion recognition is a topic of significant interest in assistive robotics due to the need to equip robots with the ability to comprehend human behavior, facilitating their effective interaction in our society. Consequently, efficient and dependable emotion recognition systems supporting optimal human-machine communication are required. Multi-modality (including speech, audio, text, images, and videos) is typically exploited in emotion recognition tasks. Much relevant research is based on merging multiple data modalities and training deep learning models utilizing low-level data representations. However, most existing emotion databases are not large (or complex) enough to allow machine learning approaches to learn detailed representations. This paper explores modality-specific pre-trained transformer frameworks for self-supervised learning of speech and text representations for data-efficient emotion recognition while achieving state-of-the-art performance in recognizing emotions. This model applies feature-level fusion using nonverbal cue data points from motion capture to provide multimodal speech emotion recognition. The model was trained using the publicly available IEMOCAP dataset, achieving an overall accuracy of 77.58% for four emotions, outperforming state-of-the-art approaches.
Building Ears for Robots: Machine Hearing in the Age of Autonomy
results: The study presents a preliminary software framework rooted in probabilistic robotics theory, advocating the integration of robot hearing into the broader context of perception and decision-making. It covers various models, including Bayes filters, partially observable Markov decision processes (POMDPs), and multi-agent systems, highlighting the multifaceted roles that robot hearing can play.
Abstract
This study explores the significance of robot hearing systems, emphasizing their importance for robots operating in diverse and uncertain environments. It introduces the hardware design principles using robotaxis as an example, where exterior microphone arrays are employed to detect sound events such as sirens. The challenges, goals, and test methods are discussed, focusing on achieving a suitable signal-to-noise ratio (SNR). Additionally, it presents a preliminary software framework rooted in probabilistic robotics theory, advocating for the integration of robot hearing into the broader context of perception and decision-making. It discusses various models, including Bayes filters, partially observable Markov decision processes (POMDP), and multiagent systems, highlighting the multifaceted roles that robot hearing can play. In conclusion, as service robots continue to evolve, robot hearing research will expand, offering new perspectives and challenges for future development beyond simple sound event classification.
results: The model outperforms the baseline while maintaining high-quality audio generation and log-likelihood estimation, and is competitive with other state-of-the-art models in computational metrics and a listening test.
Abstract
This paper proposes SEFGAN, a Deep Neural Network (DNN) combining maximum likelihood training and Generative Adversarial Networks (GANs) for efficient speech enhancement (SE). For this, a DNN is trained to synthesize the enhanced speech conditioned on noisy speech using a Normalizing Flow (NF) as generator in a GAN framework. While the combination of likelihood models and GANs is not trivial, SEFGAN demonstrates that a hybrid adversarial and maximum likelihood training approach enables the model to maintain high quality audio generation and log-likelihood estimation. Our experiments indicate that this approach strongly outperforms the baseline NF-based model without introducing additional complexity to the enhancement network. A comparison using computational metrics and a listening experiment reveals that SEFGAN is competitive with other state-of-the-art models.
results: The method is validated quantitatively and qualitatively on test data from the NASA Ames Granite Lab and detects scene changes quickly and accurately. The runtimes of the method are also analyzed, and the source code is publicly released to support further development.
Abstract
This work presents an algorithm for scene change detection from point clouds to enable autonomous robotic caretaking in future space habitats. Autonomous robotic systems will help maintain future deep-space habitats, such as the Gateway space station, which will be uncrewed for extended periods. Existing scene analysis software used on the International Space Station (ISS) relies on manually-labeled images for detecting changes. In contrast, the algorithm presented in this work uses raw, unlabeled point clouds as inputs. The algorithm first applies modified Expectation-Maximization Gaussian Mixture Model (GMM) clustering to two input point clouds. It then performs change detection by comparing the GMMs using the Earth Mover's Distance. The algorithm is validated quantitatively and qualitatively using a test dataset collected by an Astrobee robot in the NASA Ames Granite Lab comprising single frame depth images taken directly by Astrobee and full-scene reconstructed maps built with RGB-D and pose data from Astrobee. The runtimes of the approach are also analyzed in depth. The source code is publicly released to promote further development.
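A minimal sketch of the two stages follows: fit a GMM to each point cloud with scikit-learn, then compare the mixtures with the Earth Mover's Distance via the POT optimal-transport package, using component means as support points and mixture weights as masses. The component count and data are placeholders, and the paper's modified EM step is not reproduced:

```python
import numpy as np
import ot                                  # POT: Python Optimal Transport
from sklearn.mixture import GaussianMixture

def gmm_emd(cloud_a, cloud_b, n_components=8, seed=0):
    """Fit a GMM to each (N, 3) point cloud and return the Earth
    Mover's Distance between the mixtures, with component means as
    support points and mixture weights as masses."""
    ga = GaussianMixture(n_components, random_state=seed).fit(cloud_a)
    gb = GaussianMixture(n_components, random_state=seed).fit(cloud_b)
    cost = ot.dist(ga.means_, gb.means_)   # pairwise squared Euclidean
    return ot.emd2(ga.weights_, gb.weights_, cost)

rng = np.random.default_rng(0)
before = rng.normal(size=(2000, 3))
after = np.vstack([before, rng.normal(loc=4.0, size=(200, 3))])  # new object
print(gmm_emd(before, after))
```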
MEDPSeg: End-to-end segmentation of pulmonary structures and lesions in computed tomography
paper_authors: Diedre S. Carmo, Jean Ribeiro, Alejandro P. Comellas, Joseph M. Reinhardt, Sarah E. Gerard, Letícia Rittner, Roberto A. Lotufo
for: This study aims to improve the accuracy of automated segmentation of computed tomography (CT) images for better diagnosis and assessment of lung disease.
methods: The method uses polymorphic training and multitask learning, optimizing multiple hierarchical structures with a fixed number of output channels.
results: The study achieves state-of-the-art performance on multiple targets, particularly the segmentation of ground glass opacities and consolidations, a challenge given the limited availability of manual annotations.
The COVID-19 pandemic response highlighted the potential of deep learning methods in facilitating the diagnosis and prognosis of lung diseases through automated segmentation of normal and abnormal tissue in computed tomography (CT). Such methods not only have the potential to aid in clinical decision-making but also contribute to the comprehension of novel diseases. In light of the labor-intensive nature of manual segmentation for large chest CT cohorts, there is a pressing need for reliable automated approaches that enable efficient analysis of chest CT anatomy in vast research databases, especially in more scarcely annotated targets such as pneumonia consolidations. A limiting factor for the development of such methods is that most current models optimize a fixed annotation format per network output. To tackle this problem, polymorphic training is used to optimize a network with a fixed number of output channels to represent multiple hierarchical anatomic structures, indirectly optimizing more complex labels with simpler annotations. We combined over 6000 volumetric CT scans containing varying formats of manual and automated labels from different sources, and used polymorphic training along with multitask learning to develop MEDPSeg, an end-to-end method for the segmentation of lungs, airways, pulmonary artery, and lung lesions with separation of ground glass opacities, and parenchymal consolidations, all in a single forward prediction. We achieve state-of-the-art performance in multiple targets, particularly in the segmentation of ground glass opacities and consolidations, a challenging problem with limited manual annotation availability. In addition, we provide an open-source implementation with a graphical user interface at https://github.com/MICLab-Unicamp/medpseg.
PointNeRF++: A multi-scale, point-based Neural Radiance Field
results: The method is validated on the NeRF Synthetic, ScanNet, and KITTI-360 datasets, significantly outperforming the state of the art.
Abstract
Point clouds offer an attractive source of information to complement images in neural scene representations, especially when few images are available. Neural rendering methods based on point clouds do exist, but they do not perform well when the point cloud quality is low -- e.g., sparse or incomplete, which is often the case with real-world data. We overcome these problems with a simple representation that aggregates point clouds at multiple scale levels with sparse voxel grids at different resolutions. To deal with point cloud sparsity, we average across multiple scale levels -- but only among those that are valid, i.e., that have enough neighboring points in proximity to the ray of a pixel. To help model areas without points, we add a global voxel at the coarsest scale, thus unifying "classical" and point-based NeRF formulations. We validate our method on the NeRF Synthetic, ScanNet, and KITTI-360 datasets, outperforming the state of the art by a significant margin.
Calibrated Uncertainties for Neural Radiance Fields
results: Both proposed techniques produce well-calibrated uncertainty estimates for NeRF models in the sparse-view setting while maintaining image quality, and the method is also useful in applications such as view enhancement and next-best-view selection.
Abstract
Neural Radiance Fields have achieved remarkable results for novel view synthesis but still lack a crucial component: precise measurement of uncertainty in their predictions. Probabilistic NeRF methods have tried to address this, but their output probabilities are not typically accurately calibrated, and therefore do not capture the true confidence levels of the model. Calibration is a particularly challenging problem in the sparse-view setting, where additional held-out data is unavailable for fitting a calibrator that generalizes to the test distribution. In this paper, we introduce the first method for obtaining calibrated uncertainties from NeRF models. Our method is based on a robust and efficient metric to calculate per-pixel uncertainties from the predictive posterior distribution. We propose two techniques that eliminate the need for held-out data. The first, based on patch sampling, involves training two NeRF models for each scene. The second is a novel meta-calibrator that only requires the training of one NeRF model. Our proposed approach for obtaining calibrated uncertainties achieves state-of-the-art uncertainty in the sparse-view setting while maintaining image quality. We further demonstrate our method's effectiveness in applications such as view enhancement and next-best view selection.
CLIPDrawX: Primitive-based Explanations for Text Guided Sketch Synthesis
results: The paper proposes the CLIPDrawX algorithm, which provides better visualizations of CLIP text embeddings; the synthesis process can be tracked end to end, with each visual concept explained exclusively in terms of primitive shapes.
Abstract
With the goal of understanding the visual concepts that CLIP associates with text prompts, we show that the latent space of CLIP can be visualized solely in terms of linear transformations on simple geometric primitives like circles and straight lines. Although existing approaches achieve this by sketch-synthesis-through-optimization, they do so on the space of B\'ezier curves, which exhibit a wastefully large set of structures that they can evolve into, as most of them are non-essential for generating meaningful sketches. We present CLIPDrawX, an algorithm that provides significantly better visualizations for CLIP text embeddings, using only simple primitive shapes like straight lines and circles. This constrains the set of possible outputs to linear transformations on these primitives, thereby exhibiting an inherently simpler mathematical form. The synthesis process of CLIPDrawX can be tracked end-to-end, with each visual concept being explained exclusively in terms of primitives. Implementation will be released upon acceptance. Project Page: https://clipdrawx.github.io/.
STEREOFOG – Computational DeFogging via Image-to-Image Translation on a real-world Dataset
for: The paper explores the potential of image-to-image translation (I2I) for fog removal, using a real-world dataset called STEREOFOG.
methods: The paper uses the pix2pix I2I machine learning (ML) framework and optimizes it for the STEREOFOG dataset.
results: The final model achieves an average Complex Wavelet-Structural Similarity (CW-SSIM) score of $0.76$, demonstrating the suitability of the technique for fog removal.
Abstract
Image-to-Image translation (I2I) is a subtype of Machine Learning (ML) that has tremendous potential in applications where two domains of images and the need for translation between the two exist, such as the removal of fog. For example, this could be useful for autonomous vehicles, which currently struggle with adverse weather conditions like fog. However, datasets for I2I tasks are not abundant and typically hard to acquire. Here, we introduce STEREOFOG, a dataset comprised of $10,067$ paired fogged and clear images, captured using a custom-built device, with the purpose of exploring I2I's potential in this domain. It is the only real-world dataset of this kind to the best of our knowledge. Furthermore, we apply and optimize the pix2pix I2I ML framework to this dataset. With the final model achieving an average Complex Wavelet-Structural Similarity (CW-SSIM) score of $0.76$, we prove the technique's suitability for the problem.
Cable Slack Detection for Arresting Gear Application using Machine Vision
results: The method is validated on real shipboard video data, accurately detecting slack in the arresting-gear cable with minimal false positives. A user interface is also provided so that operators can easily redefine regions of interest and adjust the methods to specific locations.
Abstract
The cable-based arrestment systems are integral to the launch and recovery of aircraft onboard carriers and on expeditionary land-based installations. These modern arrestment systems rely on various mechanisms to absorb energy from an aircraft during an arrestment cycle to bring the aircraft to a full stop. One of the primary components of this system is the cable interface to the engine. The formation of slack in the cable at this interface can result in reduced efficiency and drives maintenance efforts to remove the slack prior to continued operations. In this paper, a machine vision based slack detection system is presented. A situational awareness camera is utilized to collect video data of the cable interface region, machine vision algorithms are applied to reduce noise, remove background clutter, focus on regions of interest, and detect changes in the image representative of slack formations. Some algorithms employed in this system include bilateral image filters, least squares polynomial fit, Canny Edge Detection, K-Means clustering, Gaussian Mixture-based Background/Foreground Segmentation for background subtraction, Hough Circle Transforms, and Hough line Transforms. The resulting detections are filtered and highlighted to create an indication to the shipboard operator of the presence of slack and a need for a maintenance action. A user interface was designed to provide operators with an easy method to redefine regions of interest and adjust the methods to specific locations. The algorithms were validated on shipboard footage and were able to accurately identify slack with minimal false positives.
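A single-frame slice of such a pipeline can be sketched with the OpenCV calls the abstract names (bilateral filtering, background subtraction, Canny edges, and probabilistic Hough lines). The thresholds are illustrative, and the slack heuristic is deliberately crude:

```python
import cv2
import numpy as np

bg_subtractor = cv2.createBackgroundSubtractorMOG2(history=200)

def detect_cable_segments(frame, roi):
    """Return detected line segments inside the region of interest.
    A taut cable yields one dominant straight line; extra, diverging
    segments are a crude indicator of slack (illustrative heuristic)."""
    x, y, w, h = roi
    patch = frame[y:y + h, x:x + w]
    smooth = cv2.bilateralFilter(patch, d=9, sigmaColor=75, sigmaSpace=75)
    gray = cv2.cvtColor(smooth, cv2.COLOR_BGR2GRAY)
    fg = bg_subtractor.apply(smooth)           # suppress static clutter
    edges = cv2.Canny(gray, 50, 150)
    edges = cv2.bitwise_and(edges, edges, mask=fg)
    lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180,
                            threshold=40, minLineLength=60, maxLineGap=10)
    return [] if lines is None else lines[:, 0]

# Usage idea: flag slack when the detected segment angles spread
# beyond a tolerance around the expected cable direction.
```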
MoE-AMC: Enhancing Automatic Modulation Classification Performance Using Mixture-of-Experts
for: This paper proposes a novel Mixture-of-Experts (MoE) based model called MoE-AMC to address Automatic Modulation Classification (AMC) in a well-balanced manner across varying Signal-to-Noise Ratio (SNR) conditions.
methods: The proposed MoE-AMC model combines the strengths of two existing models, LSRM (a Transformer-based model) and HSRM (a ResNet-based model), using the MoE framework. This integration enables MoE-AMC to capture distinctive signal features under diverse SNR scenarios and achieve leading performance in modulation classification.
results: The proposed MoE-AMC model achieved an average classification accuracy of 71.76% across different SNR levels in experiments on the RML2018.01a dataset, surpassing previous state-of-the-art (SOTA) models by nearly 10%.
Abstract
Automatic Modulation Classification (AMC) plays a vital role in time series analysis, such as signal classification and identification within wireless communications. Deep learning-based AMC models have demonstrated significant potential in this domain. However, current AMC models inadequately consider the disparities in handling signals under conditions of low and high Signal-to-Noise Ratio (SNR), resulting in an unevenness in their performance. In this study, we propose MoE-AMC, a novel Mixture-of-Experts (MoE) based model specifically crafted to address AMC in a well-balanced manner across varying SNR conditions. Utilizing the MoE framework, MoE-AMC seamlessly combines the strengths of LSRM (a Transformer-based model) for handling low SNR signals and HSRM (a ResNet-based model) for high SNR signals. This integration empowers MoE-AMC to achieve leading performance in modulation classification, showcasing its efficacy in capturing distinctive signal features under diverse SNR scenarios. We conducted experiments using the RML2018.01a dataset, where MoE-AMC achieved an average classification accuracy of 71.76% across different SNR levels, surpassing the performance of previous SOTA models by nearly 10%. This study represents a pioneering application of MoE techniques in the realm of AMC, offering a promising avenue for elevating signal classification accuracy within wireless communication systems.
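A minimal PyTorch sketch of the MoE combination follows: a gate mixes the outputs of two expert classifiers per sample. The experts are reduced to MLP stubs standing in for the paper's Transformer-based LSRM and ResNet-based HSRM:

```python
import torch
import torch.nn as nn

class MoEAMC(nn.Module):
    """Two-expert mixture for modulation classification. The experts
    here are simple MLP stand-ins for the paper's LSRM/HSRM backbones."""
    def __init__(self, in_dim=1024, n_classes=24):
        super().__init__()
        def expert():
            return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, n_classes))
        self.low_snr_expert = expert()     # stand-in for LSRM
        self.high_snr_expert = expert()    # stand-in for HSRM
        self.gate = nn.Sequential(nn.Linear(in_dim, 2), nn.Softmax(dim=-1))

    def forward(self, x):                  # x: (batch, in_dim) flattened I/Q
        g = self.gate(x)                   # per-sample expert weights
        logits = torch.stack([self.low_snr_expert(x),
                              self.high_snr_expert(x)], dim=1)  # (B, 2, C)
        return (g.unsqueeze(-1) * logits).sum(dim=1)

out = MoEAMC()(torch.randn(8, 1024))
print(out.shape)   # torch.Size([8, 24])
```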
You Can Run but not Hide: Improving Gait Recognition with Intrinsic Occlusion Type Awareness
results: Experiments on the GREW and BRIAR datasets show that, under the same occlusions, adding occlusion awareness improves recognition accuracy.
Abstract
While gait recognition has seen many advances in recent years, the occlusion problem has largely been ignored. This problem is especially important for gait recognition from uncontrolled outdoor sequences at range - since any small obstruction can affect the recognition system. Most current methods assume the availability of complete body information while extracting the gait features. When parts of the body are occluded, these methods may hallucinate and output a corrupted gait signature as they try to look for body parts which are not present in the input at all. To address this, we exploit the learned occlusion type while extracting identity features from videos. Thus, in this work, we propose an occlusion aware gait recognition method which can be used to model intrinsic occlusion awareness into potentially any state-of-the-art gait recognition method. Our experiments on the challenging GREW and BRIAR datasets show that networks enhanced with this occlusion awareness perform better at recognition tasks than their counterparts trained on similar occlusions.
PatchFusion: An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation
paper_authors: Zhenyu Li, Shariq Farooq Bhat, Peter Wonka
for: High-resolution single-image depth estimation.
methods: A patch-wise fusion network, a Global-to-Local module, and Consistency-Aware Training and Inference.
results: Generates high-resolution depth maps with intricate details, improving RMSE by 17.3% and 29.4% on UnrealStereo4K and MVS-Synth, respectively.
Abstract
Single image depth estimation is a foundational task in computer vision and generative modeling. However, prevailing depth estimation models grapple with accommodating the increasing resolutions commonplace in today's consumer cameras and devices. Existing high-resolution strategies show promise, but they often face limitations, ranging from error propagation to the loss of high-frequency details. We present PatchFusion, a novel tile-based framework with three key components to improve the current state of the art: (1) A patch-wise fusion network that fuses a globally-consistent coarse prediction with finer, inconsistent tiled predictions via high-level feature guidance, (2) A Global-to-Local (G2L) module that adds vital context to the fusion network, discarding the need for patch selection heuristics, and (3) A Consistency-Aware Training (CAT) and Inference (CAI) approach, emphasizing patch overlap consistency and thereby eradicating the necessity for post-processing. Experiments on UnrealStereo4K, MVS-Synth, and Middleburry 2014 demonstrate that our framework can generate high-resolution depth maps with intricate details. PatchFusion is independent of the base model for depth estimation. Notably, our framework built on top of SOTA ZoeDepth brings improvements for a total of 17.3% and 29.4% in terms of the root mean squared error (RMSE) on UnrealStereo4K and MVS-Synth, respectively.
results: By assigning colors from the neural implicit field to an extracted polygon mesh, users can perform high-level edits through gradient back-propagation, including object additions, component removals, deformations of specific parts, and color adjustments.
Abstract
Neural implicit fields have emerged as a powerful 3D representation for reconstructing and rendering photo-realistic views, yet they possess limited editability. Conversely, explicit 3D representations, such as polygonal meshes, offer ease of editing but may not be as suitable for rendering high-quality novel views. To harness the strengths of both representations, we propose a new approach that employs a mesh as a guiding mechanism in editing the neural radiance field. We first introduce a differentiable method using marching tetrahedra for polygonal mesh extraction from the neural implicit field and then design a differentiable color extractor to assign colors obtained from the volume renderings to this extracted mesh. This differentiable colored mesh allows gradient back-propagation from the explicit mesh to the implicit fields, empowering users to easily manipulate the geometry and color of neural implicit fields. To enhance user control from coarse-grained to fine-grained levels, we introduce an octree-based structure into its optimization. This structure prioritizes the edited regions and the surface part, making our method achieve fine-grained edits to the neural implicit field and accommodate various user modifications, including object additions, component removals, specific area deformations, and adjustments to local and global colors. Through extensive experiments involving diverse scenes and editing operations, we have demonstrated the capabilities and effectiveness of our method. Our project page is: \url{https://cassiepython.github.io/MNeuEdit/}
GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis
results: The experiments show that the method achieves 2K-resolution rendering under a sparse-view camera setting and outperforms prior methods while rendering faster.
Abstract
We present a new approach, termed GPS-Gaussian, for synthesizing novel views of a character in a real-time manner. The proposed method enables 2K-resolution rendering under a sparse-view camera setting. Unlike the original Gaussian Splatting or neural implicit rendering methods that necessitate per-subject optimizations, we introduce Gaussian parameter maps defined on the source views and regress directly Gaussian Splatting properties for instant novel view synthesis without any fine-tuning or optimization. To this end, we train our Gaussian parameter regression module on a large amount of human scan data, jointly with a depth estimation module to lift 2D parameter maps to 3D space. The proposed framework is fully differentiable and experiments on several datasets demonstrate that our method outperforms state-of-the-art methods while achieving an exceeding rendering speed.
Aligning and Prompting Everything All at Once for Universal Visual Perception
results: Experiments show that APE matches or exceeds the state of the art on a large number of datasets without task-specific fine-tuning.
Abstract
Vision foundation models have been explored recently to build general-purpose vision systems. However, predominant paradigms, driven by casting instance-level tasks as an object-word alignment, bring heavy cross-modality interaction, which is not effective in prompting object detection and visual grounding. Another line of work that focuses on pixel-level tasks often encounters a large annotation gap of things and stuff, and suffers from mutual interference between foreground-object and background-class segmentation. In stark contrast to the prevailing methods, we present APE, a universal visual perception model for aligning and prompting everything all at once in an image to perform diverse tasks, i.e., detection, segmentation, and grounding, as an instance-level sentence-object matching paradigm. Specifically, APE advances the convergence of detection and grounding by reformulating language-guided grounding as open-vocabulary detection, which efficiently scales up model prompting to thousands of category vocabularies and region descriptions while maintaining the effectiveness of cross-modality fusion. To bridge the granularity gap of different pixel-level tasks, APE equalizes semantic and panoptic segmentation to proxy instance learning by considering any isolated regions as individual instances. APE aligns vision and language representation on broad data with natural and challenging characteristics all at once without task-specific fine-tuning. The extensive experiments on over 160 datasets demonstrate that, with only one-suit of weights, APE outperforms (or is on par with) the state-of-the-art models, proving that an effective yet universal perception for anything aligning and prompting is indeed feasible. Codes and trained models are released at https://github.com/shenyunhang/APE.
Steerers: A framework for rotation equivariant keypoint descriptors
results: By learning a linear transform in description space, descriptors are made robust to rotations; we call this transform a steerer. Optimizing over the possible steerer settings yields state-of-the-art results on AIMS and Roto-360. Code and model weights are released on GitHub.
Abstract
Image keypoint descriptions that are discriminative and matchable over large changes in viewpoint are vital for 3D reconstruction. However, descriptions output by learned descriptors are typically not robust to camera rotation. While they can be made more robust by, e.g., data augmentation, this degrades performance on upright images. Another approach is test-time augmentation, which incurs a significant increase in runtime. We instead learn a linear transform in description space that encodes rotations of the input image. We call this linear transform a steerer since it allows us to transform the descriptions as if the image was rotated. From representation theory we know all possible steerers for the rotation group. Steerers can be optimized (A) given a fixed descriptor, (B) jointly with a descriptor or (C) we can optimize a descriptor given a fixed steerer. We perform experiments in all of these three settings and obtain state-of-the-art results on the rotation invariant image matching benchmarks AIMS and Roto-360. We publish code and model weights at github.com/georg-bn/rotation-steerers.
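A sketch of fitting a steerer in setting (A), the fixed-descriptor case, follows: given a frozen descriptor network (a stub here, not a real keypoint descriptor), learn a linear map W so that describing a 90-degree-rotated image matches applying W to the original description:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
D = 128
descriptor = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, D))  # frozen stub
for p in descriptor.parameters():
    p.requires_grad_(False)

steerer = nn.Linear(D, D, bias=False)      # the learned steerer W
opt = torch.optim.Adam(steerer.parameters(), lr=1e-3)

for step in range(1000):
    imgs = torch.randn(64, 1, 32, 32)
    rotated = torch.rot90(imgs, k=1, dims=(2, 3))   # 90-degree rotation
    desc = descriptor(imgs)
    desc_rot = descriptor(rotated)
    # Encourage descriptor(rot(img)) == W @ descriptor(img).
    loss = ((steerer(desc) - desc_rot) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```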
Readout Guidance: Learning Control from Diffusion Features
results: Compared with prior methods for conditional generation, Readout Guidance requires significantly fewer added parameters and training samples, and offers a convenient, simple recipe for reproducing different forms of conditional control with a single architecture and sampling procedure. We showcase these benefits in drag-based manipulation, identity-consistent generation, and spatially aligned control. Project page: https://readout-guidance.github.io.
Abstract
We present Readout Guidance, a method for controlling text-to-image diffusion models with learned signals. Readout Guidance uses readout heads, lightweight networks trained to extract signals from the features of a pre-trained, frozen diffusion model at every timestep. These readouts can encode single-image properties, such as pose, depth, and edges; or higher-order properties that relate multiple images, such as correspondence and appearance similarity. Furthermore, by comparing the readout estimates to a user-defined target, and back-propagating the gradient through the readout head, these estimates can be used to guide the sampling process. Compared to prior methods for conditional generation, Readout Guidance requires significantly fewer added parameters and training samples, and offers a convenient and simple recipe for reproducing different forms of conditional control under a single framework, with a single architecture and sampling procedure. We showcase these benefits in the applications of drag-based manipulation, identity-consistent generation, and spatially aligned control. Project page: https://readout-guidance.github.io.
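The guidance step can be sketched as follows: compute a readout from the frozen model's features, compare it to the user-defined target, and nudge the noisy latent along the gradient before the usual sampler update. The `diffusion_features`, `readout_head`, and `denoise` callables are hypothetical placeholders, not the released API:

```python
import torch

def guided_step(x_t, t, target, diffusion_features, readout_head,
                denoise, guidance_weight=1.0):
    """One denoising step with readout guidance. `diffusion_features`,
    `readout_head`, and `denoise` are hypothetical callables standing in
    for the frozen diffusion model's feature hook, a trained readout
    head, and the usual sampler update."""
    x_t = x_t.detach().requires_grad_(True)
    feats = diffusion_features(x_t, t)       # frozen diffusion features
    readout = readout_head(feats, t)         # e.g. predicted pose/depth
    loss = ((readout - target) ** 2).mean()  # match the user target
    grad = torch.autograd.grad(loss, x_t)[0]
    x_guided = x_t - guidance_weight * grad  # steer the latent
    return denoise(x_guided.detach(), t)     # ordinary sampler update
```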
Rejuvenating image-GPT as Strong Visual Representation Learners
results: Experiments show that D-iGPT reaches 89.5% top-1 accuracy on ImageNet-1K and exhibits strong generalization on downstream tasks and robustness on out-of-distribution samples.
Abstract
This paper enhances image-GPT (iGPT), one of the pioneering works that introduce autoregressive pretraining to predict next pixels for visual representation learning. Two simple yet essential changes are made. First, we shift the prediction target from raw pixels to semantic tokens, enabling a higher-level understanding of visual content. Second, we supplement the autoregressive modeling by instructing the model to predict not only the next tokens but also the visible tokens. This pipeline is particularly effective when semantic tokens are encoded by discriminatively trained models, such as CLIP. We introduce this novel approach as D-iGPT. Extensive experiments showcase that D-iGPT excels as a strong learner of visual representations: A notable achievement of D-iGPT is its compelling performance on the ImageNet-1K dataset -- by training on publicly available datasets, D-iGPT achieves 89.5\% top-1 accuracy with a vanilla ViT-Large model. This model also shows strong generalization on the downstream task and robustness on out-of-distribution samples. Code is available at \href{https://github.com/OliverRensu/D-iGPT}{https://github.com/OliverRensu/D-iGPT}.
Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation
methods: The method is derived from Stable Diffusion, retains its rich prior knowledge, and can be fine-tuned within a couple of days on a single GPU using only synthetic data.
results: The method achieves state-of-the-art performance across a wide range of datasets, including over 20% performance gains in specific cases.
Abstract
Monocular depth estimation is a fundamental computer vision task. Recovering 3D depth from a single image is geometrically ill-posed and requires scene understanding, so it is not surprising that the rise of deep learning has led to a breakthrough. The impressive progress of monocular depth estimators has mirrored the growth in model capacity, from relatively modest CNNs to large Transformer architectures. Still, monocular depth estimators tend to struggle when presented with images with unfamiliar content and layout, since their knowledge of the visual world is restricted by the data seen during training, and challenged by zero-shot generalization to new domains. This motivates us to explore whether the extensive priors captured in recent generative diffusion models can enable better, more generalizable depth estimation. We introduce Marigold, a method for affine-invariant monocular depth estimation that is derived from Stable Diffusion and retains its rich prior knowledge. The estimator can be fine-tuned in a couple of days on a single GPU using only synthetic training data. It delivers state-of-the-art performance across a wide range of datasets, including over 20% performance gains in specific cases. Project page: https://marigoldmonodepth.github.io.
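Affine-invariant depth estimation means predictions are defined only up to an unknown global scale and shift, so evaluation first aligns each prediction to the ground truth by least squares. A minimal NumPy sketch of that alignment step (variable names and the synthetic test data are illustrative):

```python
import numpy as np

def align_affine_invariant(pred, gt, mask=None):
    """Least-squares scale s and shift t minimizing ||s*pred + t - gt||^2
    over valid pixels, as used when evaluating affine-invariant depth."""
    p, g = pred.ravel(), gt.ravel()
    if mask is not None:
        m = mask.ravel().astype(bool)
        p, g = p[m], g[m]
    A = np.stack([p, np.ones_like(p)], axis=1)   # columns: pred, constant
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    return s * pred + t

pred = np.random.rand(64, 64)
gt = 2.5 * pred + 0.3 + 0.01 * np.random.randn(64, 64)
aligned = align_affine_invariant(pred, gt)
print(np.abs(aligned - gt).mean())  # small residual after alignment
```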
Optimizing Camera Configurations for Multi-View Pedestrian Detection
results: Across multiple simulation scenarios, the configuration generator consistently outperforms random search, heuristic-based methods, and configurations designed by human experts.
Abstract
Jointly considering multiple camera views (multi-view) is very effective for pedestrian detection under occlusion. For such multi-view systems, it is critical to have well-designed camera configurations, including camera locations, directions, and fields-of-view (FoVs). Usually, these configurations are crafted based on human experience or heuristics. In this work, we present a novel solution that features a transformer-based camera configuration generator. Using reinforcement learning, this generator autonomously explores vast combinations within the action space and searches for configurations that give the highest detection accuracy according to the training dataset. The generator learns advanced techniques like maximizing coverage, minimizing occlusion, and promoting collaboration. Across multiple simulation scenarios, the configurations generated by our transformer-based model consistently outperform random search, heuristic-based methods, and configurations designed by human experts, shedding light on future camera layout optimization.
results: In experiments, the compact decoder matches the full model's performance while being notably more efficient.
Abstract
We present an approach to pose object recognition as next token prediction. The idea is to apply a language decoder that auto-regressively predicts the text tokens from image embeddings to form labels. To ground this prediction process in auto-regression, we customize a non-causal attention mask for the decoder, incorporating two key features: modeling tokens from different labels to be independent, and treating image tokens as a prefix. This masking mechanism inspires an efficient method - one-shot sampling - to simultaneously sample tokens of multiple labels in parallel and rank generated labels by their probabilities during inference. To further enhance the efficiency, we propose a simple strategy to construct a compact decoder by simply discarding the intermediate blocks of a pretrained language model. This approach yields a decoder that matches the full model's performance while being notably more efficient. The code is available at https://github.com/kaiyuyue/nxtp
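The non-causal attention mask is easy to state concretely. The sketch below builds the boolean mask described above with illustrative sizes: image tokens form a fully visible prefix, each label is causal within itself, and different labels cannot attend to each other, which is what enables one-shot parallel sampling of multiple labels.

```python
import torch

def build_mask(num_image_tokens, num_labels, tokens_per_label):
    """Attention mask sketch: True = may attend. Image tokens are a fully
    visible prefix; each label is causal internally and blind to the
    other labels, so several labels can be sampled in parallel."""
    n = num_image_tokens + num_labels * tokens_per_label
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:num_image_tokens, :num_image_tokens] = True  # bidirectional prefix
    mask[num_image_tokens:, :num_image_tokens] = True  # labels see the image
    for i in range(num_labels):
        s = num_image_tokens + i * tokens_per_label
        e = s + tokens_per_label
        mask[s:e, s:e] = torch.ones(tokens_per_label,
                                    tokens_per_label).tril().bool()
    return mask

print(build_mask(num_image_tokens=4, num_labels=2, tokens_per_label=3))
```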
results: Extensive experiments demonstrate that IL achieves superior performance on tasks such as feature matching and pose estimation, with an average accuracy gain of 30% over state-of-the-art matching models.
Abstract
Learning feature correspondence is a foundational task in computer vision, holding immense importance for downstream applications such as visual odometry and 3D reconstruction. Despite recent progress in data-driven models, feature correspondence learning is still limited by the lack of accurate per-pixel correspondence labels. To overcome this difficulty, we introduce a new self-supervised scheme, imperative learning (IL), for training feature correspondence. It enables correspondence learning on arbitrary uninterrupted videos without any camera pose or depth labels, heralding a new era for self-supervised correspondence learning. Specifically, we formulated the problem of correspondence learning as a bilevel optimization, which takes the reprojection error from bundle adjustment as a supervisory signal for the model. To avoid large memory and computation overhead, we leverage the stationary point to effectively back-propagate the implicit gradients through bundle adjustment. Through extensive experiments, we demonstrate superior performance on tasks including feature matching and pose estimation, in which we obtained an average of 30% accuracy gain over the state-of-the-art matching models.
MANUS: Markerless Hand-Object Grasp Capture using Articulated 3D Gaussians
methods: Markerless hand-object grasp capture using articulated 3D Gaussians, which extends 3D Gaussian splatting to model articulating hands.
results: The method estimates contacts between the hand and the object with high accuracy and is more accurate than other approaches.
Abstract
Understanding how we grasp objects with our hands has important applications in areas like robotics and mixed reality. However, this challenging problem requires accurate modeling of the contact between hands and objects. To capture grasps, existing methods use skeletons, meshes, or parametric models that can cause misalignments resulting in inaccurate contacts. We present MANUS, a method for Markerless Hand-Object Grasp Capture using Articulated 3D Gaussians. We build a novel articulated 3D Gaussians representation that extends 3D Gaussian splatting for high-fidelity representation of articulating hands. Since our representation uses Gaussian primitives, it enables us to efficiently and accurately estimate contacts between the hand and the object. For the most accurate results, our method requires tens of camera views that current datasets do not provide. We therefore build MANUS-Grasps, a new dataset that contains hand-object grasps viewed from 53 cameras across 30+ scenes, 3 subjects, and comprising over 7M frames. In addition to extensive qualitative results, we also show that our method outperforms others on a quantitative contact evaluation method that uses paint transfer from the object to the hand.
BerfScene: Bev-conditioned Equivariant Radiance Fields for Infinite 3D Scene Generation
results: By incorporating positional encoding and low-pass filters into the generator, the representation becomes equivariant to the BEV map, enabling the generation of large-scale, even infinite-scale, 3D scenes.
Abstract
Generating large-scale 3D scenes cannot be done by simply applying existing 3D object synthesis techniques, since 3D scenes usually hold complex spatial configurations and consist of a number of objects at varying scales. We thus propose a practical and efficient 3D representation that incorporates an equivariant radiance field with the guidance of a bird's-eye view (BEV) map. Concretely, objects of synthesized 3D scenes could be easily manipulated through steering the corresponding BEV maps. Moreover, by adequately incorporating positional encoding and low-pass filters into the generator, the representation becomes equivariant to the given BEV map. Such equivariance allows us to produce large-scale, even infinite-scale, 3D scenes via synthesizing local scenes and then stitching them with smooth consistency. Extensive experiments on 3D scene datasets demonstrate the effectiveness of our approach. Our project website is at https://zqh0253.github.io/BerfScene/.
Re-Nerfing: Enforcing Geometric Constraints on Neural Radiance Fields through Novel Views Synthesis
results: Extensive experiments on the mip-NeRF 360 dataset show improvements over Zip-NeRF, even when training with only sparse views.
Abstract
Neural Radiance Fields (NeRFs) have shown remarkable novel view synthesis capabilities even in large-scale, unbounded scenes, albeit requiring hundreds of views or introducing artifacts in sparser settings. Their optimization suffers from shape-radiance ambiguities wherever only a small visual overlap is available. This leads to erroneous scene geometry and artifacts. In this paper, we propose Re-Nerfing, a simple and general multi-stage approach that leverages NeRF's own view synthesis to address these limitations. With Re-Nerfing, we increase the scene's coverage and enhance the geometric consistency of novel views as follows: First, we train a NeRF with the available views. Then, we use the optimized NeRF to synthesize pseudo-views next to the original ones to simulate a stereo or trifocal setup. Finally, we train a second NeRF with both original and pseudo views while enforcing structural, epipolar constraints via the newly synthesized images. Extensive experiments on the mip-NeRF 360 dataset show the effectiveness of Re-Nerfing across denser and sparser input scenarios, bringing improvements to the state-of-the-art Zip-NeRF, even when trained with all views.
paper_authors: Yao-Chih Lee, Zhoutong Zhang, Kevin Blackburn-Matzen, Simon Niklaus, Jianming Zhang, Jia-Bin Huang, Feng Liu
for: High-quality novel view synthesis from an in-the-wild video, despite scene dynamics and a lack of parallax.
methods: Explicit video representations that treat static and dynamic content separately: an extended plane-based scene representation, augmented with spherical harmonics and displacement maps, captures view-dependent effects and non-planar complex surface geometry, while dynamic content is represented as per-frame point clouds.
results: The hybrid video representation can be estimated quickly and renders novel views in real time. Experiments show quality comparable to state-of-the-art methods on in-the-wild videos while being 100x faster in training and enabling real-time rendering.
Abstract
Novel view synthesis from an in-the-wild video is difficult due to challenges like scene dynamics and lack of parallax. While existing methods have shown promising results with implicit neural radiance fields, they are slow to train and render. This paper revisits explicit video representations to synthesize high-quality novel views from a monocular video efficiently. We treat static and dynamic video content separately. Specifically, we build a global static scene model using an extended plane-based scene representation to synthesize temporally coherent novel video. Our plane-based scene representation is augmented with spherical harmonics and displacement maps to capture view-dependent effects and model non-planar complex surface geometry. We opt to represent the dynamic content as per-frame point clouds for efficiency. While such representations are inconsistency-prone, minor temporal inconsistencies are perceptually masked due to motion. We develop a method to quickly estimate such a hybrid video representation and render novel views in real time. Our experiments show that our method can render high-quality novel views from an in-the-wild video with comparable quality to state-of-the-art methods while being 100x faster in training and enabling real-time rendering.
GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians
results: GaussianAvatar is validated on both a public dataset and a self-collected dataset, demonstrating superior appearance quality and rendering efficiency compared with other methods.
Abstract
We present GaussianAvatar, an efficient approach to creating realistic human avatars with dynamic 3D appearances from a single video. We start by introducing animatable 3D Gaussians to explicitly represent humans in various poses and clothing styles. Such an explicit and animatable representation can fuse 3D appearances more efficiently and consistently from 2D observations. Our representation is further augmented with dynamic properties to support pose-dependent appearance modeling, where a dynamic appearance network along with an optimizable feature tensor is designed to learn the motion-to-appearance mapping. Moreover, by leveraging the differentiable motion condition, our method enables a joint optimization of motions and appearances during avatar modeling, which helps to tackle the long-standing issue of inaccurate motion estimation in monocular settings. The efficacy of GaussianAvatar is validated on both the public dataset and our collected dataset, demonstrating its superior performances in terms of appearance quality and rendering efficiency.
Style Aligned Image Generation via Shared Attention
results: Experiments show that the method achieves high-quality, faithful image synthesis across diverse styles and text prompts.
Abstract
Large-scale Text-to-Image (T2I) models have rapidly gained prominence across creative fields, generating visually compelling outputs from textual prompts. However, controlling these models to ensure consistent style remains challenging, with existing methods necessitating fine-tuning and manual intervention to disentangle content and style. In this paper, we introduce StyleAligned, a novel technique designed to establish style alignment among a series of generated images. By employing minimal `attention sharing' during the diffusion process, our method maintains style consistency across images within T2I models. This approach allows for the creation of style-consistent images using a reference style through a straightforward inversion operation. Our method's evaluation across diverse styles and text prompts demonstrates high-quality synthesis and fidelity, underscoring its efficacy in achieving consistent style across various inputs.
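A minimal sketch of the attention-sharing idea, assuming standard multi-head attention tensors of shape (batch, heads, tokens, dim): each generated image attends to its own keys and values concatenated with those of a shared reference, which propagates the reference's style statistics across the batch. The shapes and the decision to share full keys and values are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def shared_attention(q, k, v, k_ref, v_ref):
    """Each image attends to its own keys/values concatenated with those
    of a shared reference image, nudging all outputs toward one style."""
    k_all = torch.cat([k, k_ref.expand_as(k)], dim=2)
    v_all = torch.cat([v, v_ref.expand_as(v)], dim=2)
    return F.scaled_dot_product_attention(q, k_all, v_all)

q = torch.randn(4, 8, 64, 32)      # 4 images generated as one batch
k, v = torch.randn_like(q), torch.randn_like(q)
k_ref, v_ref = torch.randn(1, 8, 64, 32), torch.randn(1, 8, 64, 32)
out = shared_attention(q, k, v, k_ref, v_ref)
print(out.shape)  # torch.Size([4, 8, 64, 32])
```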
Can we truly transfer an actor’s genuine happiness to avatars? An investigation into virtual, real, posed and spontaneous faces
results: Action unit (AU) intensities are greater for posed than for spontaneous datasets, regardless of gender. Furthermore, when a real face is transformed into a CG face, intensities are smoothed by up to 80% for AU6 and 45% for AU12.
Abstract
A look is worth a thousand words is a popular phrase. And why is a simple look enough to portray our feelings about something or someone? Behind this question are the theoretical foundations of the field of psychology regarding social cognition and the studies of psychologist Paul Ekman. Facial expressions, as a form of non-verbal communication, are the primary way to transmit emotions between human beings. The set of movements and expressions of facial muscles that convey some emotional state of the individual to their observers are targets of studies in many areas. Our research aims to evaluate Ekman's action units in datasets of real human faces, posed and spontaneous, and virtual human faces resulting from transferring real faces into Computer Graphics faces. In addition, we also conducted a case study with specific movie characters, such as SheHulk and Genius. We intend to find differences and similarities in facial expressions between real and CG datasets, posed and spontaneous faces, and also to consider the actors' genders in the videos. This investigation can help several areas of knowledge, whether using real or virtual human beings, in education, health, entertainment, games, security, and even legal matters. Our results indicate that AU intensities are greater for posed than spontaneous datasets, regardless of gender. Furthermore, there is a smoothing of intensity up to 80 percent for AU6 and 45 percent for AU12 when a real face is transformed into CG.
VerA: Versatile Anonymization Fit for Clinical Facial Images
results: VerA outperforms or is on par with state-of-the-art methods when anonymizing regular images, and also performs well on paired anonymization of both single and paired clinical images.
Abstract
The escalating legislative demand for data privacy in facial image dissemination has underscored the significance of image anonymization. Recent advancements in the field surpass traditional pixelation or blur methods, yet they predominantly address regular single images. This leaves clinical image anonymization -- a necessity for illustrating medical interventions -- largely unaddressed. We present VerA, a versatile facial image anonymization that is fit for clinical facial images where: (1) certain semantic areas must be preserved to show medical intervention results, and (2) anonymizing image pairs is crucial for showing before-and-after results. VerA outperforms or is on par with state-of-the-art methods in de-identification and photorealism for regular images. In addition, we validate our results on paired anonymization, and on the anonymization of both single and paired clinical images with extensive quantitative and qualitative evaluation.
Mathematical Supplement for the $\texttt{gsplat}$ Library
results: Provides a self-contained, user-friendly Python API at github.com/nerfstudio-project/gsplat that exposes each component of the forward and backward passes of rasterization.
Abstract
This report provides the mathematical details of the gsplat library, a modular toolbox for efficient differentiable Gaussian splatting, as proposed by Kerbl et al. It provides a self-contained reference for the computations involved in the forward and backward passes of differentiable Gaussian splatting. To facilitate practical usage and development, we provide a user friendly Python API that exposes each component of the forward and backward passes in rasterization at github.com/nerfstudio-project/gsplat .
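One computation at the heart of differentiable Gaussian splatting is projecting each 3D Gaussian's covariance into screen space through the Jacobian J of the perspective projection, giving the 2D covariance Sigma2d = J Sigma3d J^T (the EWA splatting approximation). The NumPy sketch below illustrates only this step, for a covariance already expressed in camera coordinates; it is not the gsplat library's actual API, whose real entry points should be checked against the repository.

```python
import numpy as np

def project_covariance(Sigma3d, mean_cam, fx, fy):
    """Push a 3D covariance (in camera coordinates) through the Jacobian
    of perspective projection: Sigma2d = J @ Sigma3d @ J.T."""
    x, y, z = mean_cam
    J = np.array([[fx / z, 0.0, -fx * x / z**2],
                  [0.0, fy / z, -fy * y / z**2]])
    return J @ Sigma3d @ J.T

Sigma3d = np.diag([0.01, 0.02, 0.03])
print(project_covariance(Sigma3d, mean_cam=(0.1, -0.2, 2.0), fx=500.0, fy=500.0))
```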
results: GIVT achieves competitive results in class-conditional image generation and outperforms both VQ-GAN and MaskGIT when used for causal modeling. Applied within a VAE-based variant of the UViM framework, GIVT also obtains competitive results on panoptic segmentation and depth estimation.
Abstract
We introduce generative infinite-vocabulary transformers (GIVT) which generate vector sequences with real-valued entries, instead of discrete tokens from a finite vocabulary. To this end, we propose two surprisingly simple modifications to decoder-only transformers: 1) at the input, we replace the finite-vocabulary lookup table with a linear projection of the input vectors; and 2) at the output, we replace the logits prediction (usually mapped to a categorical distribution) with the parameters of a multivariate Gaussian mixture model. Inspired by the image-generation paradigm of VQ-GAN and MaskGIT, where transformers are used to model the discrete latent sequences of a VQ-VAE, we use GIVT to model the unquantized real-valued latent sequences of a VAE. When applying GIVT to class-conditional image generation with iterative masked modeling, we show competitive results with MaskGIT, while our approach outperforms both VQ-GAN and MaskGIT when using it for causal modeling. Finally, we obtain competitive results outside of image generation when applying our approach to panoptic segmentation and depth estimation with a VAE-based variant of the UViM framework.
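The output-side modification is simple enough to sketch. Below is a hedged PyTorch illustration of a head that predicts the parameters of a k-component Gaussian mixture over real-valued tokens, trained with negative log-likelihood; the hidden size, token dimension, and diagonal-covariance choice are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.distributions as D

class GMMHead(nn.Module):
    """Predicts a k-component diagonal Gaussian mixture over d-dimensional
    real-valued tokens, replacing the usual categorical logits."""
    def __init__(self, hidden, d, k=8):
        super().__init__()
        self.d, self.k = d, k
        self.proj = nn.Linear(hidden, k * (1 + 2 * d))  # weight, mean, scale

    def forward(self, h):
        p = self.proj(h)
        logits = p[..., :self.k]                         # mixture weights
        mu, log_s = p[..., self.k:].chunk(2, dim=-1)
        mu = mu.view(*h.shape[:-1], self.k, self.d)
        scale = log_s.view(*h.shape[:-1], self.k, self.d).exp()
        comp = D.Independent(D.Normal(mu, scale), 1)
        return D.MixtureSameFamily(D.Categorical(logits=logits), comp)

head = GMMHead(hidden=256, d=16)
h = torch.randn(4, 10, 256)                 # transformer hidden states
target = torch.randn(4, 10, 16)             # next real-valued tokens
nll = -head(h).log_prob(target).mean()      # training loss
sample = head(h).sample()                   # one autoregressive decoding step
```

The input-side change is symmetric: the finite-vocabulary lookup table is replaced by a linear projection of the real-valued input vectors.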
ArtAdapter: Text-to-Image Style Transfer using Multi-Level Style Encoder and Explicit Adaptation
results: Comprehensive evaluations show that ArtAdapter achieves unprecedented fidelity and flexibility in style transfer compared with current state-of-the-art methods, and its fast finetuning approach further enhances zero-shot style representation.
Abstract
This work introduces ArtAdapter, a transformative text-to-image (T2I) style transfer framework that transcends traditional limitations of color, brushstrokes, and object shape, capturing high-level style elements such as composition and distinctive artistic expression. The integration of a multi-level style encoder with our proposed explicit adaptation mechanism enables ArtAdapter to achieve unprecedented fidelity in style transfer, ensuring close alignment with textual descriptions. Additionally, the incorporation of an Auxiliary Content Adapter (ACA) effectively separates content from style, alleviating the borrowing of content from style references. Moreover, our novel fast finetuning approach could further enhance zero-shot style representation while mitigating the risk of overfitting. Comprehensive evaluations confirm that ArtAdapter surpasses current state-of-the-art methods.
Learning Pseudo-Labeler beyond Noun Concepts for Open-Vocabulary Object Detection
for: This paper focuses on open-vocabulary object detection (OVOD) and aims to directly learn region-text alignment for arbitrary concepts.
methods: The proposed method, called Pseudo-Labeling for Arbitrary Concepts (PLAC), uses a simple yet effective approach to learn arbitrary image-to-text mapping for pseudo-labeling of arbitrary concepts.
results: The proposed method shows competitive performance on the standard OVOD benchmark for noun concepts and a large improvement on the referring expression comprehension benchmark for arbitrary concepts.
Abstract
Open-vocabulary object detection (OVOD) has recently gained significant attention as a crucial step toward achieving human-like visual intelligence. Existing OVOD methods extend target vocabulary from pre-defined categories to open-world by transferring knowledge of arbitrary concepts from vision-language pre-training models to the detectors. While previous methods have shown remarkable successes, they suffer from indirect supervision or limited transferable concepts. In this paper, we propose a simple yet effective method to directly learn region-text alignment for arbitrary concepts. Specifically, the proposed method aims to learn arbitrary image-to-text mapping for pseudo-labeling of arbitrary concepts, named Pseudo-Labeling for Arbitrary Concepts (PLAC). The proposed method shows competitive performance on the standard OVOD benchmark for noun concepts and a large improvement on referring expression comprehension benchmark for arbitrary concepts.
Large Language Models as Consistent Story Visualizers
paper_authors: Xiaoqian Shen, Mohamed Elhoseiny
for: Addressing a key challenge in story visualization: resolving anaphoric references (he, she, they) while keeping characters and backgrounds consistent across frames.
methods: StoryGPT-V, which combines the strengths of latent diffusion models (LDM) and large language models (LLM) to generate images grounded in a given story description with consistent characters and backgrounds.
results: StoryGPT-V reports superior quantitative results on two visual story visualization benchmarks, with low memory consumption and high-quality character generation.
Abstract
Recent generative models have demonstrated impressive capabilities in generating realistic and visually pleasing images grounded on textual prompts. Nevertheless, a significant challenge remains in applying these models for the more intricate task of story visualization. Since it requires resolving pronouns (he, she, they) in the frame descriptions, i.e., anaphora resolution, and ensuring consistent characters and background synthesis across frames. Yet, the emerging Large Language Model (LLM) showcases robust reasoning abilities to navigate through ambiguous references and process extensive sequences. Therefore, we introduce \textbf{StoryGPT-V}, which leverages the merits of the latent diffusion (LDM) and LLM to produce images with consistent and high-quality characters grounded on given story descriptions. First, we train a character-aware LDM, which takes character-augmented semantic embedding as input and includes the supervision of the cross-attention map using character segmentation masks, aiming to enhance character generation accuracy and faithfulness. In the second stage, we enable an alignment between the output of LLM and the character-augmented embedding residing in the input space of the first-stage model. This harnesses the reasoning ability of LLM to address ambiguous references and the comprehension capability to memorize the context. We conduct comprehensive experiments on two visual story visualization benchmarks. Our model reports superior quantitative results and consistently generates accurate characters of remarkable quality with low memory consumption. Our code will be made publicly available.
VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence
paper_authors: Yuchao Gu, Yipin Zhou, Bichen Wu, Licheng Yu, Jia-Wei Liu, Rui Zhao, Jay Zhangjie Wu, David Junhao Zhang, Mike Zheng Shou, Kevin Tang
for: Enabling shape changes in video editing through customized video subject swapping: replacing the main subject in a source video with a target subject of a distinct identity and potentially different shape, using semantic point correspondences.
methods: The VideoSwap framework, which replaces dense correspondences with semantic point correspondences and introduces user-point interactions (e.g., removing and dragging points) to handle various correspondence cases.
results: Extensive experiments show that VideoSwap achieves state-of-the-art video subject swapping and handles shape changes across a variety of real-world videos.
Abstract
Current diffusion-based video editing primarily focuses on structure-preserved editing by utilizing various dense correspondences to ensure temporal consistency and motion alignment. However, these approaches are often ineffective when the target edit involves a shape change. To embark on video editing with shape change, we explore customized video subject swapping in this work, where we aim to replace the main subject in a source video with a target subject having a distinct identity and potentially different shape. In contrast to previous methods that rely on dense correspondences, we introduce the VideoSwap framework that exploits semantic point correspondences, inspired by our observation that only a small number of semantic points are necessary to align the subject's motion trajectory and modify its shape. We also introduce various user-point interactions (\eg, removing points and dragging points) to address various semantic point correspondence. Extensive experiments demonstrate state-of-the-art video subject swapping results across a variety of real-world videos.
GaussianAvatars: Photorealistic Head Avatars with Rigged 3D Gaussians
for: Creating highly controllable, photorealistic head avatar models
methods: Using Gaussian splat models and parametric morphable models for dynamic 3D representation, and achieving precise animation control through expression transfer and parameter adjustment
results: Outstanding performance in multiple challenging scenarios, such as reenacting driving videos, surpassing existing methods.
Abstract
We introduce GaussianAvatars, a new method to create photorealistic head avatars that are fully controllable in terms of expression, pose, and viewpoint. The core idea is a dynamic 3D representation based on 3D Gaussian splats that are rigged to a parametric morphable face model. This combination facilitates photorealistic rendering while allowing for precise animation control via the underlying parametric model, e.g., through expression transfer from a driving sequence or by manually changing the morphable model parameters. We parameterize each splat by a local coordinate frame of a triangle and optimize for explicit displacement offset to obtain a more accurate geometric representation. During avatar reconstruction, we jointly optimize for the morphable model parameters and Gaussian splat parameters in an end-to-end fashion. We demonstrate the animation capabilities of our photorealistic avatar in several challenging scenarios. For instance, we show reenactments from a driving video, where our method outperforms existing works by a significant margin.
DUCK: Distance-based Unlearning via Centroid Kinematics
paper_authors: Marco Cotogni, Jacopo Bonato, Luigi Sabetta, Francesco Pelosin, Alessandro Nicolosi
for: Ensuring privacy in modern artificial intelligence models by eradicating residual influence of specific data subsets.
methods: Distance-based Unlearning via Centroid Kinematics (DUCK) algorithm using metric learning to remove samples matching the nearest incorrect centroid in the embedding space.
results: State-of-the-art performance in class removal and homogeneous sampling removal scenarios, with a novel metric (Adaptive Unlearning Score) to evaluate the unlearning process and a novel membership inference attack to assess the algorithm's capacity to erase previously acquired knowledge.
Abstract
Machine Unlearning is rising as a new field, driven by the pressing necessity of ensuring privacy in modern artificial intelligence models. This technique primarily aims to eradicate any residual influence of a specific subset of data from the knowledge acquired by a neural model during its training. This work introduces a novel unlearning algorithm, denoted as Distance-based Unlearning via Centroid Kinematics (DUCK), which employs metric learning to guide the removal of samples matching the nearest incorrect centroid in the embedding space. Evaluation of the algorithm's performance is conducted across various benchmark datasets in two distinct scenarios, class removal, and homogeneous sampling removal, obtaining state-of-the-art performance. We introduce a novel metric, called Adaptive Unlearning Score (AUS), encompassing not only the efficacy of the unlearning process in forgetting target data but also quantifying the performance loss relative to the original model. Moreover, we propose a novel membership inference attack to assess the algorithm's capacity to erase previously acquired knowledge, designed to be adaptable to future methodologies.
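A minimal sketch of the core objective suggested by the method's name: for each forget-set sample, find the nearest incorrect class centroid in embedding space and pull the embedding toward it. The loss weighting, any retain-set term, and all shapes are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def duck_forget_loss(emb, labels, centroids):
    """emb: (B, D) forget-set embeddings; labels: (B,); centroids: (C, D)."""
    with torch.no_grad():
        dists = torch.cdist(emb, centroids)                   # (B, C)
        dists.scatter_(1, labels.view(-1, 1), float("inf"))   # mask true class
        nearest_wrong = dists.argmin(dim=1)                   # (B,)
    # Pull each embedding toward its nearest *incorrect* centroid.
    return (emb - centroids[nearest_wrong]).pow(2).sum(dim=1).mean()

emb = torch.randn(32, 64, requires_grad=True)
labels = torch.randint(0, 10, (32,))
centroids = torch.randn(10, 64)
duck_forget_loss(emb, labels, centroids).backward()
```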
Implicit Learning of Scene Geometry from Poses for Global Localization
results: The method exceeds the pose accuracy of state-of-the-art regression methods on three common visual localization datasets, with further gains from the additional learning constraints. During inference, the model estimates the 3D scene geometry from a single image in real time and rigidly aligns it with the global coordinate frame to obtain the pose.
Abstract
Global visual localization estimates the absolute pose of a camera using a single image, in a previously mapped area. Obtaining the pose from a single image enables many robotics and augmented/virtual reality applications. Inspired by latest advances in deep learning, many existing approaches directly learn and regress 6 DoF pose from an input image. However, these methods do not fully utilize the underlying scene geometry for pose regression. The challenge in monocular relocalization is the minimal availability of supervised training data, which is just the corresponding 6 DoF poses of the images. In this paper, we propose to utilize these minimal available labels (.i.e, poses) to learn the underlying 3D geometry of the scene and use the geometry to estimate the 6 DoF camera pose. We present a learning method that uses these pose labels and rigid alignment to learn two 3D geometric representations (\textit{X, Y, Z coordinates}) of the scene, one in camera coordinate frame and the other in global coordinate frame. Given a single image, it estimates these two 3D scene representations, which are then aligned to estimate a pose that matches the pose label. This formulation allows for the active inclusion of additional learning constraints to minimize 3D alignment errors between the two 3D scene representations, and 2D re-projection errors between the 3D global scene representation and 2D image pixels, resulting in improved localization accuracy. During inference, our model estimates the 3D scene geometry in camera and global frames and aligns them rigidly to obtain pose in real-time. We evaluate our work on three common visual localization datasets, conduct ablation studies, and show that our method exceeds state-of-the-art regression methods' pose accuracy on all datasets.
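The final pose comes from rigidly aligning the two predicted 3D scene representations (camera frame and global frame). The standard least-squares rigid alignment (Kabsch) commonly used for this kind of step can be sketched as follows; the paper's actual alignment may differ in details such as weighting.

```python
import numpy as np

def rigid_align(P, Q):
    """Least-squares rotation R and translation t with R @ P_i + t ~= Q_i
    (Kabsch algorithm). P, Q: (N, 3) corresponding 3D points."""
    cp, cq = P.mean(0), Q.mean(0)
    H = (P - cp).T @ (Q - cq)                # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cq - R @ cp
    return R, t

P = np.random.randn(100, 3)
R_true = np.linalg.qr(np.random.randn(3, 3))[0]
if np.linalg.det(R_true) < 0:
    R_true = -R_true                         # make it a proper rotation
Q = P @ R_true.T + np.array([0.5, -1.0, 2.0])
R, t = rigid_align(P, Q)
print(np.allclose(R @ P.T + t[:, None], Q.T, atol=1e-6))
```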
paper_authors: Chelsea A. H. Sargeant, Edward G. A. Henderson, Dónal M. McSweeney, Aaron G. Rankin, Denis Page
for: Improving image quality and the accuracy of radiotherapy planning.
methods: A multi-channel input that emphasizes specific image features, plus an auxiliary fusion network to enhance the fidelity of generated sCT images.
results: Effectively addresses some of the challenges inherent in CBCT imaging, whilst restoring the contrast necessary for accurate visualisation of patients' anatomy.
Abstract
Image synthesis is used to generate synthetic CTs (sCTs) from on-treatment cone-beam CTs (CBCTs) with a view to improving image quality and enabling accurate dose computation to facilitate a CBCT-based adaptive radiotherapy workflow. As this area of research gains momentum, developments in sCT generation methods are difficult to compare due to the lack of large public datasets and sizeable variation in training procedures. To compare and assess the latest advancements in sCT generation, the SynthRAD2023 challenge provides a public dataset and evaluation framework for both MR and CBCT to sCT synthesis. Our contribution focuses on the second task, CBCT-to-sCT synthesis. By leveraging a multi-channel input to emphasize specific image features, our approach effectively addresses some of the challenges inherent in CBCT imaging, whilst restoring the contrast necessary for accurate visualisation of patients' anatomy. Additionally, we introduce an auxiliary fusion network to further enhance the fidelity of generated sCT images.
ColonNeRF: Neural Radiance Fields for High-Fidelity Long-Sequence Colonoscopy Reconstruction
paper_authors: Yufei Shi, Beijia Lu, Jia-Wei Liu, Ming Li, Mike Zheng Shou
for: colonoscopy reconstruction for diagnosing colorectal cancer
methods: neural radiance field (NeRF) with region division and integration, multi-level fusion, and DensiNet for dense camera pose guidance
results: outperforms existing methods on two benchmarks over four evaluation metrics, with a substantial increase of about 67%-85% in LPIPS-ALEX scores on the SimCol-to-3D dataset, and clearer textures and more accurate geometric details in reconstruction visualizations.
Abstract
Colonoscopy reconstruction is pivotal for diagnosing colorectal cancer. However, accurate long-sequence colonoscopy reconstruction faces three major challenges: (1) dissimilarity among segments of the colon due to its meandering and convoluted shape; (2) co-existence of simple and intricately folded geometry structures; (3) sparse viewpoints due to constrained camera trajectories. To tackle these challenges, we introduce a new reconstruction framework based on neural radiance field (NeRF), named ColonNeRF, which leverages neural rendering for novel view synthesis of long-sequence colonoscopy. Specifically, to reconstruct the entire colon in a piecewise manner, our ColonNeRF introduces a region division and integration module, effectively reducing shape dissimilarity and ensuring geometric consistency in each segment. To learn both the simple and complex geometry in a unified framework, our ColonNeRF incorporates a multi-level fusion module that progressively models the colon regions from easy to hard. Additionally, to overcome the challenges from sparse views, we devise a DensiNet module for densifying camera poses under the guidance of semantic consistency. We conduct extensive experiments on both synthetic and real-world datasets to evaluate our ColonNeRF. Quantitatively, our ColonNeRF outperforms existing methods on two benchmarks over four evaluation metrics. Notably, our LPIPS-ALEX scores exhibit a substantial increase of about 67%-85% on the SimCol-to-3D dataset. Qualitatively, our reconstruction visualizations show much clearer textures and more accurate geometric details. These sufficiently demonstrate our superior performance over the state-of-the-art methods.
SRTransGAN: Image Super-Resolution using Transformer based Generative Adversarial Network
results: SRTransGAN outperforms existing methods by 4.38% on average in PSNR and SSIM scores, and saliency map analysis is used to examine its learning ability.
Abstract
Image super-resolution aims to synthesize high-resolution image from a low-resolution image. It is an active area to overcome the resolution limitations in several applications like low-resolution object-recognition, medical image enhancement, etc. The generative adversarial network (GAN) based methods have been the state-of-the-art for image super-resolution by utilizing the convolutional neural networks (CNNs) based generator and discriminator networks. However, the CNNs are not able to exploit the global information very effectively in contrast to the transformers, which are the recent breakthrough in deep learning by exploiting the self-attention mechanism. Motivated from the success of transformers in language and vision applications, we propose a SRTransGAN for image super-resolution using transformer based GAN. Specifically, we propose a novel transformer-based encoder-decoder network as a generator to generate 2x images and 4x images. We design the discriminator network using vision transformer which uses the image as sequence of patches and hence useful for binary classification between synthesized and real high-resolution images. The proposed SRTransGAN outperforms the existing methods by 4.38 % on an average of PSNR and SSIM scores. We also analyze the saliency map to understand the learning ability of the proposed method.
Language-only Efficient Training of Zero-shot Composed Image Retrieval
results: LinCIR can be trained in 48 minutes and achieves the best zero-shot CIR performance on four different benchmarks (CIRCO, GeneCIS, FashionIQ, and CIRR), even outperforming a supervised method on FashionIQ.
Abstract
Composed image retrieval (CIR) task takes a composed query of image and text, aiming to search relative images for both conditions. Conventional CIR approaches need a training dataset composed of triplets of query image, query text, and target image, which is very expensive to collect. Several recent works have worked on the zero-shot (ZS) CIR paradigm to tackle the issue without using pre-collected triplets. However, the existing ZS-CIR methods show limited backbone scalability and generalizability due to the lack of diversity of the input texts during training. We propose a novel CIR framework, only using language for its training. Our LinCIR (Language-only training for CIR) can be trained only with text datasets by a novel self-supervision named self-masking projection (SMP). We project the text latent embedding to the token embedding space and construct a new text by replacing the keyword tokens of the original text. Then, we let the new and original texts have the same latent embedding vector. With this simple strategy, LinCIR is surprisingly efficient and highly effective; LinCIR with CLIP ViT-G backbone is trained in 48 minutes and shows the best ZS-CIR performances on four different CIR benchmarks, CIRCO, GeneCIS, FashionIQ, and CIRR, even outperforming supervised method on FashionIQ. Code is available at https://github.com/navervision/lincir
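A toy sketch of self-masking projection (SMP) under strong assumptions: a stand-in text encoder that can consume either token ids or raw token embeddings, a linear projection phi from the sentence latent into token-embedding space, and a random keyword mask. The real method builds on CLIP text encoders; every module here is an illustrative substitute.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyTextEncoder(nn.Module):
    """Toy stand-in for a text encoder that accepts either token ids or
    pre-computed token embeddings."""
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.body = nn.GRU(dim, dim, batch_first=True)

    def forward(self, tokens=None, embeddings=None):
        e = self.emb(tokens) if embeddings is None else embeddings
        _, h = self.body(e)
        return h.squeeze(0)                   # (B, dim) sentence latent

enc = TinyTextEncoder()
phi = nn.Linear(64, 64)                       # sentence latent -> token space
tokens = torch.randint(0, 1000, (8, 12))
keyword_mask = torch.rand(8, 12) < 0.3        # which tokens count as keywords

z = enc(tokens)                               # original sentence latent
e = enc.emb(tokens)
proj = phi(z).unsqueeze(1).expand_as(e)       # projected latent per token slot
e_masked = torch.where(keyword_mask.unsqueeze(-1), proj, e)
z_masked = enc(embeddings=e_masked)           # latent of the masked sentence
loss = F.mse_loss(z_masked, z.detach())       # tie the two latents together
loss.backward()
```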
A Generative Self-Supervised Framework using Functional Connectivity in fMRI Data
paper_authors: Jungwon Choi, Seongho Keum, EungGu Yun, Byung-Hoon Kim, Juho Lee
for: Developing a self-supervised learning method for functional connectivity (FC) that exploits the time-varying properties of FC networks, improving the accuracy and interpretability of model predictions.
methods: Graph Neural Networks (GNN) are used to learn FC networks, and a generative self-supervised learning approach is proposed to harness both the spatial and temporal information within dynamic FC.
results: Experiments on large-scale (>50,000) fMRI datasets show that the approach learns valuable representations and enables accurate and robust models when fine-tuned for downstream tasks.
Abstract
Deep neural networks trained on Functional Connectivity (FC) networks extracted from functional Magnetic Resonance Imaging (fMRI) data have gained popularity due to the increasing availability of data and advances in model architectures, including Graph Neural Network (GNN). Recent research on the application of GNN to FC suggests that exploiting the time-varying properties of the FC could significantly improve the accuracy and interpretability of the model prediction. However, the high cost of acquiring high-quality fMRI data and corresponding phenotypic labels poses a hurdle to their application in real-world settings, such that a model naively trained in a supervised fashion can suffer from insufficient performance or a lack of generalization on a small number of data. In addition, most Self-Supervised Learning (SSL) approaches for GNNs to date adopt a contrastive strategy, which tends to lose appropriate semantic information when the graph structure is perturbed or does not leverage both spatial and temporal information simultaneously. In light of these challenges, we propose a generative SSL approach that is tailored to effectively harness spatio-temporal information within dynamic FC. Our empirical results, experimented with large-scale (>50,000) fMRI datasets, demonstrate that our approach learns valuable representations and enables the construction of accurate and robust models when fine-tuned for downstream tasks.
Bootstrapping SparseFormers from Vision Foundation Models
results: The unimodal SparseFormer bootstrapped from AugReg-ViT-L/16-384 reaches 84.9% accuracy on IN-1K, and the multimodal SparseFormer bootstrapped from CLIPs demonstrates notable zero-shot performance without seeing any labels or captions during bootstrapping. In addition, CLIP-bootstrapped SparseFormers can serve as efficient vision encoders in multimodal large language models.
Abstract
The recently proposed SparseFormer architecture provides an alternative approach to visual understanding by utilizing a significantly lower number of visual tokens via adjusting RoIs, greatly reducing computational costs while still achieving promising performance. However, training SparseFormers from scratch is still expensive, and scaling up the number of parameters can be challenging. In this paper, we propose to bootstrap SparseFormers from ViT-based vision foundation models in a simple and efficient way. Since the majority of SparseFormer blocks are the standard transformer ones, we can inherit weights from large-scale pre-trained vision transformers and freeze them as much as possible. Therefore, we only need to train the SparseFormer-specific lightweight focusing transformer to adjust token RoIs and fine-tune a few early pre-trained blocks to align the final token representation. In such a way, we can bootstrap SparseFormer architectures from various large-scale pre-trained models (e.g., IN-21K pre-trained AugRegs or CLIPs) using a rather smaller amount of training samples (e.g., IN-1K) and without labels or captions within just a few hours. As a result, the bootstrapped unimodal SparseFormer (from AugReg-ViT-L/16-384) can reach 84.9% accuracy on IN-1K with only 49 tokens, and the multimodal SparseFormer from CLIPs also demonstrates notable zero-shot performance with highly reduced computational cost without seeing any caption during the bootstrapping procedure. In addition, CLIP-bootstrapped SparseFormers, which align the output space with language without seeing a word, can serve as efficient vision encoders in multimodal large language models. Code will be publicly available at https://github.com/showlab/sparseformer
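The bootstrap recipe reduces to a familiar PyTorch pattern: copy the standard transformer blocks from a pretrained model, freeze them, and leave only the lightweight focusing module and a few early blocks trainable. A self-contained toy version follows; every module here is a stand-in for the real ViT and SparseFormer architectures.

```python
import torch.nn as nn

dim, depth = 64, 6
# Stand-in for the transformer blocks of a large pretrained ViT.
pretrained_blocks = nn.ModuleList(
    [nn.TransformerEncoderLayer(dim, 4, batch_first=True) for _ in range(depth)]
)

class ToySparseFormer(nn.Module):
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, 4, batch_first=True)
             for _ in range(depth)]
        )
        self.focusing = nn.Linear(dim, 4)  # stand-in RoI-adjusting module

model = ToySparseFormer()
model.blocks.load_state_dict(pretrained_blocks.state_dict())  # inherit weights
for p in model.blocks.parameters():
    p.requires_grad_(False)                 # freeze the inherited blocks
for p in model.blocks[0].parameters():
    p.requires_grad_(True)                  # fine-tune an early block
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(trainable)                            # focusing module + one early block
```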
UniGS: Unified Representation for Image Generation and Segmentation
results: Across multiple tasks, including inpainting, image generation, referring segmentation, and entity segmentation, our method offers the same flexibility and efficiency.
Abstract
This paper introduces a novel unified representation of diffusion models for image generation and segmentation. Specifically, we use a colormap to represent entity-level masks, addressing the challenge of varying entity numbers while aligning the representation closely with the image RGB domain. Two novel modules, including the location-aware color palette and progressive dichotomy module, are proposed to support our mask representation. On the one hand, a location-aware palette guarantees the colors' consistency to entities' locations. On the other hand, the progressive dichotomy module can efficiently decode the synthesized colormap to high-quality entity-level masks in a depth-first binary search without knowing the cluster numbers. To tackle the issue of lacking large-scale segmentation training data, we employ an inpainting pipeline and then improve the flexibility of diffusion models across various tasks, including inpainting, image synthesis, referring segmentation, and entity segmentation. Comprehensive experiments validate the efficiency of our approach, demonstrating comparable segmentation mask quality to state-of-the-art and adaptability to multiple tasks. The code will be released at https://github.com/qqlu/Entity.
Semantics-aware Motion Retargeting with Vision-Language Models
paper_authors: Haodong Zhang, ZhiKe Chen, Haocheng Xu, Lei Hao, Xiaofei Wu, Songcen Xu, Zhensong Zhang, Yue Wang, Rong Xiong
for: motion retargeting between animation characters, capturing and preserving motion semantics
methods: utilizes vision-language models to extract and maintain meaningful motion semantics, incorporates high-level motion semantics into the motion retargeting process, and adopts a two-stage pipeline with skeleton-aware pre-training and fine-tuning
results: produces high-quality motion retargeting results while accurately preserving motion semantics, as demonstrated through experimental results.
Abstract
Capturing and preserving motion semantics is essential to motion retargeting between animation characters. However, most of the previous works neglect the semantic information or rely on human-designed joint-level representations. Here, we present a novel Semantics-aware Motion reTargeting (SMT) method that leverages vision-language models to extract and maintain meaningful motion semantics. We utilize a differentiable module to render 3D motions. Then the high-level motion semantics are incorporated into the motion retargeting process by feeding the vision-language model with the rendered images and aligning the extracted semantic embeddings. To ensure the preservation of fine-grained motion details and high-level semantics, we adopt a two-stage pipeline consisting of skeleton-aware pre-training and fine-tuning with semantics and geometry constraints. Experimental results show the effectiveness of the proposed method in producing high-quality motion retargeting results while accurately preserving motion semantics. The project page can be found at https://sites.google.com/view/smtnet.
Instance-guided Cartoon Editing with a Large-scale Dataset
results: Using a high-quality cartoon-dedicated dataset and an end-to-end learned model, the paper achieves high-resolution character segmentation and supports a range of cartoon editing applications, such as 3D Ken Burns parallax effects, text-guided cartoon style editing, and puppet animation.
Abstract
Cartoon editing, appreciated by both professional illustrators and hobbyists, allows extensive creative freedom and the development of original narratives within the cartoon domain. However, the existing literature on cartoon editing is complex and leans heavily on manual operations, owing to the challenge of automatic identification of individual character instances. Therefore, an automated segmentation of these elements becomes imperative to facilitate a variety of cartoon editing applications such as visual style editing, motion decomposition and transfer, and the computation of stereoscopic depths for an enriched visual experience. Unfortunately, most current segmentation methods are designed for natural photographs and fail to cope with the intricate aesthetics of cartoon subjects, thus lowering segmentation quality. The major challenge stems from two key shortcomings: the rarity of high-quality cartoon-dedicated datasets and the absence of competent models for high-resolution instance extraction on cartoons. To address this, we introduce a high-quality dataset of over 100k paired high-resolution cartoon images and their instance labeling masks. We also present an instance-aware image segmentation model that can generate accurate, high-resolution segmentation masks for characters in cartoon images. We show that the proposed approach enables a range of segmentation-dependent cartoon editing applications like 3D Ken Burns parallax effects, text-guided cartoon style editing, and puppet animation from illustrations and manga.
COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction
methods: proposes a Compact Occupancy TRansformer (COTR) consisting of a geometry-aware occupancy encoder and a semantic-aware group decoder to build a compact 3D occupancy representation.
results: Compared against multiple baselines, COTR clearly outperforms them with relative improvements of 8%-15%, demonstrating the superiority of the method.
Abstract
The autonomous driving community has shown significant interest in 3D occupancy prediction, driven by its exceptional geometric perception and general object recognition capabilities. To achieve this, current works try to construct a Tri-Perspective View (TPV) or Occupancy (OCC) representation extending from the Bird-Eye-View perception. However, compressed views like the TPV representation lose 3D geometry information, while the raw and sparse OCC representation requires heavy but redundant computational costs. To address the above limitations, we propose the Compact Occupancy TRansformer (COTR), with a geometry-aware occupancy encoder and a semantic-aware group decoder to reconstruct a compact 3D OCC representation. The occupancy encoder first generates a compact geometrical OCC feature through efficient explicit-implicit view transformation. Then, the occupancy decoder further enhances the semantic discriminability of the compact OCC representation by a coarse-to-fine semantic grouping strategy. Empirical experiments show evident performance gains across multiple baselines, e.g., COTR outperforms baselines with a relative improvement of 8%-15%, demonstrating the superiority of our method.
A Reliable Representation with Bidirectional Transition Model for Visual Reinforcement Learning Generalization
methods: inspired by the human thought process, proposes the BiT model, which bidirectionally predicts environmental transitions to extract reliable representations
results: achieves competitive generalization and sample efficiency on the DeepMind Control suite, and is further demonstrated on robotic manipulation and the CARLA simulator
Abstract
Visual reinforcement learning has proven effective in solving control tasks with high-dimensional observations. However, extracting reliable and generalizable representations from vision-based observations remains a central challenge. Inspired by the human thought process, we posit that a representation extracted from an observation is reliable and comprehends the environment accurately when it can predict the future and trace history. Based on this concept, we introduce a Bidirectional Transition (BiT) model, which leverages the ability to bidirectionally predict environmental transitions both forward and backward to extract reliable representations. Our model demonstrates competitive generalization performance and sample efficiency on two settings of the DeepMind Control suite. Additionally, we utilize robotic manipulation and CARLA simulators to demonstrate the wide applicability of our method.
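The core idea can be sketched as two symmetric prediction losses over latent transitions. A minimal PyTorch sketch with assumed encoder/transition-model interfaces (loss weights and architectural details are not from the paper):

```python
import torch
import torch.nn.functional as F

def bit_loss(encoder, fwd_model, bwd_model, obs_t, act_t, obs_t1):
    """Bidirectional transition loss sketch.

    encoder, fwd_model, bwd_model are assumed nn.Modules; their exact
    architectures and the paper's loss weighting are not reproduced here.
    """
    z_t, z_t1 = encoder(obs_t), encoder(obs_t1)
    # Forward: predict the next latent from the current latent and action
    loss_fwd = F.mse_loss(fwd_model(z_t, act_t), z_t1.detach())
    # Backward: predict the previous latent from the next latent and action
    loss_bwd = F.mse_loss(bwd_model(z_t1, act_t), z_t.detach())
    return loss_fwd + loss_bwd
```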
Unsupervised Anomaly Detection using Aggregated Normative Diffusion
results: The study shows that existing state-of-the-art anomaly detection methods do not generalize well to realistic multi-modal MR data; we introduce a new anomaly detection method named Aggregated Normative Diffusion (ANDi) that improves detection accuracy across diverse types of anomalies.
Abstract
Early detection of anomalies in medical images such as brain MRI is highly relevant for diagnosis and treatment of many conditions. Supervised machine learning methods are limited to a small number of pathologies where there is good availability of labeled data. In contrast, unsupervised anomaly detection (UAD) has the potential to identify a broader spectrum of anomalies by spotting deviations from normal patterns. Our research demonstrates that existing state-of-the-art UAD approaches do not generalise well to diverse types of anomalies in realistic multi-modal MR data. To overcome this, we introduce a new UAD method named Aggregated Normative Diffusion (ANDi). ANDi operates by aggregating differences between predicted denoising steps and ground truth backwards transitions in Denoising Diffusion Probabilistic Models (DDPMs) that have been trained on pyramidal Gaussian noise. We validate ANDi against three recent UAD baselines, and across three diverse brain MRI datasets. We show that ANDi, in some cases, substantially surpasses these baselines and shows increased robustness to varying types of anomalies. Particularly in detecting multiple sclerosis (MS) lesions, ANDi achieves improvements of up to 178% in terms of AUPRC.
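A rough sketch of the aggregation idea, with a hypothetical DDPM interface (the method names `q_sample` and `predict_prev` are assumptions, not the paper's API):

```python
import torch

@torch.no_grad()
def andi_style_anomaly_map(ddpm, x, timesteps):
    """Aggregate per-step denoising residuals into an anomaly map.

    `ddpm` is assumed to expose q_sample(x, t) (forward noising) and
    predict_prev(x_t, t) (one predicted reverse step). The map is the
    mean absolute difference between the predicted backward transition
    and the ground-truth one, aggregated over timesteps.
    """
    diffs = []
    for t in timesteps:
        x_t = ddpm.q_sample(x, t)                  # noise the input to level t
        x_prev_pred = ddpm.predict_prev(x_t, t)    # predicted backward step
        x_prev_true = ddpm.q_sample(x, t - 1)      # ground-truth backward target
        diffs.append((x_prev_pred - x_prev_true).abs())
    return torch.stack(diffs).mean(dim=0)          # aggregate over steps
```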
Adapting Short-Term Transformers for Action Detection in Untrimmed Videos
results: Experiments show that the new mechanism reaches 69.0 average mAP on THUMOS14, 37.12 average mAP on ActivityNet-1.3, and 17.20 average mAP on FineAction.
Abstract
Vision transformer (ViT) has shown high potential in video recognition, owing to its flexible design, adaptable self-attention mechanisms, and the efficacy of masked pre-training. Yet, it still remains unclear how to adapt these pre-trained short-term ViTs for temporal action detection (TAD) in untrimmed videos. The existing works treat them as off-the-shelf feature extractors for each short trimmed snippet without capturing the fine-grained relation among different snippets in a broader temporal context. To mitigate this issue, this paper focuses on designing a new mechanism for adapting these pre-trained ViT models as a unified long-form video transformer to fully unleash its modeling power in capturing inter-snippet relation, while still keeping low computation overhead and memory consumption for efficient TAD. To this end, we design effective cross-snippet propagation modules to gradually exchange short-term video information among different snippets at two levels. For inner-backbone information propagation, we introduce a cross-snippet propagation strategy to enable multi-snippet temporal feature interaction inside the backbone. For post-backbone information propagation, we propose temporal transformer layers for further clip-level modeling. With the plain ViT-B pre-trained with VideoMAE, our end-to-end temporal action detector (ViT-TAD) yields a very competitive performance compared to previous temporal action detectors, reaching up to 69.0 average mAP on THUMOS14, 37.12 average mAP on ActivityNet-1.3 and 17.20 average mAP on FineAction.
InstructTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models
results: The proposed method shows clear advantages in both targeted attack performance and transferability.
Abstract
Large vision-language models (LVLMs) have demonstrated their incredible capability in image understanding and response generation. However, this rich visual interaction also makes LVLMs vulnerable to adversarial examples. In this paper, we formulate a novel and practical gray-box attack scenario that the adversary can only access the visual encoder of the victim LVLM, without the knowledge of its prompts (which are often proprietary for service providers and not publicly available) and its underlying large language model (LLM). This practical setting poses challenges to the cross-prompt and cross-model transferability of targeted adversarial attack, which aims to confuse the LVLM to output a response that is semantically similar to the attacker's chosen target text. To this end, we propose an instruction-tuned targeted attack (dubbed InstructTA) to deliver the targeted adversarial attack on LVLMs with high transferability. Initially, we utilize a public text-to-image generative model to "reverse" the target response into a target image, and employ GPT-4 to infer a reasonable instruction $\boldsymbol{p}^\prime$ from the target response. We then form a local surrogate model (sharing the same visual encoder with the victim LVLM) to extract instruction-aware features of an adversarial image example and the target image, and minimize the distance between these two features to optimize the adversarial example. To further improve the transferability, we augment the instruction $\boldsymbol{p}^\prime$ with instructions paraphrased from an LLM. Extensive experiments demonstrate the superiority of our proposed method in targeted attack performance and transferability.
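Stripped of the instruction inference and surrogate alignment, the optimization core is a feature-matching attack against the visual encoder. A minimal PyTorch sketch (epsilon, step count, and optimizer are illustrative, not the paper's settings):

```python
import torch
import torch.nn.functional as F

def feature_matching_attack(visual_encoder, x, x_target,
                            eps=8 / 255, steps=100, lr=0.01):
    """Pull the adversarial image's features toward those of a target
    image, as in the gray-box setting where only the visual encoder is
    accessible. Instruction-aware features and GPT-4-inferred
    instructions are omitted in this sketch.
    """
    with torch.no_grad():
        f_target = visual_encoder(x_target)
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        loss = F.mse_loss(visual_encoder(x + delta), f_target)
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():  # keep the perturbation small and the image valid
            delta.clamp_(-eps, eps)
            delta.copy_((x + delta).clamp(0, 1) - x)
    return (x + delta).detach()
```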
FeaInfNet: Diagnosis in Medical Image with Feature-Driven Inference and Visual Explanations
results: On multiple public medical imaging datasets, including RSNA, iChallenge-PM, Covid-19, ChinaCXRSet, and MontgomerySet, our experiments show that the method achieves state-of-the-art classification accuracy and interpretability in medical image diagnosis, outperforming baseline methods. Additional ablation studies verify the effectiveness of each proposed component.
Abstract
Interpretable deep learning models have received widespread attention in the field of image recognition. Due to the unique multi-instance learning of medical images and the difficulty in identifying decision-making regions, many interpretability models that have been proposed still suffer from insufficient accuracy and interpretability in medical image disease diagnosis. To solve these problems, we propose the feature-driven inference network (FeaInfNet). Our first key innovation is a feature-based network reasoning structure, which is applied in FeaInfNet. A network with this structure compares the similarity of each sub-region image patch with the disease templates and normal templates that may appear in the region, and finally combines the comparisons over all sub-regions to make the final diagnosis. It simulates the diagnosis process of doctors to make the model interpretable during reasoning, while avoiding misleading results caused by normal regions participating in the reasoning. Secondly, we propose local feature masks (LFM) to extract feature vectors so as to provide global information for these vectors, thus enhancing the expressive ability of FeaInfNet. Finally, we propose adaptive dynamic masks (Adaptive-DM) to interpret feature vectors and prototypes as human-understandable image patches to provide accurate visual interpretation. We conducted qualitative and quantitative experiments on multiple publicly available medical datasets, including RSNA, iChallenge-PM, Covid-19, ChinaCXRSet, and MontgomerySet. The results of our experiments validate that our method achieves state-of-the-art performance in terms of classification accuracy and interpretability compared to baseline methods in medical image diagnosis. Additional ablation studies verify the effectiveness of each of our proposed components.
Unveiling Objects with SOLA: An Annotation-Free Image Search on the Object Level for Automotive Data Sets
results: Our method reduces time and effort, and evaluations on automotive data sets show good performance.
Abstract
Huge image data sets are the foundation for the development of the perception of automated driving systems. A large number of images is necessary to train robust neural networks that can cope with diverse situations. A sufficiently large data set contains challenging situations and objects. For testing the resulting functions, it is necessary that these situations and objects can be found and extracted from the data set. While it is relatively easy to record a large amount of unlabeled data, it is far more difficult to find demanding situations and objects. However, during the development of perception systems, it must be possible to access challenging data without having to perform lengthy and time-consuming annotations. A developer must therefore be able to search dynamically for specific situations and objects in a data set. Thus, we designed a method based on state-of-the-art neural networks to search for objects with certain properties within an image. For ease of use, the query of this search is described using natural language. To determine the time savings and performance gains, we evaluated our method qualitatively and quantitatively on automotive data sets.
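The search step amounts to ranking object-level embeddings against an embedded text query. A minimal sketch with a generic CLIP-style dual encoder (the interface is an assumption; SOLA's exact backbone is not given here):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def search_objects(image_embs, queries, text_encoder, top_k=50):
    """Rank pre-computed object-level crop embeddings against
    natural-language queries.

    image_embs: (N, D) tensor, one embedding per detected object crop;
    text_encoder: any callable mapping a list of strings to a (Q, D)
    tensor in the same embedding space (an assumed interface).
    """
    text_embs = text_encoder(queries)                    # (Q, D)
    image_embs = F.normalize(image_embs, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    sims = text_embs @ image_embs.T                      # (Q, N) cosine similarities
    return sims.topk(top_k, dim=-1).indices              # best matches per query
```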
Robot Synesthesia: In-Hand Manipulation with Visuotactile Sensing
results: The method is trained in simulation and deployed on a real robot; detailed ablations show that fusing visual and tactile feedback improves reinforcement learning and Sim2Real performance across a variety of in-hand object rotation tasks.
Abstract
Executing contact-rich manipulation tasks necessitates the fusion of tactile and visual feedback. However, the distinct nature of these modalities poses significant challenges. In this paper, we introduce a system that leverages visual and tactile sensory inputs to enable dexterous in-hand manipulation. Specifically, we propose Robot Synesthesia, a novel point cloud-based tactile representation inspired by human tactile-visual synesthesia. This approach allows for the simultaneous and seamless integration of both sensory inputs, offering richer spatial information and facilitating better reasoning about robot actions. The method, trained in a simulated environment and then deployed to a real robot, is applicable to various in-hand object rotation tasks. Comprehensive ablations are performed on how the integration of vision and touch can improve reinforcement learning and Sim2Real performance. Our project page is available at https://yingyuan0414.github.io/visuotactile/ .
Generalization by Adaptation: Diffusion-Based Domain Extension for Domain-Generalized Semantic Segmentation
results: Experiments show that training a model toward the pseudo-target domains generated by DIDEX substantially improves generalization without using any real data. For generalization from GTA5 and SYNTHIA, the paper improves mIoU by 3.8% and 11.8% absolute on average over previous methods. Code is available at https://github.com/JNiemeijer/DIDEX.
Abstract
When models, e.g., for semantic segmentation, are applied to images that are vastly different from training data, the performance will drop significantly. Domain adaptation methods try to overcome this issue, but need samples from the target domain. However, this might not always be feasible for various reasons and therefore domain generalization methods are useful as they do not require any target data. We present a new diffusion-based domain extension (DIDEX) method and employ a diffusion model to generate a pseudo-target domain with diverse text prompts. In contrast to existing methods, this allows to control the style and content of the generated images and to introduce a high diversity. In a second step, we train a generalizing model by adapting towards this pseudo-target domain. We outperform previous approaches by a large margin across various datasets and architectures without using any real data. For the generalization from GTA5, we improve state-of-the-art mIoU performance by 3.8% absolute on average and for SYNTHIA by 11.8% absolute, marking a big step for the generalization performance on these benchmarks. Code is available at https://github.com/JNiemeijer/DIDEX
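In spirit, the pseudo-target domain can be produced with any off-the-shelf text-guided image-to-image diffusion pipeline. A hedged sketch using Hugging Face diffusers (the checkpoint, prompts, and strength value are illustrative assumptions, not the paper's setup):

```python
from diffusers import StableDiffusionImg2ImgPipeline

# Any Stable Diffusion checkpoint works here; this one is a placeholder choice.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1")

prompts = ["a photo of a city street at night, rain",
           "a photo of a city street in heavy fog",
           "a photo of a snowy city street"]

def extend_domain(source_image):
    """Generate diverse stylized variants of one source-domain image
    (a PIL image), forming a small pseudo-target domain."""
    return [pipe(prompt=p, image=source_image, strength=0.5).images[0]
            for p in prompts]
```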
Geometrically-driven Aggregation for Zero-shot 3D Point Cloud Understanding
paper_authors: Guofeng Mei, Luigi Riz, Yiming Wang, Fabio Poiesi
for: Zero-shot 3D point cloud understanding via 2D Vision-Language Models (VLMs).
methods: Introduces the first training-free aggregation technique that leverages the point cloud's 3D geometric structure to improve the quality of the transferred VLMs; the approach operates iteratively, performing local-to-global aggregation based on geometric and semantic point-level reasoning.
results: Achieves new state-of-the-art results in all benchmarks, including classification, part segmentation, and semantic segmentation, with a variety of datasets representing both synthetic/real-world and indoor/outdoor scenarios.
Abstract
Zero-shot 3D point cloud understanding can be achieved via 2D Vision-Language Models (VLMs). Existing strategies directly map Vision-Language Models from 2D pixels of rendered or captured views to 3D points, overlooking the inherent and expressible point cloud geometric structure. Geometrically similar or close regions can be exploited for bolstering point cloud understanding as they are likely to share semantic information. To this end, we introduce the first training-free aggregation technique that leverages the point cloud's 3D geometric structure to improve the quality of the transferred Vision-Language Models. Our approach operates iteratively, performing local-to-global aggregation based on geometric and semantic point-level reasoning. We benchmark our approach on three downstream tasks, including classification, part segmentation, and semantic segmentation, with a variety of datasets representing both synthetic/real-world, and indoor/outdoor scenarios. Our approach achieves new state-of-the-art results in all benchmarks. We will release the source code publicly.
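A much-simplified stand-in for the local-to-global aggregation: repeatedly average each point's transferred feature over its geometric neighborhood (the paper's scheme also uses semantic point-level reasoning, omitted here):

```python
import numpy as np
from scipy.spatial import cKDTree

def geometric_aggregation(points, feats, k=16, rounds=3):
    """Training-free local feature aggregation over a point cloud.

    points: (N, 3) xyz coordinates; feats: (N, D) per-point features
    transferred from a 2D VLM. Each round replaces a point's feature
    with the mean over its k nearest geometric neighbors, so repeated
    rounds propagate information from local to increasingly global scope.
    k and the number of rounds are illustrative assumptions.
    """
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)        # (N, k) neighbor indices
    for _ in range(rounds):
        feats = feats[idx].mean(axis=1)     # average over neighborhoods
    return feats
```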
VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior
results: Compared with previous models, the proposed VividTalk generates talking head videos of high visual quality with expressive facial expressions and natural head motions, showing clear improvements in comparisons.
Abstract
Audio-driven talking head generation has drawn much attention in recent years, and many efforts have been made in lip-sync, expressive facial expressions, natural head pose generation, and high video quality. However, no model has yet led or tied on all these metrics due to the one-to-many mapping between audio and motion. In this paper, we propose VividTalk, a two-stage generic framework that supports generating high-visual-quality talking head videos with all the above properties. Specifically, in the first stage, we map the audio to a mesh by learning two motions, including non-rigid expression motion and rigid head motion. For expression motion, both blendshapes and vertices are adopted as the intermediate representation to maximize the representation ability of the model. For natural head motion, a novel learnable head pose codebook with a two-phase training mechanism is proposed. In the second stage, we propose a dual-branch motion-VAE and a generator to transform the meshes into dense motion and synthesize high-quality video frame by frame. Extensive experiments show that the proposed VividTalk can generate high-visual-quality talking head videos with lip-sync and realism enhanced by a large margin, and it outperforms previous state-of-the-art works in objective and subjective comparisons.
Few Clicks Suffice: Active Test-Time Adaptation for Semantic Segmentation
methods: adopts an active learning perspective, introducing a human-in-the-loop pattern during the testing phase that adapts the model to the test data to improve performance.
results: Experiments show that, with extremely few annotations, the method achieves higher mean IoU than comparable approaches; on the ACDC benchmark, even one click of labeling surpasses known SOTA TTA methods in average mIoU.
Abstract
Test-time adaptation (TTA) adapts pre-trained models during inference using unlabeled test data and has received a lot of research attention due to its potential practical value. Unfortunately, without any label supervision, existing TTA methods rely heavily on heuristic or empirical studies: decisions about where to update the model are often suboptimal or incur extra computational cost. Meanwhile, there is still a significant performance gap between TTA approaches and their supervised counterparts. Motivated by active learning, in this work we propose the active test-time adaptation setup for semantic segmentation. Specifically, we introduce a human-in-the-loop pattern during the testing phase, which queries very few labels to facilitate predictions and model updates in an online manner. To do so, we propose a simple but effective ATASeg framework, which consists of two parts, i.e., a model adapter and a label annotator. Extensive experiments demonstrate that ATASeg bridges the performance gap between TTA methods and their supervised counterparts with only extremely few annotations; even one click for labeling surpasses known SOTA TTA methods by 2.6% average mIoU on the ACDC benchmark. Empirical results imply that progress in either the model adapter or the label annotator will bring improvements to the ATASeg framework, giving it large research and practical potential.
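A plausible instantiation of the label annotator's query step is to ask for clicks on the most uncertain pixels. A minimal sketch (the paper's actual selection criterion may differ):

```python
import torch

def select_pixels_to_annotate(logits, budget=1):
    """Pick the pixels whose predictions are most uncertain as
    annotation queries.

    logits: (C, H, W) per-pixel class scores for one test image.
    Returns flat indices of the `budget` highest-entropy pixels.
    """
    probs = torch.softmax(logits, dim=0)
    entropy = -(probs * torch.log(probs.clamp_min(1e-8))).sum(dim=0)  # (H, W)
    return entropy.flatten().topk(budget).indices
```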
results: Enforcing equivariance to certain groups of transformations (rotations, reflections, and/or translations) on the denoiser improves both the stability of the algorithm and the quality of the reconstruction, and a simple algorithm is provided to achieve this.
Abstract
Plug-and-play algorithms constitute a popular framework for solving inverse imaging problems that rely on the implicit definition of an image prior via a denoiser. These algorithms can leverage powerful pre-trained denoisers to solve a wide range of imaging tasks, circumventing the necessity to train models on a per-task basis. Unfortunately, plug-and-play methods often show unstable behaviors, hampering their promise of versatility and leading to suboptimal quality of reconstructed images. In this work, we show that enforcing equivariance to certain groups of transformations (rotations, reflections, and/or translations) on the denoiser strongly improves the stability of the algorithm as well as its reconstruction quality. We provide a theoretical analysis that illustrates the role of equivariance on better performance and stability. We present a simple algorithm that enforces equivariance on any existing denoiser by simply applying a random transformation to the input of the denoiser and the inverse transformation to the output at each iteration of the algorithm. Experiments on multiple imaging modalities and denoising networks show that the equivariant plug-and-play algorithm improves both the reconstruction performance and the stability compared to their non-equivariant counterparts.
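The simple algorithm described above can be written as a thin wrapper around any denoiser; here is a minimal PyTorch sketch for the group of 90-degree rotations and horizontal flips:

```python
import random
import torch

def equivariant_denoise(denoiser, x):
    """One equivariant denoising step: draw a random transform from the
    group, apply it to the input, denoise, then apply the inverse
    transform to the output. x: (B, C, H, W) tensor.
    """
    k = random.randrange(4)            # rotation by a multiple of 90 degrees
    flip = random.random() < 0.5
    t = torch.rot90(x, k, dims=(-2, -1))
    if flip:
        t = torch.flip(t, dims=(-1,))
    out = denoiser(t)
    if flip:                           # invert the transforms in reverse order
        out = torch.flip(out, dims=(-1,))
    return torch.rot90(out, -k, dims=(-2, -1))
```

Drawing a fresh transform at every iteration of the plug-and-play loop approximates averaging the denoiser over the group, which is what enforces the equivariance.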
results: The authors propose a new evaluation protocol for the CNP task and validate it on a new dataset of painted objects. Results show that the method produces artistic and plausible paintings and supports editable, reproducible creation across different painting styles and subjects.
Abstract
The process of painting fosters creativity and rational planning. However, existing generative AI mostly focuses on producing visually pleasant artworks, without emphasizing the painting process. We introduce a novel task, Collaborative Neural Painting (CNP), to facilitate collaborative art painting generation between humans and machines. Given any number of user-input brushstrokes as the context or just the desired object class, CNP should produce a sequence of strokes supporting the completion of a coherent painting. Importantly, the process can be gradual and iterative, so allowing users' modifications at any phase until the completion. Moreover, we propose to solve this task using a painting representation based on a sequence of parametrized strokes, which makes it easy both editing and composition operations. These parametrized strokes are processed by a Transformer-based architecture with a novel attention mechanism to model the relationship between the input strokes and the strokes to complete. We also propose a new masking scheme to reflect the interactive nature of CNP and adopt diffusion models as the basic learning process for its effectiveness and diversity in the generative field. Finally, to develop and validate methods on the novel task, we introduce a new dataset of painted objects and an evaluation protocol to benchmark CNP both quantitatively and qualitatively. We demonstrate the effectiveness of our approach and the potential of the CNP task as a promising avenue for future research.
Exploring Multi-Modal Fusion for Image Manipulation Detection and Localization
results: The study finds that different filters excel at detecting different types of manipulations, and fusing their outputs improves detection and localization accuracy. Both late fusion and early fusion achieve competitive performance, outperforming state-of-the-art models across several datasets.
Abstract
Recent image manipulation localization and detection techniques usually leverage forensic artifacts and traces that are produced by a noise-sensitive filter, such as SRM and Bayar convolution. In this paper, we showcase that different filters commonly used in such approaches excel at unveiling different types of manipulations and provide complementary forensic traces. Thus, we explore ways of merging the outputs of such filters and aim to leverage the complementary nature of the artifacts produced to perform image manipulation localization and detection (IMLD). We propose two distinct methods: one that produces independent features from each forensic filter and then fuses them (this is referred to as late fusion) and one that performs early mixing of different modal outputs and produces early combined features (this is referred to as early fusion). We demonstrate that both approaches achieve competitive performance for both image manipulation localization and detection, outperforming state-of-the-art models across several datasets.
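As a rough illustration of the ingredients, a fixed high-pass residual filter plus the two fusion options might look like this (the kernel shown is one commonly cited member of the SRM family, not the paper's full SRM and Bayar filter banks):

```python
import torch
import torch.nn.functional as F

# One commonly used 3x3 high-pass residual kernel from the SRM family.
KV_KERNEL = torch.tensor([[-1.,  2., -1.],
                          [ 2., -4.,  2.],
                          [-1.,  2., -1.]]) / 4.0

def noise_residual(x):
    """Per-channel high-pass residual of an image batch (B, C, H, W)."""
    c = x.shape[1]
    w = KV_KERNEL.to(x).expand(c, 1, 3, 3)           # one kernel per channel
    return F.conv2d(x, w, padding=1, groups=c)       # depthwise filtering

# Late fusion: run separate branches on x and noise_residual(x), then merge
# their predictions. Early fusion: concatenate the residual with the RGB
# input, torch.cat([x, noise_residual(x)], dim=1), and feed a single network.
```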
Two-stage optimized unified adversarial patch for attacking visible-infrared cross-modal detectors in the physical world
results: Experiments show that the proposed attack effectively fools cross-modal detectors in both digital and physical environments, outperforming the baseline.
Abstract
Currently, many studies have addressed security concerns related to visible and infrared detectors independently. In practical scenarios, utilizing cross-modal detectors for tasks proves more reliable than relying on single-modal detectors. Despite this, there is a lack of comprehensive security evaluations for cross-modal detectors. While existing research has explored the feasibility of attacks against cross-modal detectors, the implementation of a robust attack remains unaddressed. This work introduces the Two-stage Optimized Unified Adversarial Patch (TOUAP) designed for performing attacks against visible-infrared cross-modal detectors in real-world, black-box settings. The TOUAP employs a two-stage optimization process: firstly, PSO optimizes an irregular polygonal infrared patch to attack the infrared detector; secondly, the color QR code is optimized, and the shape information of the infrared patch from the first stage is used as a mask. The resulting irregular polygon visible modal patch executes an attack on the visible detector. Through extensive experiments conducted in both digital and physical environments, we validate the effectiveness and robustness of the proposed method. As the TOUAP surpasses baseline performance, we advocate for its widespread attention.
IMProv: Inpainting-based Multimodal Prompting for Computer Vision Tasks
results: Text conditioning and scaling up the dataset size improve in-context learning for computer vision tasks: over +10% AP for Foreground Segmentation, over +5% AP for Single Object Detection, and almost 20% lower LPIPS for Colorization. These results suggest that vision and language prompts are complementary and that using both achieves better in-context learning performance.
Abstract
In-context learning allows adapting a model to new tasks given a task description at test time. In this paper, we present IMProv - a generative model that is able to in-context learn visual tasks from multimodal prompts. Given a textual description of a visual task (e.g. "Left: input image, Right: foreground segmentation"), a few input-output visual examples, or both, the model in-context learns to solve it for a new test input. We train a masked generative transformer on a new dataset of figures from computer vision papers and their associated captions, together with a captioned large-scale image-text dataset. During inference time, we prompt the model with text and/or image task example(s) and have the model inpaint the corresponding output. We show that training our model with text conditioning and scaling the dataset size improves in-context learning for computer vision tasks by over +10% AP for Foreground Segmentation, over +5% gains in AP for Single Object Detection, and almost 20% lower LPIPS in Colorization. Our empirical results suggest that vision and language prompts are complementary and it is advantageous to use both to achieve better in-context learning performance. Project page is available at https://jerryxu.net/IMProv.
Few-Shot Anomaly Detection with Adversarial Loss for Robust Feature Representations
results: Experiments show that the proposed method generally achieves better performance on few-shot anomaly detection tasks, particularly when the adversarial training loss is applied.
Abstract
Anomaly detection is a critical and challenging task that aims to identify data points deviating from normal patterns and distributions within a dataset. Various methods have been proposed using a one-class-one-model approach, but these techniques often face practical problems such as memory inefficiency and the requirement of sufficient data for training. In particular, few-shot anomaly detection presents significant challenges in industrial applications, where limited samples are available before mass production. In this paper, we propose a few-shot anomaly detection method that integrates adversarial training loss to obtain more robust and generalized feature representations. We utilize the adversarial loss previously employed in domain adaptation to align feature distributions between source and target domains, to enhance feature robustness and generalization in few-shot anomaly detection tasks. We hypothesize that adversarial loss is effective when applied to features that should have similar characteristics, such as those from the same layer in a Siamese network's parallel branches or input-output pairs of reconstruction-based methods. Experimental results demonstrate that the proposed method generally achieves better performance when utilizing the adversarial loss.
Localizing and Assessing Node Significance in Default Mode Network using Sub-Community Detection in Mild Cognitive Impairment
results: After computing the NSS scores, the researchers find a score disparity exceeding 20% for 10 DMN nodes, peaking at 45.69% and 43.08%. This aligns with existing medical literature while providing a quantitative measure that enables ranking of the affected nodes; the findings may offer valuable guidance for diagnosis and treatment.
Abstract
Our study aims to utilize fMRI to identify the affected brain regions within the Default Mode Network (DMN) in subjects with Mild Cognitive Impairment (MCI), using a novel Node Significance Score (NSS). We construct subject-specific DMN graphs by employing partial correlation of Regions of Interest (ROIs) that make-up the DMN. For the DMN graph, ROIs are the nodes and edges are determined based on partial correlation. Four popular community detection algorithms (Clique Percolation Method (CPM), Louvain algorithm, Greedy Modularity and Leading Eigenvectors) are applied to determine the largest sub-community. NSS ratings are derived for each node, considering (I) frequency in the largest sub-community within a class across all subjects and (II) occurrence in the largest sub-community according to all four methods. After computing the NSS of each ROI in both healthy and MCI subjects, we quantify the score disparity to identify nodes most impacted by MCI. The results reveal a disparity exceeding 20% for 10 DMN nodes, maximally for PCC and Fusiform, showing 45.69% and 43.08% disparity. This aligns with existing medical literature, additionally providing a quantitative measure that enables the ordering of the affected ROIs. These findings offer valuable insights and could lead to treatment strategies aggressively targeting the affected nodes.
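One subject-level step of the pipeline, sketched with networkx and a single community detection method (the paper aggregates four methods across all subjects to compute NSS; the edge threshold here is an assumption):

```python
import networkx as nx
from networkx.algorithms import community

def largest_subcommunity_members(partial_corr, roi_names, threshold=0.3):
    """Build a DMN graph from a partial-correlation matrix, then return
    the member ROIs of the largest sub-community found with greedy
    modularity. ROIs are nodes; edges connect ROI pairs whose partial
    correlation magnitude exceeds the threshold.
    """
    G = nx.Graph()
    G.add_nodes_from(roi_names)
    n = len(roi_names)
    for i in range(n):
        for j in range(i + 1, n):
            if abs(partial_corr[i, j]) >= threshold:   # keep strong edges only
                G.add_edge(roi_names[i], roi_names[j])
    comms = community.greedy_modularity_communities(G)
    return max(comms, key=len)

# NSS for an ROI is then the fraction of subjects (and of the four
# community detection methods) in which it appears in this largest
# sub-community.
```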
Dynamic Erasing Network Based on Multi-Scale Temporal Features for Weakly Supervised Video Anomaly Detection
results: Our method obtains favorable performance compared to several existing approaches on three datasets: XD-Violence, TAD, and UCF-Crime.
Abstract
The goal of weakly supervised video anomaly detection is to learn a detection model using only video-level labeled data. However, prior studies typically divide videos into fixed-length segments without considering the complexity or duration of anomalies. Moreover, these studies usually just detect the most abnormal segments, potentially overlooking the completeness of anomalies. To address these limitations, we propose a Dynamic Erasing Network (DE-Net) for weakly supervised video anomaly detection, which learns multi-scale temporal features. Specifically, to handle duration variations of abnormal events, we first propose a multi-scale temporal modeling module, capable of extracting features from segments of varying lengths and capturing both local and global visual information across different temporal scales. Then, we design a dynamic erasing strategy, which dynamically assesses the completeness of the detected anomalies and erases prominent abnormal segments in order to encourage the model to discover gentle abnormal segments in a video. The proposed method obtains favorable performance compared to several state-of-the-art approaches on three datasets: XD-Violence, TAD, and UCF-Crime. Code will be made available at https://github.com/ArielZc/DE-Net.
Light Field Imaging in the Restrictive Object Space based on Flexible Angular Plane
results: The paper designs and calibrates a ROS-LF simulated system, verifying that light field image distortion in the restrictive object space can be resolved with the flexible angular plane and the microlens image non-distortion principle.
Abstract
In some applications, the object space of the light field imaging system is restrictive, such as industrial and medical endoscopes. If a traditional light field imaging system is used directly in the restrictive object space (ROS) without any specific considerations, the ROS will lead to severe microlens image distortions and then affect light field decoding, calibration and 3D reconstruction. Light field imaging in the restrictive object space (ROS-LF) is complicated but significant. In this paper, we first deduce that the microlens image deviation is caused by the position variation of the angular plane, and then we propose a flexible angular plane for ROS-LF, whereas in the traditional light field the angular plane always coincides with the main lens plane. Subsequently, we propose the microlens image non-distortion principle for ROS-LF and introduce the ROS-LF imaging principle. We demonstrate that the difference between the ROS-LF and the traditional light field imaging model is an aperture constant term. Finally, we design a ROS-LF simulated system and calibrate it to verify the principles proposed in this paper.
Long-Tail Learning with Rebalanced Contrastive Loss
results: Improves long-tail classification accuracy by addressing three main aspects: 1. feature space balancedness, 2. intra-class compactness, and 3. regularization. Experiments show that RCL provides richer feature representations and improves the top-1 balanced accuracy of the BCL framework; as a standalone loss, RCL also achieves state-of-the-art-level accuracy.
Abstract
Integrating a supervised contrastive loss into cross-entropy-based training has recently been proposed as a solution to the long-tail learning problem. However, when the class imbalance ratio is high, the supervised contrastive loss needs to be adjusted to support the tail classes, as conventional contrastive learning is biased towards head classes by default. To this end, we present Rebalanced Contrastive Learning (RCL), an efficient means to increase long-tail classification accuracy by addressing three main aspects: 1. Feature space balancedness - equal division of the feature space among all the classes, 2. Intra-class compactness - reducing the distance between same-class embeddings, 3. Regularization - enforcing larger margins for tail classes to reduce overfitting. RCL applies class frequency-based SoftMax loss balancing to the supervised contrastive learning loss and exploits scalar-multiplied features fed to the contrastive learning loss to enforce compactness. We implement RCL on the Balanced Contrastive Learning (BCL) framework, which has SOTA performance. Our experiments on three benchmark datasets demonstrate the richness of the learnt embeddings and the increased top-1 balanced accuracy RCL provides to the BCL framework. We further show that the performance of RCL as a standalone loss also achieves state-of-the-art-level accuracy.
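The frequency-based balancing can be illustrated on the cross-entropy side as a logit adjustment by log class priors; a minimal PyTorch sketch (the contrastive branch and the scalar-multiplied features are omitted):

```python
import torch
import torch.nn.functional as F

class FrequencyBalancedCE(torch.nn.Module):
    """Cross-entropy with class-frequency balancing: logits are shifted
    by the log class priors so that head classes no longer dominate the
    loss. This mirrors the frequency-based SoftMax balancing RCL applies
    to the supervised contrastive loss.
    """
    def __init__(self, class_counts):
        super().__init__()
        prior = torch.as_tensor(class_counts, dtype=torch.float)
        self.register_buffer("log_prior", torch.log(prior / prior.sum()))

    def forward(self, logits, targets):
        # Adding log priors penalizes over-confident head-class logits
        return F.cross_entropy(logits + self.log_prior, targets)

# loss_fn = FrequencyBalancedCE(class_counts=[5000, 500, 50])
```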
Open-DDVM: A Reproduction and Extension of Diffusion Model for Optical Flow Estimation
results: Trained on 40k public data with 4 GPUs, the reproduction achieves performance comparable to the closed-source DDVM. Code and models have been released at https://github.com/DQiaole/FlowDiffusion_pytorch.
Abstract
Recently, Google proposed DDVM, which for the first time demonstrated that a general diffusion model for image-to-image translation tasks works impressively well on optical flow estimation without any task-specific designs like RAFT. However, DDVM is still a closed-source model with expensive and private Palette-style pretraining. In this technical report, we present the first open-source DDVM by reproducing it. We study several design choices and identify the important ones. By training on 40k public data with 4 GPUs, our reproduction achieves performance comparable to the closed-source DDVM. The code and model have been released at https://github.com/DQiaole/FlowDiffusion_pytorch.
Cross-Modal Adaptive Dual Association for Text-to-Image Person Retrieval
paper_authors: Dixuan Lin, Yixing Peng, Jingke Meng, Wei-Shi Zheng
for: building associations between textual descriptions and images to achieve text-to-image person re-identification.
methods: proposes a Cross-Modal Adaptive Dual Association (CADA) method, consisting of Association of text Tokens to image Patches (ATP) and Association of image Regions to text Attributes (ARA).
results: Experiments show that CADA accurately builds bidirectional associations between images and texts and outperforms existing methods.
Abstract
Text-to-image person re-identification (ReID) aims to retrieve images of a person based on a given textual description. The key challenge is to learn the relations between detailed information from visual and textual modalities. Existing works focus on learning a latent space to narrow the modality gap and further build local correspondences between two modalities. However, these methods assume that image-to-text and text-to-image associations are modality-agnostic, resulting in suboptimal associations. In this work, we show the discrepancy between image-to-text association and text-to-image association and propose CADA: Cross-Modal Adaptive Dual Association that finely builds bidirectional image-text detailed associations. Our approach features a decoder-based adaptive dual association module that enables full interaction between visual and textual modalities, allowing for bidirectional and adaptive cross-modal correspondence associations. Specifically, the paper proposes a bidirectional association mechanism: Association of text Tokens to image Patches (ATP) and Association of image Regions to text Attributes (ARA). We adaptively model the ATP based on the fact that aggregating cross-modal features based on mistaken associations will lead to feature distortion. For modeling the ARA, since the attributes are typically the first distinguishing cues of a person, we propose to explore the attribute-level association by predicting the masked text phrase using the related image region. Finally, we learn the dual associations between texts and images, and the experimental results demonstrate the superiority of our dual formulation. Codes will be made publicly available.
Singular Regularization with Information Bottleneck Improves Model’s Adversarial Robustness
results: Evaluated with two popular model structures on two mainstream datasets against various adversarial attacks, the method significantly improves robust accuracy. It requires only a few additional parameters and can be explained under regional faithfulness analysis.
Abstract
Adversarial examples are one of the most severe threats to deep learning models. Numerous works have been proposed to study and defend against adversarial examples. However, these works lack an analysis of the adversarial information or perturbation itself, and thus fail to reveal the nature of adversarial examples or offer a proper interpretation. In this paper, we aim to fill this gap by studying adversarial information as unstructured noise, which does not have a clear pattern. Specifically, we provide empirical studies with singular value decomposition, decomposing images into several matrices to analyze the adversarial information of different attacks. Based on this analysis, we propose a new module to regularize adversarial information, combined with information bottleneck theory, which theoretically restricts intermediate representations. Our method is therefore interpretable. Moreover, our design follows a principle that is general and unified. Equipped with our new module, we evaluate two popular model structures on two mainstream datasets with various adversarial attacks. The results indicate that the improvement in robust accuracy is significant. On the other hand, we prove that our method is efficient, adding only a few extra parameters, and can be explained under regional faithfulness analysis.
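The SVD analysis reduces to comparing the singular-value spectra of clean and adversarial images. A minimal numpy sketch (the tail cutoff at index 50 is an illustrative assumption):

```python
import numpy as np

def singular_spectrum(img_gray):
    """Return the normalized singular-value spectrum of a grayscale
    image. Unstructured adversarial noise tends to spread energy into
    the small singular values, so comparing spectra of a clean image
    and its adversarial counterpart exposes the perturbation.
    """
    s = np.linalg.svd(img_gray.astype(np.float64), compute_uv=False)
    return s / s.sum()

# spectrum_clean = singular_spectrum(x)
# spectrum_adv = singular_spectrum(x_adv)
# tail_shift = spectrum_adv[50:].sum() - spectrum_clean[50:].sum()
```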
results: The method outperforms the previous fully spiking generative model while enabling high-speed, low-energy generation.
Abstract
Spiking neural networks (SNNs) have garnered considerable attention owing to their ability to run on neuromorphic devices with super-high speeds and remarkable energy efficiency. SNNs can replace conventional neural networks in time- and energy-consuming applications. However, research on generative models within SNNs remains limited, despite their advantages. In particular, diffusion models are a powerful class of generative models whose image generation quality surpasses that of other generative models, such as GANs. However, diffusion models are characterized by high computational costs and long inference times owing to their iterative denoising procedure. Therefore, we propose a novel approach, the Fully Spiking Denoising Diffusion Implicit Model (FSDDIM), to construct a diffusion model within SNNs and leverage their high speed and low energy consumption via synaptic current learning (SCL). SCL fills the gap that arises because diffusion models use a neural network to estimate real-valued parameters of a predefined probabilistic distribution, whereas SNNs output binary spike trains. SCL enables us to complete the entire generative process of diffusion models exclusively using SNNs. We demonstrate that the proposed method outperforms the state-of-the-art fully spiking generative model.
SRSNetwork: Siamese Reconstruction-Segmentation Networks based on Dynamic-Parameter Convolution
results: On seven datasets, including five medical datasets and two infrared image datasets, our SRSNet consistently achieves the best segmentation results. Abstract
In this paper, we present a high-performance deep neural network for weak target image segmentation, including medical image segmentation and infrared image segmentation. To this end, this work analyzes the existing dynamic convolutions and proposes dynamic parameter convolution (DPConv). Furthermore, it reevaluates the relationship between reconstruction tasks and segmentation tasks from the perspective of DPConv, leading to the proposal of a dual-network model called the Siamese Reconstruction-Segmentation Network (SRSNet). The proposed model is not only a universal network but also enhances the segmentation performance without altering its structure, leveraging the reconstruction task. Additionally, as the amount of training data for the reconstruction network increases, the performance of the segmentation network also improves synchronously. On seven datasets including five medical datasets and two infrared image datasets, our SRSNet consistently achieves the best segmentation results. The code is released at https://github.com/fidshu/SRSNet.
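The abstract does not spell out the internals of DPConv, but a minimal sketch of the general dynamic-parameter-convolution idea, where a small hypernetwork predicts per-sample depthwise kernels from pooled input features, could look roughly like this. The module name, the linear kernel generator, and all shapes are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicParamConv(nn.Module):
    """Minimal dynamic-parameter convolution sketch: a linear hypernetwork
    predicts per-sample depthwise kernels from globally pooled features."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.channels, self.k = channels, kernel_size
        self.kernel_gen = nn.Linear(channels, channels * kernel_size * kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        ctx = x.mean(dim=(2, 3))                        # (B, C) global context
        kernels = self.kernel_gen(ctx).view(b * c, 1, self.k, self.k)
        # Grouped-conv trick: fold batch into channels so each sample is
        # filtered by its own predicted depthwise kernels.
        out = F.conv2d(x.reshape(1, b * c, h, w), kernels,
                       padding=self.k // 2, groups=b * c)
        return out.view(b, c, h, w)

feat = torch.randn(2, 16, 32, 32)
print(DynamicParamConv(16)(feat).shape)  # torch.Size([2, 16, 32, 32])
```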
MobileUtr: Revisiting the relationship between light-weight CNN and Transformer for efficient medical image segmentation
results: Compared to state-of-the-art methods, the proposed MobileUtr performs strongly on five public medical image datasets spanning three modalities, while remaining lightweight and computationally cheap. Abstract
Due to the scarcity and specific imaging characteristics in medical images, light-weighting Vision Transformers (ViTs) for efficient medical image segmentation is a significant challenge, and current studies have not yet paid attention to this issue. This work revisits the relationship between CNNs and Transformers in lightweight universal networks for medical image segmentation, aiming to integrate the advantages of both worlds at the infrastructure design level. In order to leverage the inductive bias inherent in CNNs, we abstract a Transformer-like lightweight CNNs block (ConvUtr) as the patch embeddings of ViTs, feeding Transformer with denoised, non-redundant and highly condensed semantic information. Moreover, an adaptive Local-Global-Local (LGL) block is introduced to facilitate efficient local-to-global information flow exchange, maximizing Transformer's global context information extraction capabilities. Finally, we build an efficient medical image segmentation model (MobileUtr) based on CNN and Transformer. Extensive experiments on five public medical image datasets with three different modalities demonstrate the superiority of MobileUtr over the state-of-the-art methods, while boasting lighter weights and lower computational cost. Code is available at https://github.com/FengheTan9/MobileUtr.
Effective Adapter for Face Recognition in the Wild
results: In zero-shot settings, the method surpasses the baselines by about 3%, 4%, and 7% on three datasets, demonstrating its effectiveness. Abstract
In this paper, we tackle the challenge of face recognition in the wild, where images often suffer from low quality and real-world distortions. Traditional heuristic approaches-either training models directly on these degraded images or their enhanced counterparts using face restoration techniques-have proven ineffective, primarily due to the degradation of facial features and the discrepancy in image domains. To overcome these issues, we propose an effective adapter for augmenting existing face recognition models trained on high-quality facial datasets. The key of our adapter is to process both the unrefined and the enhanced images by two similar structures where one is fixed and the other trainable. Such design can confer two benefits. First, the dual-input system minimizes the domain gap while providing varied perspectives for the face recognition model, where the enhanced image can be regarded as a complex non-linear transformation of the original one by the restoration model. Second, both two similar structures can be initialized by the pre-trained models without dropping the past knowledge. The extensive experiments in zero-shot settings show the effectiveness of our method by surpassing baselines of about 3%, 4%, and 7% in three datasets. Our code will be publicly available at https://github.com/liuyunhaozz/FaceAdapter/.
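A minimal sketch of the dual-branch design described above: one frozen and one trainable copy of a pretrained backbone, whose embeddings from the unrefined and restored images are fused. The fusion layer, embedding size, and the stand-in backbone are assumptions, not the paper's exact architecture.

```python
import copy
import torch
import torch.nn as nn

class DualInputAdapter(nn.Module):
    """Dual-branch adapter sketch: the raw (degraded) face goes through a
    frozen copy of a pretrained backbone, the restored face through a
    trainable copy, and the two embeddings are fused."""
    def __init__(self, pretrained_backbone: nn.Module, embed_dim: int):
        super().__init__()
        self.frozen = copy.deepcopy(pretrained_backbone)
        for p in self.frozen.parameters():
            p.requires_grad_(False)                    # keep past knowledge intact
        self.trainable = copy.deepcopy(pretrained_backbone)
        self.fuse = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, raw: torch.Tensor, restored: torch.Tensor) -> torch.Tensor:
        f_raw = self.frozen(raw)                       # unrefined view
        f_res = self.trainable(restored)               # enhanced view
        return self.fuse(torch.cat([f_raw, f_res], dim=-1))

# Toy usage with a stand-in backbone (the real model would be a face
# recognition network pretrained on high-quality data).
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 112 * 112, 128))
adapter = DualInputAdapter(backbone, embed_dim=128)
x_raw, x_restored = torch.randn(4, 3, 112, 112), torch.randn(4, 3, 112, 112)
print(adapter(x_raw, x_restored).shape)  # torch.Size([4, 128])
```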
Likelihood-Aware Semantic Alignment for Full-Spectrum Out-of-Distribution Detection
results: Experiments show that LSA delivers outstanding OOD detection in the intractable Near-OOD setting, surpassing existing methods by margins of 15.26% and 18.88% on two F-OOD benchmarks, and is more practical for real-world applications. Abstract
Full-spectrum out-of-distribution (F-OOD) detection aims to accurately recognize in-distribution (ID) samples while encountering semantic and covariate shifts simultaneously. However, existing out-of-distribution (OOD) detectors tend to overfit the covariance information and ignore intrinsic semantic correlation, inadequate for adapting to complex domain transformations. To address this issue, we propose a Likelihood-Aware Semantic Alignment (LSA) framework to promote the image-text correspondence into semantically high-likelihood regions. LSA consists of an offline Gaussian sampling strategy which efficiently samples semantic-relevant visual embeddings from the class-conditional Gaussian distribution, and a bidirectional prompt customization mechanism that adjusts both ID-related and negative context for discriminative ID/OOD boundary. Extensive experiments demonstrate the remarkable OOD detection performance of our proposed LSA especially on the intractable Near-OOD setting, surpassing existing methods by a margin of $15.26\%$ and $18.88\%$ on two F-OOD benchmarks, respectively.
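The offline class-conditional Gaussian sampling step can be pictured as below: fit a Gaussian to each class's visual embeddings, then draw semantic-relevant virtual embeddings from it. The covariance jitter, sample counts, and dimensions are illustrative assumptions.

```python
import torch

def sample_class_conditional_embeddings(features: torch.Tensor,
                                        labels: torch.Tensor,
                                        n_samples: int) -> dict[int, torch.Tensor]:
    """Offline Gaussian sampling sketch: fit a class-conditional Gaussian
    to the visual embeddings of each class, then sample from it.
    (Illustrative; the paper's exact estimator may differ.)"""
    samples = {}
    for c in labels.unique().tolist():
        feats_c = features[labels == c]                     # (N_c, D)
        mean = feats_c.mean(dim=0)
        cov = torch.cov(feats_c.T) + 1e-4 * torch.eye(feats_c.shape[1])
        dist = torch.distributions.MultivariateNormal(mean, covariance_matrix=cov)
        samples[c] = dist.sample((n_samples,))              # (n_samples, D)
    return samples

feats = torch.randn(500, 16)
labels = torch.randint(0, 5, (500,))
virtual = sample_class_conditional_embeddings(feats, labels, n_samples=16)
print(virtual[0].shape)  # torch.Size([16, 16])
```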
Simultaneous Alignment and Surface Regression Using Hybrid 2D-3D Networks for 3D Coherent Layer Segmentation of Retinal OCT Images with Full and Sparse Annotations
paper_authors: Hong Liu, Dong Wei, Donghuan Lu, Xiaoying Tang, Liansheng Wang, Yefeng Zheng
for: This paper aims to develop a novel framework for 3D retinal layer segmentation in volumetric OCT images based on hybrid 2D-3D convolutional neural networks (CNNs).
methods: The proposed framework uses an encoder consisting of 2D convolutions to extract 2D features from individual B-scans, followed by two 3D decoders coupled via a spatial transformer module to produce the alignment displacement vectors and layer segmentation. The framework is trained end-to-end using two losses that utilize the retinal layers' natural property of being smooth for B-scan alignment and layer segmentation.
results: The proposed framework achieves superior performance in terms of both layer segmentation accuracy and cross-B-scan 3D continuity compared to state-of-the-art 2D deep learning methods in both fully and semi-supervised settings. The framework is effective in aligning the B-scans for potential motion correction and offers more clinical values than previous works. Abstract
Layer segmentation is important to quantitative analysis of retinal optical coherence tomography (OCT). Recently, deep learning based methods have been developed to automate this task and yield remarkable performance. However, due to the large spatial gap and potential mismatch between the B-scans of an OCT volume, all of them were based on 2D segmentation of individual B-scans, which may lose the continuity and diagnostic information of the retinal layers in 3D space. Besides, most of these methods required dense annotation of the OCT volumes, which is labor-intensive and expertise-demanding. This work presents a novel framework based on hybrid 2D-3D convolutional neural networks (CNNs) to obtain continuous 3D retinal layer surfaces from OCT volumes, which works well with both full and sparse annotations. The 2D features of individual B-scans are extracted by an encoder consisting of 2D convolutions. These 2D features are then used to produce the alignment displacement vectors and layer segmentation by two 3D decoders coupled via a spatial transformer module. Two losses are proposed to utilize the retinal layers' natural property of being smooth for B-scan alignment and layer segmentation, respectively, and are the key to the semi-supervised learning with sparse annotation. The entire framework is trained end-to-end. To the best of our knowledge, this is the first work that attempts 3D retinal layer segmentation in volumetric OCT images based on CNNs. Experiments on a synthetic dataset and three public clinical datasets show that our framework can effectively align the B-scans for potential motion correction, and achieves superior performance to state-of-the-art 2D deep learning methods in terms of both layer segmentation accuracy and cross-B-scan 3D continuity in both fully and semi-supervised settings, thus offering more clinical values than previous works.
StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On
results: StableVITON efficiently generates high-quality virtual try-on images and shows strong flexibility and robustness across diverse person images, outperforming the baselines in both qualitative and quantitative evaluations. Abstract
Given a clothing image and a person image, an image-based virtual try-on aims to generate a customized image that appears natural and accurately reflects the characteristics of the clothing image. In this work, we aim to expand the applicability of the pre-trained diffusion model so that it can be utilized independently for the virtual try-on task. The main challenge is to preserve the clothing details while effectively utilizing the robust generative capability of the pre-trained model. In order to tackle these issues, we propose StableVITON, learning the semantic correspondence between the clothing and the human body within the latent space of the pre-trained diffusion model in an end-to-end manner. Our proposed zero cross-attention blocks not only preserve the clothing details by learning the semantic correspondence but also generate high-fidelity images by utilizing the inherent knowledge of the pre-trained model in the warping process. Through our proposed novel attention total variation loss and applying augmentation, we achieve a sharp attention map, resulting in a more precise representation of clothing details. StableVITON outperforms the baselines in qualitative and quantitative evaluation, showing promising quality in arbitrary person images. Our code is available at https://github.com/rlawjdghek/StableVITON.
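The abstract does not give the attention total variation loss in closed form; the following is a generic total-variation penalty on attention maps that captures the stated intent of sharpening them. The shapes and the exact formulation are assumptions, not StableVITON's definition.

```python
import torch

def attention_total_variation(attn: torch.Tensor) -> torch.Tensor:
    """Generic TV penalty on spatial attention maps of shape (B, heads, H, W):
    penalizes noisy, diffuse maps so attention stays sharp.
    (Illustrative; the paper's exact ATV loss may differ.)"""
    dh = (attn[..., 1:, :] - attn[..., :-1, :]).abs().mean()
    dw = (attn[..., :, 1:] - attn[..., :, :-1]).abs().mean()
    return dh + dw

attn = torch.softmax(torch.randn(2, 8, 16 * 16), dim=-1).view(2, 8, 16, 16)
print(attention_total_variation(attn).item())
```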
Disentangled Interaction Representation for One-Stage Human-Object Interaction Detection
results: Experiments show that our method can be readily applied to existing one-stage HOI detectors and achieves state-of-the-art performance on the HICO-DET and V-COCO benchmarks. Abstract
Human-Object Interaction (HOI) detection is a core task for human-centric image understanding. Recent one-stage methods adopt a transformer decoder to collect image-wide cues that are useful for interaction prediction; however, the interaction representations obtained using this method are entangled and lack interpretability. In contrast, traditional two-stage methods benefit significantly from their ability to compose interaction features in a disentangled and explainable manner. In this paper, we improve the performance of one-stage methods by enabling them to extract disentangled interaction representations. First, we propose Shunted Cross-Attention (SCA) to extract human appearance, object appearance, and global context features using different cross-attention heads. This is achieved by imposing different masks on the cross-attention maps produced by the different heads. Second, we introduce the Interaction-aware Pose Estimation (IPE) task to learn interaction-relevant human pose features using a disentangled decoder. This is achieved with a novel attention module that accurately captures the human keypoints relevant to the current interaction category. Finally, our approach fuses the appearance feature and pose feature via element-wise addition to form the interaction representation. Experimental results show that our approach can be readily applied to existing one-stage HOI detectors. Moreover, we achieve state-of-the-art performance on two benchmarks: HICO-DET and V-COCO.
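A toy sketch of the shunting idea: imposing a different mask on each cross-attention head so that heads specialize in human appearance, object appearance, and global context. The human/object index split, shapes, and names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def shunted_cross_attention(q, k, v, head_masks):
    """Cross-attention where each head sees a different mask (e.g. human
    region, object region, no mask for global context).
    q: (B, H, Nq, d), k/v: (B, H, Nk, d), head_masks: (H, Nk) bool,
    True = key position visible to that head."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5    # (B, H, Nq, Nk)
    mask = head_masks[None, :, None, :]                      # broadcast over B, Nq
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

B, H, Nq, Nk, d = 2, 3, 4, 49, 8
q, k, v = (torch.randn(B, H, n, d) for n in (Nq, Nk, Nk))
head_masks = torch.ones(H, Nk, dtype=torch.bool)
head_masks[0, 25:] = False   # head 0: "human" region only (toy split)
head_masks[1, :25] = False   # head 1: "object" region only; head 2: global
print(shunted_cross_attention(q, k, v, head_masks).shape)  # (2, 3, 4, 8)
```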
Regressor-Segmenter Mutual Prompt Learning for Crowd Counting
paper_authors: Mingyue Guo, Li Yuan, Zhaoyi Yan, Binghui Chen, Yaowei Wang, Qixiang Ye
for: This work aims to improve the accuracy of crowd counting by resolving the bias and inaccuracy caused by annotation variance.
methods: It proposes mutual prompt learning (mPrompt), in which a regressor and a segmenter serve as guidance for each other to counter annotation variance. mPrompt uses point annotations to tune the segmenter and predict pseudo head masks, and uses the predicted segmentation masks as a spatial constraint to rectify biased point annotations.
results: Experiments show that mPrompt significantly reduces the Mean Average Error (MAE), demonstrating its potential as a general framework for downstream vision tasks. Abstract
Crowd counting has achieved significant progress by training regressors to predict instance positions. In heavily crowded scenarios, however, regressors are challenged by uncontrollable annotation variance, which causes density map bias and context information inaccuracy. In this study, we propose mutual prompt learning (mPrompt), which leverages a regressor and a segmenter as guidance for each other, solving bias and inaccuracy caused by annotation variance while distinguishing foreground from background. In specific, mPrompt leverages point annotations to tune the segmenter and predict pseudo head masks in a way of point prompt learning. It then uses the predicted segmentation masks, which serve as spatial constraint, to rectify biased point annotations as context prompt learning. mPrompt defines a way of mutual information maximization from prompt learning, mitigating the impact of annotation variance while improving model accuracy. Experiments show that mPrompt significantly reduces the Mean Average Error (MAE), demonstrating the potential to be general framework for down-stream vision tasks.
results: Experiments show that GenEM generates realistic cryo-EM images, and the generated data further improve particle picking and pose estimation models, ultimately raising the reconstruction resolution. Abstract
In the past decade, deep conditional generative models have revolutionized the generation of realistic images, extending their application from entertainment to scientific domains. Single-particle cryo-electron microscopy (cryo-EM) is crucial in resolving near-atomic resolution 3D structures of proteins, such as the SARS-COV-2 spike protein. To achieve high-resolution reconstruction, AI models for particle picking and pose estimation have been adopted. However, their performance is still limited as they lack high-quality annotated datasets. To address this, we introduce physics-informed generative cryo-electron microscopy (GenEM), which for the first time integrates physical-based cryo-EM simulation with a generative unpaired noise translation to generate physically correct synthetic cryo-EM datasets with realistic noises. Initially, GenEM simulates the cryo-EM imaging process based on a virtual specimen. To generate realistic noises, we leverage an unpaired noise translation via contrastive learning with a novel mask-guided sampling scheme. Extensive experiments show that GenEM is capable of generating realistic cryo-EM images. The generated dataset can further enhance particle picking and pose estimation models, eventually improving the reconstruction resolution. We will release our code and annotated synthetic datasets.
BEVNeXt: Reviving Dense BEV Frameworks for 3D Object Detection
results: On the nuScenes benchmark, BEVNeXt outperforms both BEV-based and query-based frameworks under various settings, achieving a state-of-the-art 64.2 NDS on the nuScenes test set. Abstract
Recently, the rise of query-based Transformer decoders is reshaping camera-based 3D object detection. These query-based decoders are surpassing the traditional dense BEV (Bird's Eye View)-based methods. However, we argue that dense BEV frameworks remain important due to their outstanding abilities in depth estimation and object localization, depicting 3D scenes accurately and comprehensively. This paper aims to address the drawbacks of the existing dense BEV-based 3D object detectors by introducing our proposed enhanced components, including a CRF-modulated depth estimation module enforcing object-level consistencies, a long-term temporal aggregation module with extended receptive fields, and a two-stage object decoder combining perspective techniques with CRF-modulated depth embedding. These enhancements lead to a "modernized" dense BEV framework dubbed BEVNeXt. On the nuScenes benchmark, BEVNeXt outperforms both BEV-based and query-based frameworks under various settings, achieving a state-of-the-art result of 64.2 NDS on the nuScenes test set.
Fast and accurate sparse-view CBCT reconstruction using meta-learned neural attenuation field and hash-encoding regularization
results: Across CBCT scans of different body parts and from different vendors, the FACT method delivers better reconstruction quality and faster optimization. Abstract
Cone beam computed tomography (CBCT) is an emerging medical imaging technique to visualize the internal anatomical structures of patients. During a CBCT scan, several projection images of different angles or views are collectively utilized to reconstruct a tomographic image. However, reducing the number of projections in a CBCT scan while preserving the quality of a reconstructed image is challenging due to the nature of an ill-posed inverse problem. Recently, a neural attenuation field (NAF) method was proposed by adopting a neural radiance field algorithm as a new way for CBCT reconstruction, demonstrating fast and promising results using only 50 views. However, decreasing the number of projections is still preferable to reduce potential radiation exposure, and a faster reconstruction time is required considering a typical scan time. In this work, we propose a fast and accurate sparse-view CBCT reconstruction (FACT) method to provide better reconstruction quality and faster optimization speed in the minimal number of view acquisitions ($<$ 50 views). In the FACT method, we meta-trained a neural network and a hash-encoder using a few scans (= 15), and a new regularization technique is utilized to reconstruct the details of an anatomical structure. In conclusion, we have shown that the FACT method produced better, and faster reconstruction results over the other conventional algorithms based on CBCT scans of different body parts (chest, head, and abdomen) and CT vendors (Siemens, Phillips, and GE).
Adversarial Medical Image with Hierarchical Feature Hiding
results: Experiments show that HFC hides adversarial examples for medical images more effectively and bypasses an array of state-of-the-art medical AE detectors, exposing the deficiencies of current reactive defenses and motivating more robust defenses in the future. Abstract
Deep learning based methods for medical images can be easily compromised by adversarial examples (AEs), posing a great security flaw in clinical decision-making. It has been discovered that conventional adversarial attacks like PGD which optimize the classification logits, are easy to distinguish in the feature space, resulting in accurate reactive defenses. To better understand this phenomenon and reassess the reliability of the reactive defenses for medical AEs, we thoroughly investigate the characteristic of conventional medical AEs. Specifically, we first theoretically prove that conventional adversarial attacks change the outputs by continuously optimizing vulnerable features in a fixed direction, thereby leading to outlier representations in the feature space. Then, a stress test is conducted to reveal the vulnerability of medical images, by comparing with natural images. Interestingly, this vulnerability is a double-edged sword, which can be exploited to hide AEs. We then propose a simple-yet-effective hierarchical feature constraint (HFC), a novel add-on to conventional white-box attacks, which assists to hide the adversarial feature in the target feature distribution. The proposed method is evaluated on three medical datasets, both 2D and 3D, with different modalities. The experimental results demonstrate the superiority of HFC, \emph{i.e.,} it bypasses an array of state-of-the-art adversarial medical AE detectors more efficiently than competing adaptive attacks, which reveals the deficiencies of medical reactive defense and allows to develop more robust defenses in future.
Multi-task Image Restoration Guided By Robust DINO Features
results: Across a range of image restoration tasks, DINO-IR achieves substantial improvements over existing multi-task image restoration methods, with higher efficiency and robustness. Abstract
Multi-task image restoration has gained significant interest due to its inherent versatility and efficiency compared to its single-task counterpart. Despite its potential, performance degradation is observed with an increase in the number of tasks, primarily attributed to the distinct nature of each restoration task. Addressing this challenge, we introduce DINO-IR, a novel multi-task image restoration approach leveraging robust features extracted from DINOv2. Our empirical analysis shows that while shallow features of DINOv2 capture rich low-level image characteristics, the deep features ensure a robust semantic representation insensitive to degradations while preserving high-frequency contour details. Building on these features, we devise specialized components, including a multi-layer semantic fusion module, a DINO-Restore adaption and fusion module, and a DINO perception contrastive loss, to integrate DINOv2 features into the restoration paradigm. Equipped with the aforementioned components, our DINO-IR performs favorably against existing multi-task image restoration approaches in various tasks by a large margin, indicating the superiority and necessity of reinforcing the robust features for multi-task image restoration.
MedXChat: Bridging CXR Modalities with a Unified Multimodal Large Model
results: MedXChat excels in multimodal medical applications, surpassing benchmark models on the MIMIC dataset. In addition, a novel text-to-CXR synthesis approach leverages instruction-following within the Stable Diffusion (SD) architecture; it integrates smoothly with the existing framework, requires no extra parameters, preserves SD's generative strength, and renders fine-grained medical images with high fidelity. Abstract
Despite the success of Large Language Models (LLMs) in general image tasks, a gap persists in the medical field for a multimodal large model adept at handling the nuanced diversity of medical images. Addressing this, we propose MedXChat, a unified multimodal large model designed for seamless interactions between medical assistants and users. MedXChat encompasses three key functionalities: CXR(Chest X-ray)-to-Report generation, CXR-based visual question-answering (VQA), and Text-to-CXR synthesis. Our contributions are as follows. Firstly, our model showcases exceptional cross-task adaptability, displaying adeptness across all three defined tasks and outperforming the benchmark models on the MIMIC dataset in medical multimodal applications. Secondly, we introduce an innovative Text-to-CXR synthesis approach that utilizes instruction-following capabilities within the Stable Diffusion (SD) architecture. This technique integrates smoothly with the existing model framework, requiring no extra parameters, thereby maintaining the SD's generative strength while also bestowing upon it the capacity to render fine-grained medical images with high fidelity. Comprehensive experiments validate MedXChat's synergistic enhancement across all tasks. Our instruction data and model will be open-sourced.
Multimodality-guided Image Style Transfer using Cross-modal GAN Inversion
results: Experiments and user studies show that the method achieves state-of-the-art performance on text-guided image style transfer and is also effective on the MMIST task and cross-modal style interpolation. Abstract
Image Style Transfer (IST) is an interdisciplinary topic of computer vision and art that continuously attracts researchers' interests. Different from traditional Image-guided Image Style Transfer (IIST) methods that require a style reference image as input to define the desired style, recent works start to tackle the problem in a text-guided manner, i.e., Text-guided Image Style Transfer (TIST). Compared to IIST, such approaches provide more flexibility with text-specified styles, which are useful in scenarios where the style is hard to define with reference images. Unfortunately, many TIST approaches produce undesirable artifacts in the transferred images. To address this issue, we present a novel method to achieve much improved style transfer based on text guidance. Meanwhile, to offer more flexibility than IIST and TIST, our method allows style inputs from multiple sources and modalities, enabling MultiModality-guided Image Style Transfer (MMIST). Specifically, we realize MMIST with a novel cross-modal GAN inversion method, which generates style representations consistent with specified styles. Such style representations facilitate style transfer and in principle generalize any IIST methods to MMIST. Large-scale experiments and user studies demonstrate that our method achieves state-of-the-art performance on TIST task. Furthermore, comprehensive qualitative results confirm the effectiveness of our method on MMIST task and cross-modal style interpolation.
HumanNeRF-SE: A Simple yet Effective Approach to Animate HumanNeRF with Diverse Poses
results: The model synthesizes images under arbitrary poses from few-shot input, improves pose generalization over existing HumanNeRF studies, and reduces computational complexity without using any acceleration modules. Abstract
We present HumanNeRF-SE, which can synthesize diverse novel pose images with simple input. Previous HumanNeRF studies require large neural networks to fit the human appearance and prior knowledge. Subsequent methods build upon this approach with some improvements. Instead, we reconstruct this approach, combining explicit and implicit human representations with both general and specific mapping processes. Our key insight is that explicit shape can filter the information used to fit implicit representation, and frozen general mapping combined with point-specific mapping can effectively avoid overfitting and improve pose generalization performance. Our explicit and implicit human represent combination architecture is extremely effective. This is reflected in our model's ability to synthesize images under arbitrary poses with few-shot input and increase the speed of synthesizing images by 15 times through a reduction in computational complexity without using any existing acceleration modules. Compared to the state-of-the-art HumanNeRF studies, HumanNeRF-SE achieves better performance with fewer learnable parameters and less training time (see Figure 1).
RiskBench: A Scenario-based Benchmark for Risk Identification
results: The study evaluates ten algorithms on risk detection and localization, risk anticipation, and decision-making support through extensive experiments, and summarizes directions for future research on risk identification. Abstract
Intelligent driving systems aim to achieve a zero-collision mobility experience, requiring interdisciplinary efforts to enhance safety performance. This work focuses on risk identification, the process of identifying and analyzing risks stemming from dynamic traffic participants and unexpected events. While significant advances have been made in the community, the current evaluation of different risk identification algorithms uses independent datasets, leading to difficulty in direct comparison and hindering collective progress toward safety performance enhancement. To address this limitation, we introduce \textbf{RiskBench}, a large-scale scenario-based benchmark for risk identification. We design a scenario taxonomy and augmentation pipeline to enable a systematic collection of ground truth risks under diverse scenarios. We assess the ability of ten algorithms to (1) detect and locate risks, (2) anticipate risks, and (3) facilitate decision-making. We conduct extensive experiments and summarize future research on risk identification. Our aim is to encourage collaborative endeavors in achieving a society with zero collisions. We have made our dataset and benchmark toolkit publicly on the project page: https://hcis-lab.github.io/RiskBench/
Adaptive Confidence Threshold for ByteTrack in Multi-Object Tracking
paper_authors: Linh Van Ma, Muhammad Ishfaq Hussain, JongHyun Park, Jeongbae Kim, Moongu Jeon
for: Multi-object tracking
methods: The ByteTrack algorithm combined with an adaptive confidence threshold technique
results: Achieves improved effectiveness and robustness over the conventional ByteTrack method while maintaining comparable running time. Abstract
We investigate the application of ByteTrack in the realm of multiple object tracking. ByteTrack, a simple tracking algorithm, enables the simultaneous tracking of multiple objects by strategically incorporating detections with a low confidence threshold. Conventionally, objects are initially associated with high confidence threshold detections. When the association between objects and detections becomes ambiguous, ByteTrack extends the association to lower confidence threshold detections. One notable drawback of the existing ByteTrack approach is its reliance on a fixed threshold to differentiate between high and low-confidence detections. In response to this limitation, we introduce a novel and adaptive approach. Our proposed method entails a dynamic adjustment of the confidence threshold, leveraging insights derived from overall detections. Through experimentation, we demonstrate the effectiveness of our adaptive confidence threshold technique while maintaining running time compared to ByteTrack.
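One way to realize the dynamic threshold described above is to derive the high/low split from the current frame's detection-score statistics. The mean-minus-std statistic and the clip range below are assumptions, not the paper's exact rule.

```python
import numpy as np

def adaptive_threshold(scores: np.ndarray,
                       lo: float = 0.1, hi: float = 0.6) -> float:
    """Sketch of an adaptive high/low split for ByteTrack: instead of a
    fixed threshold, derive it from the overall detection scores of the
    current frame (here, mean minus one std, clipped to a sane range)."""
    if scores.size == 0:
        return hi
    return float(np.clip(scores.mean() - scores.std(), lo, hi))

frame_scores = np.array([0.92, 0.88, 0.75, 0.40, 0.35, 0.22])
tau = adaptive_threshold(frame_scores)
high = frame_scores[frame_scores >= tau]   # first association stage
low = frame_scores[frame_scores < tau]     # second (low-confidence) stage
print(f"tau={tau:.2f}, high={high}, low={low}")
```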
methods: Builds mainly on tiny CNN-based SR methods with fewer than 5k parameters, proposing improved multi-path learning and a self-defined activation function.
results: Experiments show that TMSR achieves competitive image quality (PSNR and SSIM) compared to related methods under 5k parameters. Abstract
In this paper, we proposed a tiny multi-path CNN-based Super-Resolution (SR) method, called TMSR. We mainly refer to some tiny CNN-based SR methods, under 5k parameters. The main contribution of the proposed method is the improved multi-path learning and self-defined activated function. The experimental results show that TMSR obtains competitive image quality (i.e. PSNR and SSIM) compared to the related works under 5k parameters.
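For reference, PSNR, the quantitative metric used here, is computed as 10 * log10(MAX^2 / MSE); a minimal implementation follows (the toy images are illustrative).

```python
import numpy as np

def psnr(reference: np.ndarray, test: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB: PSNR = 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

ref = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
noisy = np.clip(ref + np.random.normal(0, 5, ref.shape), 0, 255).astype(np.uint8)
print(f"PSNR: {psnr(ref, noisy):.2f} dB")
```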
SequencePAR: Understanding Pedestrian Attributes via A Sequence Generation Paradigm
results: Extensive experiments validate the effectiveness of the proposed SequencePAR, which achieves high accuracy and robustness on multiple widely used pedestrian attribute recognition datasets. Abstract
Current pedestrian attribute recognition (PAR) algorithms are developed based on multi-label or multi-task learning frameworks, which aim to discriminate the attributes using specific classification heads. However, these discriminative models are easily influenced by imbalanced data or noisy samples. Inspired by the success of generative models, we rethink the pedestrian attribute recognition scheme and believe the generative models may perform better on modeling dependencies and complexity between human attributes. In this paper, we propose a novel sequence generation paradigm for pedestrian attribute recognition, termed SequencePAR. It extracts the pedestrian features using a pre-trained CLIP model and embeds the attribute set into query tokens under the guidance of text prompts. Then, a Transformer decoder is proposed to generate the human attributes by incorporating the visual features and attribute query tokens. The masked multi-head attention layer is introduced into the decoder module to prevent the model from remembering the next attribute while making attribute predictions during training. Extensive experiments on multiple widely used pedestrian attribute recognition datasets fully validated the effectiveness of our proposed SequencePAR. The source code and pre-trained models will be released at https://github.com/Event-AHU/OpenPAR.
J-Net: Improved U-Net for Terahertz Image Super-Resolution
results: Trained on the DIV2K+Flickr2K dataset, J-Net achieves a PSNR (peak signal-to-noise ratio) of 32.52 dB, surpassing other THz image super-resolution methods by more than 1 dB, and also shows better PSNR and visual improvement on real THz images. Abstract
Terahertz (THz) waves are electromagnetic waves in the 0.1 to 10 THz frequency range, and THz imaging is utilized in a range of applications, including security inspections, biomedical fields, and the non-destructive examination of materials. However, THz images have low resolution due to the long wavelength of THz waves. Therefore, improving the resolution of THz images is one of the current hot research topics. We propose a novel network architecture called J-Net which is improved version of U-Net to solve the THz image super-resolution. It employs the simple baseline blocks which can extract low resolution (LR) image features and learn the mapping of LR images to highresolution (HR) images efficiently. All training was conducted using the DIV2K+Flickr2K dataset, and we employed the peak signal-to-noise ratio (PSNR) for quantitative comparison. In our comparisons with other THz image super-resolution methods, JNet achieved a PSNR of 32.52 dB, surpassing other techniques by more than 1 dB. J-Net also demonstrates superior performance on real THz images compared to other methods. Experiments show that the proposed J-Net achieves better PSNR and visual improvement compared with other THz image super-resolution methods.
GaussianHead: Impressive 3D Gaussian-based Head Avatars with Dynamic Hybrid Neural Field
results: Compared to state-of-the-art techniques, the algorithm achieves optimal visual results on self-reconstruction, novel view synthesis, and cross-identity reenactment while maintaining high rendering efficiency (0.12 s per frame); in some cases even the pores around the nose are clearly visible. Abstract
Previous head avatar methods have mostly relied on fixed explicit primitives (mesh, point) or implicit surfaces (Signed Distance Function) and volumetric neural radiance fields, making it challenging to strike a balance among high fidelity, training speed, and resource consumption. The recent popularity of hybrid fields has brought novel representations, but they are limited by relying on parameterization factors obtained through fixed mappings. We propose GaussianHead: a head avatar algorithm based on anisotropic 3D gaussian primitives. We leverage canonical gaussians to represent dynamic scenes. Using an explicit "dynamic" tri-plane as an efficient container for parameterized head geometry, well aligned with factors in the underlying geometry and tri-plane, we obtain aligned canonical factors for the canonical gaussians. With a tiny MLP, factors are decoded into opacity and spherical harmonic coefficients of 3D gaussian primitives. Finally, we use an efficient differentiable gaussian rasterizer for rendering. Our approach benefits significantly from our novel representation based on 3D gaussians, and the proper alignment transformation of underlying geometry structures and factors in the tri-plane eliminates biases introduced by fixed mappings. Compared to state-of-the-art techniques, we achieve optimal visual results in tasks such as self-reconstruction, novel view synthesis, and cross-identity reenactment while maintaining high rendering efficiency (0.12s per frame). Even the pores around the nose are clearly visible in some cases. Code and additional video can be found on the project homepage.
paper_authors: Piotr Teterwak, Ximeng Sun, Bryan A. Plummer, Kate Saenko, Ser-Nam Lim
for: This paper studies whether modern large language models (LLMs) can be adapted to image classification tasks.
methods: The paper performs light fine-tuning of LLMs using the same contrastive image-caption matching objective as CLIP.
results: With lightweight fine-tuning, LLMs achieve good image classification performance, beating state-of-the-art multimodal LLMs by 13%, while retaining the LLM's generative abilities. Abstract
Large language models (LLMs) have emerged as powerful general-purpose interfaces for many machine learning problems. Recent work has adapted LLMs to generative visual tasks like image captioning, visual question answering, and visual chat, using a relatively small amount of instruction-tuning data. In this paper, we explore whether modern LLMs can also be adapted to classifying an image into a set of categories. First, we evaluate multimodal LLMs that are tuned for generative tasks on zero-shot image classification and find that their performance is far below that of specialized models like CLIP. We then propose an approach for light fine-tuning of LLMs using the same contrastive image-caption matching objective as CLIP. Our results show that LLMs can, indeed, achieve good image classification performance when adapted this way. Our approach beats state-of-the-art mLLMs by 13% and slightly outperforms contrastive learning with a custom text model, while also retaining the LLM's generative abilities. LLM initialization appears to particularly help classification in domains under-represented in the visual pre-training data.
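The contrastive image-caption matching objective borrowed from CLIP is the standard symmetric cross-entropy over batch-wise similarities; a minimal sketch follows (embedding size and temperature are illustrative, and the encoders are stand-ins).

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """CLIP-style symmetric contrastive loss: each image must match its
    own caption among all captions in the batch, and vice versa."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature      # (B, B) similarity
    targets = torch.arange(logits.shape[0])
    loss_i2t = F.cross_entropy(logits, targets)          # image -> caption
    loss_t2i = F.cross_entropy(logits.t(), targets)      # caption -> image
    return 0.5 * (loss_i2t + loss_t2i)

img = torch.randn(8, 512)   # stand-ins for encoder outputs
txt = torch.randn(8, 512)
print(clip_contrastive_loss(img, txt).item())
```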
Universal Segmentation at Arbitrary Granularity with Language Instruction
for: This study aims to achieve universal segmentation at arbitrary semantic granularity.
methods: The model performs segmentation under the guidance of language instructions.
results: The model performs strongly across a wide range of tasks and settings, surpassing both specialist and unified segmentation models. Abstract
This paper aims to achieve universal segmentation of arbitrary semantic level. Despite significant progress in recent years, specialist segmentation approaches are limited to specific tasks and data distribution. Retraining a new model for adaptation to new scenarios or settings takes expensive computation and time cost, which raises the demand for versatile and universal segmentation model that can cater to various granularity. Although some attempts have been made for unifying different segmentation tasks or generalization to various scenarios, limitations in the definition of paradigms and input-output spaces make it difficult for them to achieve accurate understanding of content at arbitrary granularity. To this end, we present UniLSeg, a universal segmentation model that can perform segmentation at any semantic level with the guidance of language instructions. For training UniLSeg, we reorganize a group of tasks from original diverse distributions into a unified data format, where images with texts describing segmentation targets as input and corresponding masks are output. Combined with a automatic annotation engine for utilizing numerous unlabeled data, UniLSeg achieves excellent performance on various tasks and settings, surpassing both specialist and unified segmentation models.
SchurVINS: Schur Complement-Based Lightweight Visual Inertial Navigation System
methods: Proposes a novel filter-based VINS framework, named SchurVINS, that guarantees both high accuracy and low computational complexity. The gradient, Hessian, and observation covariance are explicitly modeled; the Schur complement then decomposes the full residual model into an ego-motion residual model and a landmark residual model, and EKF updates are performed on both with high efficiency.
results: Experiments on the EuRoC and TUM-VI datasets show that the method notably outperforms state-of-the-art (SOTA) methods in both accuracy and computational complexity. Abstract
Accuracy and computational efficiency are the most important metrics to Visual Inertial Navigation System (VINS). The existing VINS algorithms with either high accuracy or low computational complexity, are difficult to provide the high precision localization in resource-constrained devices. To this end, we propose a novel filter-based VINS framework named SchurVINS, which could guarantee both high accuracy by building a complete residual model and low computational complexity with Schur complement. Technically, we first formulate the full residual model where Gradient, Hessian and observation covariance are explicitly modeled. Then Schur complement is employed to decompose the full model into ego-motion residual model and landmark residual model. Finally, Extended Kalman Filter (EKF) update is implemented in these two models with high efficiency. Experiments on EuRoC and TUM-VI datasets show that our method notably outperforms state-of-the-art (SOTA) methods in both accuracy and computational complexity. We will open source our experimental code to benefit the community.
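The Schur-complement marginalization at the heart of the method can be shown on a toy Gauss-Newton system: eliminate the landmark block to obtain a small ego-motion system, then back-substitute. The dimensions below (6 pose parameters, 3 landmarks) are illustrative, not the paper's setup.

```python
import numpy as np

# For a Gauss-Newton system
#   [A  B ] [dx_pose    ]   [u]
#   [B' C ] [dx_landmark] = [v],
# marginalizing the landmark block gives the reduced ego-motion system
#   (A - B C^{-1} B') dx_pose = u - B C^{-1} v.
rng = np.random.default_rng(0)
J = rng.standard_normal((40, 6 + 9))          # 6 pose params, 3 landmarks x 3
H, b = J.T @ J, J.T @ rng.standard_normal(40)

A, B, C = H[:6, :6], H[:6, 6:], H[6:, 6:]
u, v = b[:6], b[6:]
C_inv = np.linalg.inv(C)                       # block-diagonal in practice -> cheap
H_pose = A - B @ C_inv @ B.T                   # Schur complement of C
b_pose = u - B @ C_inv @ v

dx_pose = np.linalg.solve(H_pose, b_pose)      # reduced ego-motion update
dx_landmark = C_inv @ (v - B.T @ dx_pose)      # back-substitution
# Sanity check against solving the full system directly:
assert np.allclose(np.linalg.solve(H, b), np.r_[dx_pose, dx_landmark])
print(dx_pose)
```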
TextAug: Test time Text Augmentation for Multimodal Person Re-identification
paper_authors: Mulham Fawakherji, Eduard Vazquez, Pasquale Giampa, Binod Bhattarai
for: Improving the performance of multimodal person re-identification
methods: Applies data augmentation techniques common in the image domain, cutout and cutmix, to text
results: Proposes a simple yet effective text augmentation method that improves performance on multimodal person re-identification tasks. Abstract
Multimodal Person Reidentification is gaining popularity in the research community due to its effectiveness compared to counter-part unimodal frameworks. However, the bottleneck for multimodal deep learning is the need for a large volume of multimodal training examples. Data augmentation techniques such as cropping, flipping, rotation, etc. are often employed in the image domain to improve the generalization of deep learning models. Augmenting in other modalities than images, such as text, is challenging and requires significant computational resources and external data sources. In this study, we investigate the effectiveness of two computer vision data augmentation techniques: cutout and cutmix, for text augmentation in multi-modal person re-identification. Our approach merges these two augmentation strategies into one strategy called CutMixOut which involves randomly removing words or sub-phrases from a sentence (Cutout) and blending parts of two or more sentences to create diverse examples (CutMix) with a certain probability assigned to each operation. This augmentation was implemented at inference time without any prior training. Our results demonstrate that the proposed technique is simple and effective in improving the performance on multiple multimodal person re-identification benchmarks.
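A toy sketch of the CutMixOut idea on raw strings: random word removal (Cutout) plus splicing a span from another description (CutMix), each applied with some probability at inference time. The span choice, drop ratio, and probabilities are assumptions for illustration.

```python
import random

def cutmixout(sentence: str, other: str,
              p_cutout: float = 0.5, p_cutmix: float = 0.5) -> str:
    """With probability p_cutout drop ~20% of the words (Cutout); with
    probability p_cutmix splice a span from another description (CutMix)."""
    words = sentence.split()
    if random.random() < p_cutout and len(words) > 2:
        drop = set(random.sample(range(len(words)), k=max(1, len(words) // 5)))
        words = [w for i, w in enumerate(words) if i not in drop]
    if random.random() < p_cutmix:
        donor = other.split()
        span = donor[len(donor) // 3: 2 * len(donor) // 3]  # middle chunk of the donor
        pos = random.randrange(len(words) + 1)
        words = words[:pos] + span + words[pos:]
    return " ".join(words)

a = "a man wearing a red jacket and blue jeans carrying a black backpack"
b = "a woman in a white coat holding a brown leather handbag"
print(cutmixout(a, b, p_cutout=1.0, p_cutmix=1.0))  # force both ops for the demo
```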
results: Performs strongly on several challenging vision-language benchmarks, including ScienceQA and fine-grained visual classification, significantly outperforming existing methods in semantic understanding and detail description for zero-shot image reasoning tasks. Abstract
Aligning the recent large language models (LLMs) with computer vision models leads to large vision-language models (LVLMs), which have paved the way for zero-shot image reasoning tasks. However, LVLMs are usually trained on short high-level captions only referring to sparse focus regions in images. Such a ``tunnel vision'' limits LVLMs to exploring other relevant contexts in complex scenes. To address this challenge, we introduce Question-Driven Visual Exploration (QVix), a novel prompting strategy that enhances the exploratory capabilities of LVLMs in zero-shot reasoning tasks. QVix leverages LLMs' strong language prior to generate input-exploratory questions with more details than the original query, guiding LVLMs to explore visual content more comprehensively and uncover subtle or peripheral details. QVix enables a wider exploration of visual scenes, improving the LVLMs' reasoning accuracy and depth in tasks such as visual question answering and visual entailment. Our evaluations on various challenging zero-shot vision-language benchmarks, including ScienceQA and fine-grained visual classification, demonstrate that QVix significantly outperforms existing methods, highlighting its effectiveness in bridging the gap between complex visual data and LVLMs' exploratory abilities.
SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference
results: On eight semantic segmentation benchmarks, the training-free adaptation of CLIP achieves a 38.2% average zero-shot mIoU, significantly outperforming the existing SoTA (33.9%) and vanilla CLIP (14.1%). Abstract
Recent advances in contrastive language-image pretraining (CLIP) have demonstrated strong capabilities in zero-shot classification by aligning visual representations with target text embeddings in an image level. However, in dense prediction tasks, CLIP often struggles to localize visual features within an image and fails to give accurate pixel-level predictions, which prevents it from functioning as a generalized visual foundation model. In this work, we aim to enhance CLIP's potential for semantic segmentation with minimal modifications to its pretrained models. By rethinking self-attention, we surprisingly find that CLIP can adapt to dense prediction tasks by simply introducing a novel Correlative Self-Attention (CSA) mechanism. Specifically, we replace the traditional self-attention block of CLIP vision encoder's last layer by our CSA module and reuse its pretrained projection matrices of query, key, and value, leading to a training-free adaptation approach for CLIP's zero-shot semantic segmentation. Extensive experiments show the advantage of CSA: we obtain a 38.2% average zero-shot mIoU across eight semantic segmentation benchmarks highlighted in this paper, significantly outperforming the existing SoTA's 33.9% and the vanilla CLIP's 14.1%.
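A sketch of what a correlative self-attention step could look like: correlating queries with queries and keys with keys, while reusing the pretrained projection matrices. The exact formula is loosely inferred from the abstract's description and should be treated as an assumption; see the original work for the precise definition.

```python
import torch
import torch.nn.functional as F

def correlative_self_attention(x, w_q, w_k, w_v, scale):
    """Correlative self-attention sketch: instead of query-key attention,
    combine query-query and key-key correlations (with CLIP's pretrained
    projections reused) so attention focuses on semantically similar
    locations."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    attn = F.softmax(q @ q.transpose(-2, -1) * scale, dim=-1) \
         + F.softmax(k @ k.transpose(-2, -1) * scale, dim=-1)
    return attn @ v

tokens = torch.randn(1, 196, 768)                      # 14x14 patch tokens
w_q, w_k, w_v = (torch.randn(768, 768) * 0.02 for _ in range(3))
out = correlative_self_attention(tokens, w_q, w_k, w_v, scale=768 ** -0.5)
print(out.shape)  # torch.Size([1, 196, 768])
```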
PixelLM: Pixel Reasoning with Large Multimodal Model
results: Across multiple pixel-level image reasoning and understanding tasks, PixelLM outperforms well-established methods on benchmarks including MUSE and single- and multi-referring segmentation; comprehensive ablations confirm the efficacy of each proposed component. Abstract
While large multimodal models (LMMs) have achieved remarkable progress, generating pixel-level masks for image reasoning tasks involving multiple open-world targets remains a challenge. To bridge this gap, we introduce PixelLM, an effective and efficient LMM for pixel-level reasoning and understanding. Central to PixelLM is a novel, lightweight pixel decoder and a comprehensive segmentation codebook. The decoder efficiently produces masks from the hidden embeddings of the codebook tokens, which encode detailed target-relevant information. With this design, PixelLM harmonizes with the structure of popular LMMs and avoids the need for additional costly segmentation models. Furthermore, we propose a target refinement loss to enhance the model's ability to differentiate between multiple targets, leading to substantially improved mask quality. To advance research in this area, we construct MUSE, a high-quality multi-target reasoning segmentation benchmark. PixelLM excels across various pixel-level image reasoning and understanding tasks, outperforming well-established methods in multiple benchmarks, including MUSE, single- and multi-referring segmentation. Comprehensive ablations confirm the efficacy of each proposed component. All code, models, and datasets will be publicly available.
摘要
大型多模态模型(LMM)已经取得了显著进展,但在涉及多个开放世界目标的图像推理任务中生成像素级掩码仍然是一大挑战。为弥合这一差距,我们提出PixelLM,一种有效且高效的LMM,用于像素级推理与理解。PixelLM的核心是一个新颖的轻量级像素解码器和一个完整的分割码本。解码器可以高效地从码本token的隐藏嵌入中生成掩码,这些嵌入编码了与目标相关的细节信息。这种设计使PixelLM与主流LMM的结构相契合,并避免了额外昂贵的分割模型。此外,我们提出了一种目标细化损失,以提高模型区分多个目标的能力,从而显著提升掩码质量。为进一步推动该领域的研究,我们构建了高质量的多目标推理分割基准MUSE。PixelLM在多种像素级图像推理与理解任务中表现出色,在包括MUSE、单指代和多指代分割在内的多个基准上超越了成熟方法。全面的消融实验验证了每个组件的有效性。所有代码、模型和数据集都将公开。
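The abstract's lightweight pixel decoder can be pictured as a similarity map between segmentation-token embeddings and dense image features. The sketch below is our hedged approximation of that idea; the class name, dimensions, and the plain linear projection are illustrative assumptions, not the paper's exact decoder.

```python
import torch
import torch.nn as nn

class TinyPixelDecoder(nn.Module):
    """Illustrative stand-in for a lightweight pixel decoder: each mask is
    the similarity map between one segmentation-token embedding and dense
    image features (a simplification of the paper's design)."""
    def __init__(self, d_tok=4096, d_img=256):
        super().__init__()
        self.proj = nn.Linear(d_tok, d_img)  # map LMM hidden state to feature space

    def forward(self, seg_tokens, img_feats):
        # seg_tokens: (B, N_targets, d_tok); img_feats: (B, d_img, H, W)
        q = self.proj(seg_tokens)                          # (B, N, d_img)
        logits = torch.einsum("bnc,bchw->bnhw", q, img_feats)
        return logits.sigmoid()                            # one mask per target

dec = TinyPixelDecoder()
masks = dec(torch.randn(2, 3, 4096), torch.randn(2, 256, 64, 64))
print(masks.shape)  # torch.Size([2, 3, 64, 64])
```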
Generating Action-conditioned Prompts for Open-vocabulary Video Action Recognition
results: 实验表明,我们的方法不仅在多种视频基准上取得新的SOTA表现,还具有出色的可解释性。Abstract
Exploring open-vocabulary video action recognition is a promising venture, which aims to recognize previously unseen actions within any arbitrary set of categories. Existing methods typically adapt pretrained image-text models to the video domain, capitalizing on their inherent strengths in generalization. A common thread among such methods is the augmentation of visual embeddings with temporal information to improve the recognition of seen actions. Yet, they compromise with standard less-informative action descriptions, thus faltering when confronted with novel actions. Drawing inspiration from human cognitive processes, we argue that augmenting text embeddings with human prior knowledge is pivotal for open-vocabulary video action recognition. To realize this, we innovatively blend video models with Large Language Models (LLMs) to devise Action-conditioned Prompts. Specifically, we harness the knowledge in LLMs to produce a set of descriptive sentences that contain distinctive features for identifying given actions. Building upon this foundation, we further introduce a multi-modal action knowledge alignment mechanism to align concepts in video and textual knowledge encapsulated within the prompts. Extensive experiments on various video benchmarks, including zero-shot, few-shot, and base-to-novel generalization settings, demonstrate that our method not only sets new SOTA performance but also possesses excellent interpretability.
摘要
研究开放词汇视频动作识别是一个有前景的方向,其目标是在任意类别集合中识别此前未见过的动作。现有方法通常将预训练的图像-文本模型适配到视频领域,利用其固有的泛化优势。这些方法的共同思路是用时间信息增强视觉嵌入,以提高对已见动作的识别能力。然而,它们妥协于标准且信息量不足的动作描述,因此在面对新动作时表现不佳。受人类认知过程的启发,我们认为用人类先验知识增强文本嵌入,对开放词汇视频动作识别至关重要。为此,我们创新地将视频模型与大型语言模型(LLMs)结合,设计出动作条件提示(Action-conditioned Prompts)。具体而言,我们利用LLMs中的知识生成一组描述性句子,其中包含识别给定动作的区分性特征。在此基础上,我们进一步引入多模态动作知识对齐机制,对齐视频中的概念与提示中蕴含的文本知识。在多种视频基准上(包括零样本、少样本和基类到新类泛化设置)的大量实验表明,我们的方法不仅取得了新的SOTA性能,还具有出色的可解释性。
Learning Efficient Unsupervised Satellite Image-based Building Damage Detection
methods: 该方法基于预训练的视觉语言基础模型(Grounding DINO、SAM和CLIP),并提出了一种新的自监督框架U-BDD++,以解决卫星影像特有的领域问题。
results: 实验结果表明,该方法可以有效地进行无监督的建筑物损害检测,并且无需标注,大幅减少了标注工作量。Abstract
Existing Building Damage Detection (BDD) methods always require labour-intensive pixel-level annotations of buildings and their conditions, hence largely limiting their applications. In this paper, we investigate a challenging yet practical scenario of BDD, Unsupervised Building Damage Detection (U-BDD), where only unlabelled pre- and post-disaster satellite image pairs are provided. As a pilot study, we have first proposed an advanced U-BDD baseline that leverages pre-trained vision-language foundation models (i.e., Grounding DINO, SAM and CLIP) to address the U-BDD task. However, the apparent domain gap between satellite and generic images causes low confidence in the foundation models used to identify buildings and their damages. In response, we further present a novel self-supervised framework, U-BDD++, which improves upon the U-BDD baseline by addressing domain-specific issues associated with satellite imagery. Furthermore, the new Building Proposal Generation (BPG) module and the CLIP-enabled noisy Building Proposal Selection (CLIP-BPS) module in U-BDD++ ensure high-quality self-training. Extensive experiments on the widely used building damage assessment benchmark demonstrate the effectiveness of the proposed method for unsupervised building damage detection. The presented annotation-free and foundation model-based paradigm ensures an efficient learning phase. This study opens a new direction for real-world BDD and sets a strong baseline for future research.
摘要
现有的建筑物损害检测(BDD)方法总是需要劳动密集的像素级标注来描述建筑物及其状况,因此在很大程度上限制了其应用。在这篇论文中,我们研究了一个具有挑战性且实用的BDD场景,即无监督建筑物损害检测(U-BDD),其中只提供无标注的灾前、灾后卫星图像对。作为初步研究,我们首先提出了一个先进的U-BDD基线,利用预训练的视觉语言基础模型(如Grounding DINO、SAM和CLIP)来解决U-BDD任务。然而,卫星图像与通用图像之间明显的领域差距,导致这些基础模型在识别建筑物及其损害时置信度较低。为此,我们进一步提出了一个新的自监督框架U-BDD++,通过解决卫星图像特有的领域问题来改进U-BDD基线。此外,U-BDD++中新的建筑物提案生成(BPG)模块和基于CLIP的含噪建筑物提案选择(CLIP-BPS)模块,确保了高质量的自训练。在广泛使用的建筑物损害评估基准上的大量实验,证明了所提方法在无监督建筑物损害检测上的有效性。这种免标注、基于基础模型的范式确保了高效的学习阶段。这项研究为现实世界的BDD开辟了新的方向,并为未来研究设立了强有力的基线。
Survey on deep learning in multimodal medical imaging for cancer detection
for: 该论文面向癌症检测与诊断领域的研究人员和从业者,特别是关注基于深度学习的多模态癌症检测。
methods: 论文聚焦于基于深度学习的目标检测在多模态癌症检测中的应用,调研了近年来150多篇论文,探讨了数据标注、类间差异、小尺度病灶和遮挡等挑战,并概述了各类方法的优缺点。
results: 论文概述了基于深度学习的多模态癌症检测的研究现状,包括为应对该领域挑战而提出的各种数据集和解决方案,并讨论了当前的工作范围和未来的发展方向。Abstract
The task of multimodal cancer detection is to determine the locations and categories of lesions by using different imaging techniques, which is one of the key research methods for cancer diagnosis. Recently, deep learning-based object detection has made significant developments due to its strength in semantic feature extraction and nonlinear function fitting. However, multimodal cancer detection remains challenging due to morphological differences in lesions, interpatient variability, difficulty in annotation, and imaging artifacts. In this survey, we mainly investigate over 150 papers in recent years with respect to multimodal cancer detection using deep learning, with a focus on datasets and solutions to various challenges such as data annotation, variance between classes, small-scale lesions, and occlusion. We also provide an overview of the advantages and drawbacks of each approach. Finally, we discuss the current scope of work and provide directions for the future development of multimodal cancer detection.
摘要
多模态癌症检测的任务是利用不同的成像技术确定病灶的位置和类别,这是癌症诊断的重要研究方法之一。近年来,得益于在语义特征提取和非线性函数拟合方面的优势,基于深度学习的目标检测取得了显著进展。然而,由于病灶形态差异、患者间差异、标注困难和成像伪影等原因,多模态癌症检测仍然具有挑战性。在本综述中,我们主要调研了近年来150多篇关于使用深度学习进行多模态癌症检测的论文,重点关注数据集以及针对数据标注、类间差异、小尺度病灶和遮挡等各种挑战的解决方案。我们还概述了每种方法的优缺点。最后,我们讨论了当前的工作范围,并为多模态癌症检测的未来发展提供了方向。
Multi-View Person Matching and 3D Pose Estimation with Arbitrary Uncalibrated Camera Networks
results: 我们在三个公开数据集以及两个在室内和室外自然环境中采集的数据集上进行了广泛评估,结果显示我们的方法在跨视角人员匹配上大幅优于其他方法,并在不使用相机位姿或3D数据标注的情况下,在3D人体姿态估计上达到了最佳性能。Abstract
Cross-view person matching and 3D human pose estimation in multi-camera networks are particularly difficult when the cameras are extrinsically uncalibrated. Existing efforts generally require large amounts of 3D data for training neural networks or known camera poses for geometric constraints to solve the problem. However, camera poses and 3D data annotation are usually expensive and not always available. We present a method, PME, that solves the two tasks without requiring either information. Our idea is to address cross-view person matching as a clustering problem using each person as a cluster center, then obtain correspondences from person matches, and estimate 3D human poses through multi-view triangulation and bundle adjustment. We solve the clustering problem by introducing a "size constraint" using the number of cameras and a "source constraint" using the fact that two people from the same camera view should not match, to narrow the solution space to a small feasible region. The 2D human poses used in clustering are obtained through a pre-trained 2D pose detector, so our method does not require expensive 3D training data for each new scene. We extensively evaluate our method on three open datasets and two indoor and outdoor datasets collected using arbitrarily set cameras. Our method outperforms other methods by a large margin on cross-view person matching, reaches SOTA performance on 3D human pose estimation without using either camera poses or 3D training data, and shows good generalization ability across five datasets of various environment settings.
摘要
在多相机网络中,当相机外参未标定时,跨视角人员匹配和3D人体姿态估计尤其困难。现有方法通常需要大量3D数据来训练神经网络,或需要已知的相机位姿作为几何约束来解决该问题。然而,相机位姿和3D数据标注通常代价高昂,且并不总是可获得的。我们提出了一种方法PME,无需这两种信息即可解决上述两个任务。我们的思路是将跨视角人员匹配视为一个聚类问题,以每个人作为聚类中心,再从人员匹配中获得对应关系,并通过多视角三角测量和捆绑调整(bundle adjustment)来估计3D人体姿态。为了解决聚类问题,我们引入基于相机数量的“规模约束”,以及基于“同一相机视角中的两个人不应互相匹配”这一事实的“来源约束”,将解空间缩小到一个较小的可行区域。聚类所用的2D人体姿态由预训练的2D姿态检测器获得,因此我们的方法不需要为每个新场景准备昂贵的3D训练数据。我们在三个公开数据集以及两个用任意架设的相机采集的室内、室外数据集上进行了广泛评估。我们的方法在跨视角人员匹配上大幅优于其他方法,在不使用相机位姿和3D训练数据的情况下在3D人体姿态估计上达到SOTA性能,并在五个不同环境设置的数据集上表现出良好的泛化能力。
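The multi-view triangulation step mentioned above is, in its simplest form, the classic direct linear transform (DLT). The sketch below shows that standard building block on a toy two-camera example; PME additionally refines the result with bundle adjustment, which is omitted here.

```python
import numpy as np

def triangulate_dlt(proj_mats, points_2d):
    """Classic direct linear transform: recover one 3D point from its 2D
    observations in several calibrated views (a standard method, not the
    paper's full pipeline)."""
    rows = []
    for P, (u, v) in zip(proj_mats, points_2d):
        rows.append(u * P[2] - P[0])   # each view contributes two equations
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    _, _, vt = np.linalg.svd(A)        # null vector of A is the homogeneous point
    X = vt[-1]
    return X[:3] / X[3]                # dehomogenize

# toy check: a known point reprojected through two cameras
X_true = np.array([0.3, -0.2, 2.0, 1.0])
Ps = [np.hstack([np.eye(3), np.zeros((3, 1))]),
      np.hstack([np.eye(3), np.array([[0.5], [0.0], [0.0]])])]
obs = [(P @ X_true)[:2] / (P @ X_true)[2] for P in Ps]
print(np.round(triangulate_dlt(Ps, obs), 3))  # [ 0.3 -0.2  2. ]
```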
Hyperspectral Image Compression Using Sampling and Implicit Neural Representations
results: 对四个标准测试集(Indian Pines、Jasper Ridge、Pavia University 和 Cuprite)进行了评估,结果显示该方法在低比特率下的压缩效果优于JPEG、JPEG2000和PCA-DCT。此外,与各类基于学习的方法的比较也显示出更好的表现,而引入采样后在速度和性能上均优于不采样的版本。Abstract
Hyperspectral images, which record the electromagnetic spectrum for a pixel in the image of a scene, often store hundreds of channels per pixel and contain an order of magnitude more information than a similarly-sized RBG color image. Consequently, concomitant with the decreasing cost of capturing these images, there is a need to develop efficient techniques for storing, transmitting, and analyzing hyperspectral images. This paper develops a method for hyperspectral image compression using implicit neural representations where a multilayer perceptron network F with sinusoidal activation functions "learns" to map pixel locations to pixel intensities for a given hyperspectral image I. F thus acts as a compressed encoding of this image, and the original image is reconstructed by evaluating F at each pixel location. We use a sampling method with two factors: window size and sampling rate to reduce the compression time. We have evaluated our method on four benchmarks -- Indian Pines, Jasper Ridge, Pavia University, and Cuprite using PSNR and SSIM -- and we show that the proposed method achieves better compression than JPEG, JPEG2000, and PCA-DCT at low bitrates. Besides, we compare our results with the learning-based methods like PCA+JPEG2000, FPCA+JPEG2000, 3D DCT, 3D DWT+SVR, and WSRC and show the corresponding results in the "Compression Results" section. We also show that our methods with sampling achieve better speed and performance than our method without sampling.
摘要
高光谱图像记录场景图像中每个像素的电磁波谱,通常每个像素存储数百个通道,所包含的信息比同样大小的RGB彩色图像多一个数量级。因此,随着采集此类图像的成本不断下降,需要开发高效的技术来存储、传输和分析高光谱图像。本文提出了一种基于隐式神经表示的高光谱图像压缩方法:一个带正弦激活函数的多层感知机网络F“学习”将给定高光谱图像I中的像素位置映射到像素强度。于是F就成为该图像的压缩编码,通过在每个像素位置上求值F即可重建原始图像。我们使用了一种包含两个因素(窗口大小和采样率)的采样方法来减少压缩时间。我们在四个标准测试集(Indian Pines、Jasper Ridge、Pavia University 和 Cuprite)上用PSNR和SSIM进行了评估,结果显示所提方法在低比特率下的压缩效果优于JPEG、JPEG2000和PCA-DCT。此外,我们还与PCA+JPEG2000、FPCA+JPEG2000、3D DCT、3D DWT+SVR和WSRC等基于学习的方法进行了比较,并在“压缩结果”一节中给出了相应结果。我们还表明,引入采样的方法在速度和性能上均优于不采样的方法。
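The compression scheme above reduces to fitting a small sinusoidal MLP F from pixel coordinates to spectra and shipping its weights. A minimal sketch, with layer widths, the frequency factor w0, and the training budget chosen for illustration rather than taken from the paper:

```python
import torch
import torch.nn as nn

class SirenINR(nn.Module):
    """Minimal sinusoidal MLP F: (x, y) -> spectrum, in the spirit of the
    implicit representation described above (sizes are illustrative)."""
    def __init__(self, channels=200, hidden=256, w0=30.0):
        super().__init__()
        self.w0 = w0
        self.l1 = nn.Linear(2, hidden)
        self.l2 = nn.Linear(hidden, hidden)
        self.out = nn.Linear(hidden, channels)

    def forward(self, xy):
        h = torch.sin(self.w0 * self.l1(xy))
        h = torch.sin(self.w0 * self.l2(h))
        return self.out(h)

# fit F to a toy 32x32x200 "hyperspectral" cube; the trained weights of F
# *are* the compressed file, and decoding = evaluating F on the pixel grid
cube = torch.rand(32 * 32, 200)
ys, xs = torch.meshgrid(torch.linspace(-1, 1, 32), torch.linspace(-1, 1, 32),
                        indexing="ij")
coords = torch.stack([xs.flatten(), ys.flatten()], dim=-1)
model = SirenINR()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
for _ in range(200):                      # a few steps only, for illustration
    opt.zero_grad()
    loss = ((model(coords) - cube) ** 2).mean()
    loss.backward()
    opt.step()
print(float(loss))
```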
Digital Histopathology with Graph Neural Networks: Concepts and Explanations for Clinicians
results: 结合HoVer-Net与图卷积网络进行癌症预测取得了可靠的结果,在乳腺癌H&E染色切片上训练时表现出良好前景。Abstract
To address the challenge of the ``black-box" nature of deep learning in medical settings, we combine GCExplainer - an automated concept discovery solution - along with Logic Explained Networks to provide global explanations for Graph Neural Networks. We demonstrate this using a generally applicable graph construction and classification pipeline, involving panoptic segmentation with HoVer-Net and cancer prediction with Graph Convolution Networks. By training on H&E slides of breast cancer, we show promising results in offering explainable and trustworthy AI tools for clinicians.
摘要
为了解决深度学习在医疗场景中的“黑盒”问题,我们将GCExplainer(一种自动概念发现方案)与逻辑解释网络(Logic Explained Networks)相结合,为图神经网络提供全局解释。我们通过一个通用的图构建与分类流程来演示这一点,其中包括使用HoVer-Net进行全景分割,以及使用图卷积网络进行癌症预测。通过在乳腺癌H&E染色切片上训练,我们在为临床医生提供可解释、可信赖的AI工具方面取得了有前景的结果。
results: 提出一种新的预测框架ECON,具有高效的推文筛选、语义空间中的自我感知机制和多层次关系分析等优势,可以更好地预测股票市场的走势和波动。Abstract
Predicting stock market is vital for investors and policymakers, acting as a barometer of the economic health. We leverage social media data, a potent source of public sentiment, in tandem with macroeconomic indicators as government-compiled statistics, to refine stock market predictions. However, prior research using tweet data for stock market prediction faces three challenges. First, the quality of tweets varies widely. While many are filled with noise and irrelevant details, only a few genuinely mirror the actual market scenario. Second, solely focusing on the historical data of a particular stock without considering its sector can lead to oversight. Stocks within the same industry often exhibit correlated price behaviors. Lastly, simply forecasting the direction of price movement without assessing its magnitude is of limited value, as the extent of the rise or fall truly determines profitability. In this paper, diverging from the conventional methods, we pioneer an ECON. The framework has following advantages: First, ECON has an adept tweets filter that efficiently extracts and decodes the vast array of tweet data. Second, ECON discerns multi-level relationships among stocks, sectors, and macroeconomic factors through a self-aware mechanism in semantic space. Third, ECON offers enhanced accuracy in predicting substantial stock price fluctuations by capitalizing on stock price movement. We showcase the state-of-the-art performance of our proposed model using a dataset, specifically curated by us, for predicting stock market movements and volatility.
摘要
预测股市对投资者和政策制定者而言至关重要,它是经济健康状况的晴雨表。我们利用社交媒体数据这一强有力的公众情绪来源,结合政府编制的宏观经济指标,来改进股市预测。然而,先前使用推文数据进行股市预测的研究面临三大挑战。首先,推文质量参差不齐:许多推文充满噪声和无关细节,只有少数真正反映股市实际情况。其次,只关注特定股票的历史数据而不考虑其所属行业可能导致疏漏,同一行业内的股票往往表现出相关的价格行为。最后,只预测价格变动方向而不评估其幅度的价值有限,因为涨跌的幅度才真正决定盈利能力。在这篇论文中,我们不同于传统方法,首创了ECON框架。该框架具有以下优势:第一,ECON拥有高效的推文过滤器,能够高效地提取并解读海量推文数据;第二,ECON通过语义空间中的自我感知机制,辨识股票、行业和宏观经济因素之间的多层次关系;第三,ECON利用股价运动信息,提高了对股价大幅波动的预测精度。我们使用自行精心构建的数据集,展示了所提模型在预测股市走势和波动方面的SOTA性能。
CityTFT: Temporal Fusion Transformer for Urban Building Energy Modeling
paper_authors: Ting-Yu Dai, Dev Niyogi, Zoltan Nagy
for: investigating urban design and energy systems against the increasing energy demand at urban and neighborhood levels
methods: a data-driven UBEM framework that accurately models energy demands in urban environments
results: predicts heating and cooling triggers in unseen climate dynamics with an F1 score of 99.98% and a load RMSE of 13.57 kWh.Abstract
Urban Building Energy Modeling (UBEM) is an emerging method to investigate urban design and energy systems against the increasing energy demand at urban and neighborhood levels. However, current UBEM methods are mostly physic-based and time-consuming in multiple climate change scenarios. This work proposes CityTFT, a data-driven UBEM framework, to accurately model the energy demands in urban environments. With the empowerment of the underlying TFT framework and an augmented loss function, CityTFT could predict heating and cooling triggers in unseen climate dynamics with an F1 score of 99.98 \% while RMSE of loads of 13.57 kWh.
摘要
城市建筑能耗模拟(UBEM)是一种新兴方法,用于在城市和社区层面上不断增长的能源需求背景下研究城市设计与能源系统。然而,当前的UBEM方法大多基于物理模型,在多种气候变化情景下耗时较长。本工作提出了CityTFT,一个数据驱动的UBEM框架,用于准确模拟城市环境中的能源需求。借助底层的TFT框架和增强的损失函数,CityTFT能够在未见过的气候动态下预测供暖和制冷触发事件,F1分数达99.98%,负荷RMSE为13.57 kWh。
Towards General Purpose Vision Foundation Models for Medical Image Analysis: An Experimental Study of DINOv2 on Radiology Benchmarks
results: 研究发现DINOv2在分类任务中取得了有竞争力的结果,而在分割任务中表现尤为出色,与现有的医疗影像分析模型相比展现出明显优势。Abstract
The integration of deep learning systems into the medical domain has been hindered by the resource-intensive process of data annotation and the inability of these systems to generalize to different data distributions. Foundation models, which are models pre-trained on large datasets, have emerged as a solution to reduce reliance on annotated data and enhance model generalizability and robustness. DINOv2, an open-source foundation model pre-trained with self-supervised learning on 142 million curated natural images, excels in extracting general-purpose visual representations, exhibiting promising capabilities across various vision tasks. Nevertheless, a critical question remains unanswered regarding DINOv2's adaptability to radiological imaging, and the clarity on whether its features are sufficiently general to benefit radiology image analysis is yet to be established. Therefore, this study comprehensively evaluates DINOv2 for radiology, conducting over 100 experiments across diverse modalities (X-ray, CT, and MRI). Tasks include disease classification and organ segmentation on both 2D and 3D images, evaluated under different settings like kNN, few-shot learning, linear-probing, end-to-end fine-tuning, and parameter-efficient fine-tuning, to measure the effectiveness and generalizability of the DINOv2 feature embeddings. Comparative analyses with established medical image analysis models, U-Net and TransUnet for segmentation, and CNN and ViT models pre-trained via supervised, weakly supervised, and self-supervised learning for classification, reveal DINOv2's superior performance in segmentation tasks and competitive results in disease classification. The findings contribute insights to potential avenues for optimizing pre-training strategies for medical imaging and enhancing the broader understanding of DINOv2's role in bridging the gap between natural and radiological image analysis.
摘要
深度学习系统在医学领域的整合,一直受制于资源密集的数据标注过程,以及这些系统难以泛化到不同数据分布的问题。基础模型,即在大规模数据集上预训练的模型,已成为减少对标注数据依赖、增强模型泛化性和稳健性的一种解决方案。DINOv2是一个开源基础模型,通过自监督学习在1.42亿张精选自然图像上预训练,擅长提取通用视觉表示,在多种视觉任务中展现出可观的能力。然而,DINOv2对放射影像的适应性仍是一个悬而未决的关键问题,其特征是否足够通用、能否惠及放射影像分析也尚无定论。因此,本研究对DINOv2在放射学中的应用进行了全面评估,在多种模态(X光、CT和MRI)上开展了100多个实验。任务包括2D和3D图像上的疾病分类和器官分割,并在kNN、少样本学习、线性探测、端到端微调和参数高效微调等不同设置下进行评估,以衡量DINOv2特征嵌入的有效性和泛化性。与成熟的医学影像分析模型的比较分析(分割任务对比U-Net和TransUnet,分类任务对比经监督、弱监督和自监督学习预训练的CNN和ViT模型)表明,DINOv2在分割任务中表现更优,在疾病分类中也取得了有竞争力的结果。这些发现为优化医学影像预训练策略提供了思路,并加深了对DINOv2在衔接自然图像与放射影像分析方面作用的理解。
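The linear-probing setting evaluated above can be reproduced in a few lines: freeze the backbone and train only a linear head on its features. A sketch assuming network access to the public facebookresearch/dinov2 torch.hub entry point; the two-class head and optimizer settings are illustrative, not the study's configuration.

```python
import torch
import torch.nn as nn

# Frozen DINOv2 backbone + linear head: the "linear-probing" setting.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad_(False)

head = nn.Linear(384, 2)                       # ViT-S/14 width -> 2 classes
opt = torch.optim.AdamW(head.parameters(), lr=1e-3)

def probe_step(images, labels):
    """One training step on (B, 3, 224, 224) images; only the head learns."""
    with torch.no_grad():
        feats = backbone(images)               # (B, 384) CLS features
    loss = nn.functional.cross_entropy(head(feats), labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

print(probe_step(torch.randn(4, 3, 224, 224), torch.tensor([0, 1, 0, 1])))
```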
Class-Discriminative Attention Maps for Vision Transformers
paper_authors: Lennart Brocki, Neo Christopher Chung
for: 本文旨在提出一种名为类别判别注意力图(CDAM)的新型事后解释方法,用于解释视觉Transformer(ViT)模型的预测。
methods: 本文使用自监督学习(SSL)方法训练ViT模型,并引入CDAM来解释模型的预测。CDAM根据各token对分类头预测的相关程度来缩放注意力分数。
results: 本文表明CDAM具有很强的类别判别性和语义相关性,并对相关性分数起到隐式正则化作用。结果还表明,在类别判别性和语义相关性方面,CDAM优于相关性传播(RP)和token消融图(TAM)等替代解释方法。Abstract
Interpretability methods are critical components for examining and exploring deep neural networks (DNN), as well as increasing our understanding of and trust in them. Vision transformers (ViT), which can be trained to state-of-the-art performance with a self-supervised learning (SSL) training method, provide built-in attention maps (AM). While AMs can provide high-quality semantic segmentation of input images, they do not account for any signal coming from a downstream classifier. We introduce class-discriminative attention maps (CDAM), a novel post-hoc explanation method that is highly sensitive to the target class. Our method essentially scales attention scores by how relevant the corresponding tokens are for the predictions of a classifier head. Alternative to classifier outputs, CDAM can also explain a user-defined concept by targeting similarity measures in the latent space of the ViT. This allows for explanations of arbitrary concepts, defined by the user through a few sample images. We investigate the operating characteristics of CDAM in comparison with relevance propagation (RP) and token ablation maps (TAM), an alternative to pixel occlusion methods. CDAM is highly class-discriminative and semantically relevant, while providing implicit regularization of relevance scores. PyTorch implementation: \url{https://github.com/lenbrocki/CDAM} Web live demo: \url{https://cdam.informatism.com/}
摘要
可解释性方法是审视和探索深度神经网络(DNN)、增进我们对其理解与信任的关键组件。视觉Transformer(ViT)可以通过自监督学习(SSL)训练方法达到最先进的性能,并自带注意力图(AM)。虽然AM能为输入图像提供高质量的语义分割,但它并不考虑来自下游分类器的任何信号。我们提出了类别判别注意力图(CDAM),一种对目标类别高度敏感的新型事后解释方法。该方法本质上是根据相应token对分类头预测的相关程度来缩放注意力分数。除分类器输出外,CDAM还可以通过在ViT的潜在空间中以相似性度量为目标,来解释用户自定义的概念。这使得用户只需提供少量示例图像,即可对任意概念进行解释。我们将CDAM与相关性传播(RP)和token消融图(TAM,一种像素遮挡方法的替代方案)进行对比,考察其运行特性。结果表明,CDAM具有很强的类别判别性和语义相关性,同时对相关性分数起到隐式正则化作用。PyTorch实现:\url{https://github.com/lenbrocki/CDAM};在线演示:\url{https://cdam.informatism.com/}
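A hedged approximation of the scaling idea: weight the CLS-to-patch attention by a gradient-based relevance of each token for one class logit. The mean-pooling classifier head and grad-times-input relevance below are our simplifications; the linked repository contains the authors' exact formulation.

```python
import torch

def cdam_sketch(tokens, attn_cls, classifier, target_class):
    """Scale attention scores by each token's gradient-based relevance to a
    chosen class logit (an approximation of CDAM, not the released code)."""
    tokens = tokens.detach().requires_grad_(True)
    logit = classifier(tokens.mean(dim=0))[target_class]   # pooled prediction
    logit.backward()
    relevance = (tokens * tokens.grad).sum(dim=-1)         # grad x input, per token
    return attn_cls * relevance                            # rescaled attention map

tok = torch.randn(196, 384)                 # patch tokens from a ViT layer
attn = torch.rand(196)                      # CLS-to-patch attention scores
clf = torch.nn.Linear(384, 10)
print(cdam_sketch(tok, attn, clf, target_class=3).shape)  # torch.Size([196])
```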
results: 研究发现,展示同伴的视觉注意区域可以提高学生的专注度和参与度,同时学生仍保持自主性,能够自行选择是否跟随同伴的注意线索。总体而言,引导式同伴注意力改善了学习体验和学习成果。这些发现有助于设计利用同伴注意力建模的自适应在线学习干预,以优化学生的专注度和学习成效。Abstract
Human visual attention is susceptible to social influences. In education, peer effects impact student learning, but their precise role in modulating attention remains unclear. Our experiment (N=311) demonstrates that displaying peer visual attention regions when students watch online course videos enhances their focus and engagement. However, students retain adaptability in following peer attention cues. Overall, guided peer attention improves learning experiences and outcomes. These findings elucidate how peer visual attention shapes students' gaze patterns, deepening understanding of peer influence on learning. They also offer insights into designing adaptive online learning interventions leveraging peer attention modelling to optimize student attentiveness and success.
摘要
人类的视觉注意易受社会因素影响。在教育中,同伴效应会影响学生的学习,但其在调节注意力方面的确切作用仍不清楚。我们的实验(N=311)表明,在学生观看在线课程视频时展示同伴的视觉注意区域,可以提高他们的专注度和参与度。与此同时,学生在跟随同伴注意线索方面仍保持自主性。总体而言,引导式同伴注意力改善了学习体验和学习成果。这些发现阐明了同伴视觉注意如何塑造学生的注视模式,加深了对同伴影响学习的理解,也为设计利用同伴注意力建模的自适应在线学习干预、优化学生专注度和学习成效提供了启示。
When is Offline Policy Selection Sample Efficient for Reinforcement Learning?
results: 本文首先给出一个困难性结果:在最坏情况下,OPS与OPE同样困难(通过将OPE归约到OPS来证明),因此任何OPS方法在最坏情况下都不可能比OPE更具样本效率。随后,提出了一种用于OPS的BE方法,称为可识别BE选择(IBES),它有一种直接的方式来选择自身的超参数。最后,通过实验对比了OPE与IBES,并展示了在离线Atari基准数据集上进行OPS的困难性。Abstract
Offline reinforcement learning algorithms often require careful hyperparameter tuning. Consequently, before deployment, we need to select amongst a set of candidate policies. As yet, however, there is little understanding about the fundamental limits of this offline policy selection (OPS) problem. In this work we aim to provide clarity on when sample efficient OPS is possible, primarily by connecting OPS to off-policy policy evaluation (OPE) and Bellman error (BE) estimation. We first show a hardness result, that in the worst case, OPS is just as hard as OPE, by proving a reduction of OPE to OPS. As a result, no OPS method can be more sample efficient than OPE in the worst case. We then propose a BE method for OPS, called Identifiable BE Selection (IBES), that has a straightforward method for selecting its own hyperparameters. We highlight that using IBES for OPS generally has more requirements than OPE methods, but if satisfied, can be more sample efficient. We conclude with an empirical study comparing OPE and IBES, and by showing the difficulty of OPS on an offline Atari benchmark dataset.
摘要
离线强化学习算法往往需要仔细的超参数调优。因此,在部署之前,我们需要从一组候选策略中进行选择。然而,目前对这一离线策略选择(OPS)问题的基本极限还知之甚少。在这项工作中,我们旨在阐明何时可以实现样本高效的OPS,主要通过将OPS与离线策略评估(OPE)和贝尔曼误差(BE)估计联系起来。我们首先给出一个困难性结果:通过将OPE归约到OPS,证明在最坏情况下OPS与OPE同样困难。因此,在最坏情况下,任何OPS方法都不可能比OPE更具样本效率。随后,我们提出了一种用于OPS的BE方法,称为可识别BE选择(IBES),它有一种直接的方式来选择自身的超参数。我们指出,将IBES用于OPS通常比OPE方法有更多的前提要求,但若这些要求得到满足,则可以更具样本效率。最后,我们通过实验对比了OPE与IBES,并展示了在离线Atari基准数据集上进行OPS的困难性。
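For concreteness, the Bellman error (BE) quantity underlying IBES-style selection can be estimated on an offline dataset of (s, a, r, s') tuples. The sketch below uses the naive squared empirical BE, which is biased in stochastic environments (the double-sampling issue); the paper's estimator is more careful than this illustration.

```python
import numpy as np

def empirical_bellman_error(q, transitions, gamma=0.99):
    """Mean squared empirical Bellman error of a candidate Q-table on an
    offline dataset of (s, a, r, s') tuples -- the kind of BE quantity that
    BE-based policy selection scores candidates with (illustrative only)."""
    errs = []
    for s, a, r, s2 in transitions:
        target = r + gamma * q[s2].max()      # one-step bootstrapped target
        errs.append((q[s, a] - target) ** 2)
    return float(np.mean(errs))

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 2))                   # 5 states, 2 actions
data = [(rng.integers(5), rng.integers(2), rng.normal(), rng.integers(5))
        for _ in range(100)]
# rank candidate Q-functions (hence policies) by their estimated BE
print(empirical_bellman_error(q, data))
```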
paper_authors: Oliver Limoyo, Abhisek Konar, Trevor Ablett, Jonathan Kelly, Francois R. Hogan, Gregory Dudek
for: autonomously collecting demonstrations for a family of placing tasks
methods: using a combination of tactile sensing and compliant control for grasps, and training a policy directly from visual observations through behavior cloning
results: outperforming policies trained with kinesthetic teaching in terms of performance and data efficiency, while requiring no human supervision.Abstract
We present Learning to Place by Picking (LPP), a method capable of autonomously collecting demonstrations for a family of placing tasks in which objects must be manipulated to specific locations. With LPP, we approach the learning of robotic object placement policies by reversing the grasping process and exploiting the inherent symmetry of the pick and place problems. Specifically, we obtain placing demonstrations from a set of grasp sequences of objects that are initially located at their target placement locations. Our system is capable of collecting hundreds of demonstrations without human intervention by using a combination of tactile sensing and compliant control for grasps. We train a policy directly from visual observations through behaviour cloning, using the autonomously-collected demonstrations. By doing so, the policy can generalize to object placement scenarios outside of the training environment without privileged information (e.g., placing a plate picked up from a table and not at the original placement location). We validate our approach on home robotic scenarios that include dishwasher loading and table setting. Our approach yields robotic placing policies that outperform policies trained with kinesthetic teaching, both in terms of performance and data efficiency, while requiring no human supervision.
摘要
我们提出了“通过抓取学习放置”(LPP)方法,它能够为一类放置任务自主收集演示,这类任务要求将物体操纵到特定位置。借助LPP,我们通过反转抓取过程并利用抓取与放置问题固有的对称性来学习机器人物体放置策略。具体而言,我们从一组初始就位于目标放置位置的物体的抓取序列中获得放置演示。我们的系统结合触觉感知与柔顺控制来完成抓取,无需人工干预即可收集数百个演示。我们利用自主收集的演示,通过行为克隆直接从视觉观测中训练策略。如此训练出的策略能够在没有特权信息的情况下泛化到训练环境之外的物体放置场景(例如,将从桌上拿起的盘子放置到非原始位置)。我们在包括装载洗碗机和摆放餐具在内的家用机器人场景中验证了该方法。我们的方法得到的放置策略在性能和数据效率上均优于用动觉示教训练的策略,且无需人工监督。
Expressive Sign Equivariant Networks for Spectral Geometric Learning
results: 受控的合成实验表明,我们的网络能够实现理论上预测的符号等变模型的优势。Abstract
Recent work has shown the utility of developing machine learning models that respect the structure and symmetries of eigenvectors. These works promote sign invariance, since for any eigenvector v the negation -v is also an eigenvector. However, we show that sign invariance is theoretically limited for tasks such as building orthogonally equivariant models and learning node positional encodings for link prediction in graphs. In this work, we demonstrate the benefits of sign equivariance for these tasks. To obtain these benefits, we develop novel sign equivariant neural network architectures. Our models are based on a new analytic characterization of sign equivariant polynomials and thus inherit provable expressiveness properties. Controlled synthetic experiments show that our networks can achieve the theoretically predicted benefits of sign equivariant models. Code is available at https://github.com/cptq/Sign-Equivariant-Nets.
摘要
近期工作表明,构建尊重特征向量结构与对称性的机器学习模型很有价值。这些工作提倡符号不变性,因为对于任何特征向量v,其取反-v也是特征向量。然而,我们证明,对于构建正交等变模型、以及为图中链接预测学习节点位置编码等任务,符号不变性在理论上是受限的。在这项工作中,我们展示了符号等变性对这些任务的好处。为了获得这些好处,我们开发了新颖的符号等变神经网络架构。我们的模型基于对符号等变多项式的一种新的解析刻画,因而继承了可证明的表达能力性质。受控的合成实验表明,我们的网络能够实现理论上预测的符号等变模型的优势。代码见 https://github.com/cptq/Sign-Equivariant-Nets。
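One concrete way to see sign equivariance, f(-v) = -f(v): any bias-free MLP with odd activations is sign equivariant, since linear maps without bias and odd nonlinearities both commute with negation. This simple construction is only an illustration consistent with the odd-polynomial characterization above, not the paper's architecture.

```python
import torch
import torch.nn as nn

# A bias-free MLP with an odd activation (tanh) satisfies f(-v) = -f(v):
# Linear(bias=False) gives L(-v) = -L(v), and tanh(-x) = -tanh(x), so the
# composition is odd, i.e. sign equivariant.
f = nn.Sequential(nn.Linear(8, 16, bias=False), nn.Tanh(),
                  nn.Linear(16, 8, bias=False))

v = torch.randn(8)
print(torch.allclose(f(-v), -f(v), atol=1e-6))  # True
```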
A Contrastive Compositional Benchmark for Text-to-Image Synthesis: A Study with Unified Text-to-Image Fidelity Metrics
results: 将 Winoground-T2I 用于双重目的,同时评估了 T2I 模型的性能及其评估指标的可靠性。研究揭示了这些指标的优缺点,以及当前 T2I 模型在应对各类复杂组合类别挑战时的能力,并提供了有价值的改进方向。Abstract
Text-to-image (T2I) synthesis has recently achieved significant advancements. However, challenges remain in the model's compositionality, which is the ability to create new combinations from known components. We introduce Winoground-T2I, a benchmark designed to evaluate the compositionality of T2I models. This benchmark includes 11K complex, high-quality contrastive sentence pairs spanning 20 categories. These contrastive sentence pairs with subtle differences enable fine-grained evaluations of T2I synthesis models. Additionally, to address the inconsistency across different metrics, we propose a strategy that evaluates the reliability of various metrics by using comparative sentence pairs. We use Winoground-T2I with a dual objective: to evaluate the performance of T2I models and the metrics used for their evaluation. Finally, we provide insights into the strengths and weaknesses of these metrics and the capabilities of current T2I models in tackling challenges across a range of complex compositional categories. Our benchmark is publicly available at https://github.com/zhuxiangru/Winoground-T2I .
An Evaluation Framework for Mapping News Headlines to Event Classes in a Knowledge Graph
paper_authors: Steve Fonin Mbouadeu, Martin Lorenzo, Ken Barker, Oktie Hassanzadeh
for: 这篇论文的目的是创建一个基准数据集,用于将新闻标题映射到事件类别中。
methods: 这篇论文使用的方法包括对经典实体链接方法的改编,以及将该问题视为零样本文本分类问题。
results: 评估结果显示,改编后的实体链接方法可以取得较好的表现,而零样本文本分类方法则需要更多的训练数据。Abstract
Mapping ongoing news headlines to event-related classes in a rich knowledge base can be an important component in a knowledge-based event analysis and forecasting solution. In this paper, we present a methodology for creating a benchmark dataset of news headlines mapped to event classes in Wikidata, and resources for the evaluation of methods that perform the mapping. We use the dataset to study two classes of unsupervised methods for this task: 1) adaptations of classic entity linking methods, and 2) methods that treat the problem as a zero-shot text classification problem. For the first approach, we evaluate off-the-shelf entity linking systems. For the second approach, we explore a) pre-trained natural language inference (NLI) models, and b) pre-trained large generative language models. We present the results of our evaluation, lessons learned, and directions for future work. The dataset and scripts for evaluation are made publicly available.
摘要
将持续更新的新闻标题映射到丰富知识库中与事件相关的类别,可以成为基于知识的事件分析与预测方案的重要组成部分。在这篇论文中,我们提出了一种方法论,用于创建一个将新闻标题映射到Wikidata事件类别的基准数据集,并提供了用于评估执行该映射的方法的资源。我们使用该数据集研究了两类无监督方法:1)对经典实体链接方法的改编;2)将该问题视为零样本文本分类问题的方法。对于第一类方法,我们评估了现成的实体链接系统;对于第二类方法,我们探索了a)预训练的自然语言推理(NLI)模型,以及b)预训练的大型生成式语言模型。我们给出了评估结果、经验教训以及未来工作的方向。数据集和评估脚本均已公开。
paper_authors: Daewon Chae, Nokyung Park, Jinkyu Kim, Kimin Lee
for: 本文旨在提高个性化文本到图像模型的图文对齐能力,使其能够更好地满足用户需求。
methods: 本文提出了一种名为InstructBooth的新方法:首先利用少量特定主体的图像并借助唯一标识符来个性化文本到图像模型,然后通过强化学习进行微调,以最大化衡量图文对齐程度的奖励。
results: 与基线方法相比,InstructBooth展现出更强的图文对齐能力,同时保持了个性化能力。在人工评估中,综合考虑所有因素时,InstructBooth优于DreamBooth。Abstract
Personalizing text-to-image models using a limited set of images for a specific object has been explored in subject-specific image generation. However, existing methods often encounter challenges in aligning with text prompts due to overfitting to the limited training images. In this work, we introduce InstructBooth, a novel method designed to enhance image-text alignment in personalized text-to-image models. Our approach first personalizes text-to-image models with a small number of subject-specific images using a unique identifier. After personalization, we fine-tune personalized text-to-image models using reinforcement learning to maximize a reward that quantifies image-text alignment. Additionally, we propose complementary techniques to increase the synergy between these two processes. Our method demonstrates superior image-text alignment compared to baselines while maintaining personalization ability. In human evaluations, InstructBooth outperforms DreamBooth when considering all comprehensive factors.
GNN2R: Weakly-Supervised Rationale-Providing Question Answering over Knowledge Graphs
paper_authors: Ruijie Wang, Luca Rossetto, Michael Cochez, Abraham Bernstein
for: 本文旨在实现知识图上可提供解释的多跳问答,这在用户需要理解答案背后推理过程的现实场景中非常有用。
methods: 所提出的基于图神经网络的两步推理模型(GNN2R)利用“问题-最终答案”对提供的弱监督,高效地生成最终答案和作为依据的推理子图。
results: 结果显示,GNN2R在有效性、效率以及所生成解释的质量方面均优于现有的最先进方法。Abstract
Most current methods for multi-hop question answering (QA) over knowledge graphs (KGs) only provide final conclusive answers without explanations, such as a set of KG entities that is difficult for normal users to review and comprehend. This issue severely limits the application of KG-based QA in real-world scenarios. However, it is non-trivial to solve due to two challenges: First, annotations of reasoning chains of multi-hop questions, which could serve as supervision for explanation generation, are usually lacking. Second, it is difficult to maintain high efficiency when explicit KG triples need to be retrieved to generate explanations. In this paper, we propose a novel Graph Neural Network-based Two-Step Reasoning model (GNN2R) to solve this issue. GNN2R can provide both final answers and reasoning subgraphs as a rationale behind final answers efficiently with only weak supervision that is available through question-final answer pairs. We extensively evaluated GNN2R with detailed analyses in experiments. The results demonstrate that, in terms of effectiveness, efficiency, and quality of generated explanations, GNN2R outperforms existing state-of-the-art methods that are applicable to this task. Our code and pre-trained models are available at https://github.com/ruijie-wang-uzh/GNN2R.
摘要
当前大多数知识图(KG)多跳问答(QA)方法只给出最终结论性的答案而不提供解释,例如给出一组普通用户难以查看和理解的KG实体。这一问题严重限制了基于KG的QA在现实场景中的应用。然而,由于两大挑战,该问题并不容易解决:第一,多跳问题的推理链标注通常缺失,而它们本可以作为解释生成的监督信号;第二,当需要检索显式的KG三元组来生成解释时,难以保持高效率。在本文中,我们提出了一种新颖的基于图神经网络的两步推理模型(GNN2R)来解决这一问题。GNN2R仅依靠“问题-最终答案”对所提供的弱监督,即可高效地给出最终答案以及作为答案依据的推理子图。我们在实验中对GNN2R进行了广泛评估和详细分析。结果表明,在有效性、效率以及所生成解释的质量方面,GNN2R优于适用于该任务的现有最先进方法。我们的代码和预训练模型可在 https://github.com/ruijie-wang-uzh/GNN2R 获取。
Fine-tuning pre-trained extractive QA models for clinical document parsing
paper_authors: Ashwyn Sharma, David I. Feldman, Aneesh Jain
for: 这篇论文旨在开发一个能够自动解析电子健康记录(EHRs)中非结构化临床文档的系统,将其转化为适合实时和回顾性分析的数据,以便更好地跟踪和管理心力衰竭(HF)患者。
Electronic health records (EHRs) contain a vast amount of high-dimensional multi-modal data that can accurately represent a patient's medical history. Unfortunately, most of this data is either unstructured or semi-structured, rendering it unsuitable for real-time and retrospective analyses. A remote patient monitoring (RPM) program for Heart Failure (HF) patients needs to have access to clinical markers like EF (Ejection Fraction) or LVEF (Left Ventricular Ejection Fraction) in order to ascertain eligibility and appropriateness for the program. This paper explains a system that can parse echocardiogram reports and verify EF values. This system helps identify eligible HF patients who can be enrolled in such a program. At the heart of this system is a pre-trained extractive QA transformer model that is fine-tuned on custom-labeled data. The methods used to prepare such a model for deployment are illustrated by running experiments on a public clinical dataset like MIMIC-IV-Note. The pipeline can be used to generalize solutions to similar problems in a low-resource setting. We found that the system saved over 1500 hours for our clinicians over 12 months by automating the task at scale.
摘要
电子健康记录(EHRs)包含大量高维多Modal数据,可以准确反映患者的医疗历史。然而,大多数这些数据都是无结构或半结构化,使其不适合实时和回顾分析。一个远程患者监测(RPM)Program for Heart Failure(HF)患者需要访问临床标志如EF(脱出率)或LVEF(左心室脱出率),以确定参与计划的适应性。本文介绍一种系统,可以解析echo声报告并验证EF值。这种系统可以识别适合参与RPM计划的HF患者。系统的核心是一个已经预训练的提取式QA变换模型,通过自定义标注数据进行微调。我们使用MIMIC-IV-Note公共医疗数据集进行实验,演示了如何准备这种模型 для部署。这种管道可以普遍化到类似问题的低资源环境中。我们发现,这种系统在12个月内将 clinicians 的工作时间减少了超过1500小时。
Visual Encoders for Data-Efficient Imitation Learning in Modern Video Games
results: 对 Minecraft、Minecraft Dungeons 和 Counter-Strike: Global Offensive 等现代游戏进行比较,研究使用公开available的视觉编码器是否可以具备适用于sequential decision making的表示学习。Abstract
Video games have served as useful benchmarks for the decision making community, but going beyond Atari games towards training agents in modern games has been prohibitively expensive for the vast majority of the research community. Recent progress in the research, development and open release of large vision models has the potential to amortize some of these costs across the community. However, it is currently unclear which of these models have learnt representations that retain information critical for sequential decision making. Towards enabling wider participation in the research of gameplaying agents in modern games, we present a systematic study of imitation learning with publicly available visual encoders compared to the typical, task-specific, end-to-end training approach in Minecraft, Minecraft Dungeons and Counter-Strike: Global Offensive.
摘要
видео游戏已经作为决策ommunity的标准 benchmark,但是向现代游戏中训练代理人的成本是大多数研究community的禁制的。现在,随着大视力模型的研究、开发和公共发布,这些成本可能会在社区中卷积。然而,目前还不清楚哪些模型学习了保留Sequential Decision Making中重要信息的表示。为推广现代游戏中Gameplaying Agent的研究,我们展示了使用公共可用的视觉编码器进行仿制学习的系统性研究,与常见的终端到终点的端到端培训方法在 Minecraft、 Minecraft Dungeons 和Counter-Strike: Global Offensive 中进行比较。
VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding
results: 我们的实验结果表明,VaQuitA在零代码视频问答任务中成功设置新的 bencmark,并且能够生成高质量、多turn的视频对话。Abstract
Recent advancements in language-model-based video understanding have been progressing at a remarkable pace, spurred by the introduction of Large Language Models (LLMs). However, the focus of prior research has been predominantly on devising a projection layer that maps video features to tokens, an approach that is both rudimentary and inefficient. In our study, we introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information. At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings, which enables a more aligned selection of frames with the given question. At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer (abbreviated as VQ-Former), which bolsters the interplay between the input question and the video features. We also discover that incorporating a simple prompt, "Please be critical", into the LLM input can substantially enhance its video comprehension capabilities. Our experimental results indicate that VaQuitA consistently sets a new benchmark for zero-shot video question-answering tasks and is adept at producing high-quality, multi-turn video dialogues with users.
摘要
At the data level, instead of uniformly sampling frames, we implement a sampling method guided by CLIP-score rankings, which enables a more aligned selection of frames with the given question. At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer (abbreviated as VQ-Former), which bolsters the interplay between the input question and the video features.Furthermore, we discover that incorporating a simple prompt, "Please be critical", into the LLM input can substantially enhance its video comprehension capabilities. Our experimental results indicate that VaQuitA consistently sets a new benchmark for zero-shot video question-answering tasks and is adept at producing high-quality, multi-turn video dialogues with users.
Training Reinforcement Learning Agents and Humans With Difficulty-Conditioned Generators
paper_authors: Sidney Tio, Jimmy Ho, Pradeep Varakantham
for: 该论文提出一种在参数化环境中同时训练强化学习智能体和人类学习者的方法。
methods: 该论文使用参数化环境响应模型(PERM),受项目反应理论(IRT)启发,直接对环境难度和个体能力建模,从而构建基于“最近发展区”的课程。
results: 该论文通过一项实证研究表明,利用PERM的两阶段训练过程能够有效地训练强化学习智能体和人类,且无需实时强化学习更新,适用于不同的学生。Abstract
We adapt Parameterized Environment Response Model (PERM), a method for training both Reinforcement Learning (RL) Agents and human learners in parameterized environments by directly modeling difficulty and ability. Inspired by Item Response Theory (IRT), PERM aligns environment difficulty with individual ability, creating a Zone of Proximal Development-based curriculum. Remarkably, PERM operates without real-time RL updates and allows for offline training, ensuring its adaptability across diverse students. We present a two-stage training process that capitalizes on PERM's adaptability, and demonstrate its effectiveness in training RL agents and humans in an empirical study.
摘要
我们采用参数化环境响应模型(PERM),这是一种通过直接对难度和能力建模,在参数化环境中同时训练强化学习(RL)智能体和人类学习者的方法。受项目反应理论(IRT)的启发,PERM使环境难度与个体能力相匹配,构建出基于“最近发展区”的课程。值得注意的是,PERM无需实时的RL更新,支持离线训练,从而能够适应不同的学生。我们提出了一个充分利用PERM适应性的两阶段训练过程,并在一项实证研究中证明了其在训练RL智能体和人类方面的有效性。
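PERM builds on an Item Response Theory view in which success probability depends on the gap between ability and difficulty. Below is a minimal sketch of the 1-parameter (Rasch) model and a toy zone-of-proximal-development curriculum rule derived by inverting it; the target success rate of 0.7 is our illustrative choice, not the paper's.

```python
import numpy as np

def p_success(ability, difficulty):
    """Rasch (1PL) response model: probability that a learner of the given
    ability succeeds in an environment of the given difficulty."""
    return 1.0 / (1.0 + np.exp(-(ability - difficulty)))

def next_difficulty(ability, target_p=0.7):
    """Pick the difficulty whose predicted success rate hits a
    'zone of proximal development' target, by inverting the logistic."""
    return ability - np.log(target_p / (1.0 - target_p))

theta = 0.5                                   # current ability estimate
d = next_difficulty(theta)
print(round(d, 3), round(p_success(theta, d), 3))   # difficulty with p = 0.7
```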
AdsorbRL: Deep Multi-Objective Reinforcement Learning for Inverse Catalysts Design
results: 研究发现,引入随机边遍历(Random Edge Traversal)并在已知状态子图上训练单目标DQN智能体,可使目标吸附能平均提升4.1 eV;在多目标设置下,借助目标子采样(Objective Sub-Sampling)训练方案,所有目标吸附物的吸附能同时得到改善,平均提升0.8 eV。这些结果表明深度强化学习可以有效应用于逆向催化剂设计问题。Abstract
A central challenge of the clean energy transition is the development of catalysts for low-emissions technologies. Recent advances in Machine Learning for quantum chemistry drastically accelerate the computation of catalytic activity descriptors such as adsorption energies. Here we introduce AdsorbRL, a Deep Reinforcement Learning agent aiming to identify potential catalysts given a multi-objective binding energy target, trained using offline learning on the Open Catalyst 2020 and Materials Project data sets. We experiment with Deep Q-Network agents to traverse the space of all ~160,000 possible unary, binary and ternary compounds of 55 chemical elements, with very sparse rewards based on adsorption energy known for only between 2,000 and 3,000 catalysts per adsorbate. To constrain the actions space, we introduce Random Edge Traversal and train a single-objective DQN agent on the known states subgraph, which we find strengthens target binding energy by an average of 4.1 eV. We extend this approach to multi-objective, goal-conditioned learning, and train a DQN agent to identify materials with the highest (respectively lowest) adsorption energies for multiple simultaneous target adsorbates. We experiment with Objective Sub-Sampling, a novel training scheme aimed at encouraging exploration in the multi-objective setup, and demonstrate simultaneous adsorption energy improvement across all target adsorbates, by an average of 0.8 eV. Overall, our results suggest strong potential for Deep Reinforcement Learning applied to the inverse catalysts design problem.
摘要
清洁能源转型的一个核心挑战是为低排放技术开发催化剂。近期机器学习在量子化学中的进展,大幅加速了吸附能等催化活性描述符的计算。在此,我们提出AdsorbRL,一种深度强化学习智能体,旨在给定多目标结合能目标的情况下识别潜在催化剂,并利用Open Catalyst 2020和Materials Project数据集进行离线学习训练。我们用深度Q网络(DQN)智能体遍历由55种化学元素构成的约16万种一元、二元和三元化合物的空间,其奖励非常稀疏:每种吸附物仅有约2000到3000种催化剂的吸附能是已知的。为了约束动作空间,我们引入随机边遍历,并在已知状态子图上训练单目标DQN智能体,发现这可使目标结合能平均提升4.1 eV。我们将该方法扩展到多目标、目标条件化的学习,训练DQN智能体识别对多个同时给定的目标吸附物具有最高(或最低)吸附能的材料。我们实验了一种旨在鼓励多目标设置下探索的新训练方案——目标子采样,并展示了所有目标吸附物的吸附能同时得到改善,平均提升0.8 eV。总体而言,我们的结果表明深度强化学习在逆向催化剂设计问题上潜力巨大。
LineConGraphs: Line Conversation Graphs for Effective Emotion Recognition using Graph Neural Networks
results: 在两个基准数据集(IEMOCAP和MELD)上的测试表明,所提出的LineConGAT模型在情绪识别上优于现有方法,F1分数分别为64.58%和76.50%。此外,将情感变化信息嵌入线性对话图可进一步提升GCN模型的ERC性能。Abstract
Emotion Recognition in Conversations (ERC) is a critical aspect of affective computing, and it has many practical applications in healthcare, education, chatbots, and social media platforms. Earlier approaches for ERC analysis involved modeling both speaker and long-term contextual information using graph neural network architectures. However, it is ideal to deploy speaker-independent models for real-world applications. Additionally, long context windows can potentially create confusion in recognizing the emotion of an utterance in a conversation. To overcome these limitations, we propose novel line conversation graph convolutional network (LineConGCN) and graph attention (LineConGAT) models for ERC analysis. These models are speaker-independent and built using a graph construction strategy for conversations -- line conversation graphs (LineConGraphs). The conversational context in LineConGraphs is short-term -- limited to one previous and future utterance, and speaker information is not part of the graph. We evaluate the performance of our proposed models on two benchmark datasets, IEMOCAP and MELD, and show that our LineConGAT model outperforms the state-of-the-art methods with an F1-score of 64.58% and 76.50%. Moreover, we demonstrate that embedding sentiment shift information into line conversation graphs further enhances the ERC performance in the case of GCN models.
摘要
对话情绪识别(ERC)是情感计算的一个关键方面,在医疗、教育、聊天机器人和社交媒体平台上有许多实际应用。早期的ERC分析方法使用图神经网络架构同时建模说话人信息和长程上下文信息。然而,在现实应用中,理想情况是部署与说话人无关的模型;此外,过长的上下文窗口可能会给识别对话中某句话的情绪带来混淆。为克服这些限制,我们提出了新颖的线性对话图卷积网络(LineConGCN)和图注意力(LineConGAT)模型用于ERC分析。这些模型与说话人无关,基于一种面向对话的图构建策略——线性对话图(LineConGraphs)。LineConGraphs中的对话上下文是短程的,仅限于前一句和后一句话,且说话人信息不包含在图中。我们在两个基准数据集IEMOCAP和MELD上评估了所提模型的性能,结果显示LineConGAT模型以64.58%和76.50%的F1分数超越了现有最先进方法。此外,我们还证明,将情感变化信息嵌入线性对话图可以进一步提升GCN模型的ERC性能。
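The LineConGraph construction itself is simple: utterances are nodes and edges link only immediate neighbours, with no speaker nodes. A sketch producing a GNN-ready edge index; the window parameter generalizes the one-past/one-future context described above.

```python
import torch

def line_conversation_graph(num_utterances, window=1):
    """Build the edge index of a line conversation graph: each utterance
    connects only to its immediate neighbours (window=1 gives one past and
    one future utterance); speaker information is deliberately excluded."""
    src, dst = [], []
    for i in range(num_utterances):
        for j in range(max(0, i - window), min(num_utterances, i + window + 1)):
            if i != j:
                src.append(i)
                dst.append(j)
    return torch.tensor([src, dst], dtype=torch.long)  # (2, E), GNN-ready

edge_index = line_conversation_graph(5)
print(edge_index)
# tensor([[0, 1, 1, 2, 2, 3, 3, 4],
#         [1, 0, 2, 1, 3, 2, 4, 3]])
```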
LLMs Accelerate Annotation for Medical Information Extraction
results: 对一项医疗信息抽取任务的严格评估显示,该方法不仅大幅减少了人工干预,还保持了很高的准确性。Abstract
The unstructured nature of clinical notes within electronic health records often conceals vital patient-related information, making it challenging to access or interpret. To uncover this hidden information, specialized Natural Language Processing (NLP) models are required. However, training these models necessitates large amounts of labeled data, a process that is both time-consuming and costly when relying solely on human experts for annotation. In this paper, we propose an approach that combines Large Language Models (LLMs) with human expertise to create an efficient method for generating ground truth labels for medical text annotation. By utilizing LLMs in conjunction with human annotators, we significantly reduce the human annotation burden, enabling the rapid creation of labeled datasets. We rigorously evaluate our method on a medical information extraction task, demonstrating that our approach not only substantially cuts down on human intervention but also maintains high accuracy. The results highlight the potential of using LLMs to improve the utilization of unstructured clinical data, allowing for the swift deployment of tailored NLP solutions in healthcare.
摘要
电子健康记录中临床笔记的非结构化特性往往掩盖了与患者相关的重要信息,使其难以获取或解读。要挖掘这些隐藏信息,需要专门的自然语言处理(NLP)模型。然而,训练这些模型需要大量标注数据,若仅依靠人类专家进行标注,这一过程既耗时又昂贵。在这篇论文中,我们提出了一种将大型语言模型(LLMs)与人类专业知识相结合的方法,高效地为医疗文本标注生成真值标签。通过让LLMs与人类标注者协同工作,我们显著减轻了人工标注负担,从而能够快速构建标注数据集。我们在一项医疗信息抽取任务上严格评估了该方法,结果表明它不仅大幅减少了人工干预,还保持了很高的准确性。这些结果凸显了利用LLMs改善非结构化临床数据利用率的潜力,使定制化的NLP解决方案能够在医疗领域快速部署。
paper_authors: Shiqian Li, Kewen Wu, Chi Zhang, Yixin Zhu
for: 评估智能体的物理推理能力,尤其是在与动态事件的交互中。
methods: 引入I-PHYRE框架,要求智能体同时展现直观物理推理、多步规划与实时现场干预能力。
results: 发现现有学习算法与人类表现之间存在显著差距,凸显了在增强智能体交互式物理推理能力方面开展更多研究的必要性。Abstract
Current evaluation protocols predominantly assess physical reasoning in stationary scenes, creating a gap in evaluating agents' abilities to interact with dynamic events. While contemporary methods allow agents to modify initial scene configurations and observe consequences, they lack the capability to interact with events in real time. To address this, we introduce I-PHYRE, a framework that challenges agents to simultaneously exhibit intuitive physical reasoning, multi-step planning, and in-situ intervention. Here, intuitive physical reasoning refers to a quick, approximate understanding of physics to address complex problems; multi-step denotes the need for extensive sequence planning in I-PHYRE, considering each intervention can significantly alter subsequent choices; and in-situ implies the necessity for timely object manipulation within a scene, where minor timing deviations can result in task failure. We formulate four game splits to scrutinize agents' learning and generalization of essential principles of interactive physical reasoning, fostering learning through interaction with representative scenarios. Our exploration involves three planning strategies and examines several supervised and reinforcement agents' zero-shot generalization proficiency on I-PHYRE. The outcomes highlight a notable gap between existing learning algorithms and human performance, emphasizing the imperative for more research in enhancing agents with interactive physical reasoning capabilities. The environment and baselines will be made publicly available.
摘要
当前的评估协议主要在静态场景中考察物理推理能力,在评估智能体与动态事件交互的能力方面存在空白。现有方法允许智能体修改初始场景配置并观察后果,但缺乏与事件实时交互的能力。为此,我们提出了I-PHYRE框架,要求智能体同时展现直观物理推理、多步规划和实时现场干预能力。这里,直观物理推理指对物理规律快速、近似的理解,用以解决复杂问题;多步指I-PHYRE需要大量的序列规划,因为每次干预都可能显著改变后续选择;现场则意味着必须在场景中适时地操纵物体,微小的时机偏差都可能导致任务失败。我们设计了四个游戏划分,以考察智能体对交互式物理推理基本原理的学习与泛化,通过与代表性场景的交互促进学习。我们的探索涉及三种规划策略,并考察了若干监督学习和强化学习智能体在I-PHYRE上的零样本泛化能力。结果凸显出现有学习算法与人类表现之间的显著差距,强调了在赋予智能体交互式物理推理能力方面开展更多研究的必要性。环境和基线将公开发布。
PaSCo: Urban 3D Panoptic Scene Completion with Uncertainty Awareness
results: 在三个大规模自动驾驶数据集上超越所有基线,并在体素级与实例级不确定性估计上取得更好的性能。Abstract
We propose the task of Panoptic Scene Completion (PSC) which extends the recently popular Semantic Scene Completion (SSC) task with instance-level information to produce a richer understanding of the 3D scene. Our PSC proposal utilizes a hybrid mask-based technique on the non-empty voxels from sparse multi-scale completions. Whereas the SSC literature overlooks uncertainty which is critical for robotics applications, we instead propose an efficient ensembling to estimate both voxel-wise and instance-wise uncertainties along PSC. This is achieved by building on a multi-input multi-output (MIMO) strategy, while improving performance and yielding better uncertainty for little additional compute. Additionally, we introduce a technique to aggregate permutation-invariant mask predictions. Our experiments demonstrate that our method surpasses all baselines in both Panoptic Scene Completion and uncertainty estimation on three large-scale autonomous driving datasets. Our code and data are available at https://astra-vision.github.io/PaSCo .
Latent Feature-Guided Diffusion Models for Shadow Removal
paper_authors: Kangfu Mei, Luis Figueroa, Zhe Lin, Zhihong Ding, Scott Cohen, Vishal M. Patel
for: Improving the accuracy of texture restoration in shadowed images.
methods: Uses diffusion models conditioned on a learned latent feature space that inherits the characteristics of shadow-free images, avoiding the limitation of conventional methods that condition only on degraded images; noise features are also fused with the diffusion network to alleviate potential local optima during training.
results: Outperforms the previous best method by 13% in RMSE on the AISTD dataset and by 82% in RMSE on the DESOBA dataset.Abstract
Recovering textures under shadows has remained a challenging problem due to the difficulty of inferring shadow-free scenes from shadow images. In this paper, we propose the use of diffusion models as they offer a promising approach to gradually refine the details of shadow regions during the diffusion process. Our method improves this process by conditioning on a learned latent feature space that inherits the characteristics of shadow-free images, thus avoiding the limitation of conventional methods that condition on degraded images only. Additionally, we propose to alleviate potential local optima during training by fusing noise features with the diffusion network. We demonstrate the effectiveness of our approach which outperforms the previous best method by 13% in terms of RMSE on the AISTD dataset. Further, we explore instance-level shadow removal, where our model outperforms the previous best method by 82% in terms of RMSE on the DESOBA dataset.
Guarding Barlow Twins Against Overfitting with Mixed Samples
paper_authors: Wele Gedara Chaminda Bandara, Celso M. De Melo, Vishal M. Patel
for: Improving the performance of self-supervised learning (SSL) with the Barlow Twins algorithm; proposes Mixed Barlow Twins to address the challenge of feature overfitting.
methods: Pre-trains a network with Barlow Twins on a large dataset and introduces a new regularization term to the original objective to improve sample interaction during training.
results: Reports improved performance over the original Barlow Twins on several benchmark datasets (CIFAR-10, CIFAR-100, TinyImageNet, STL-10, and ImageNet), indicating that the proposed method mitigates feature overfitting and enhances downstream performance.Abstract
Self-supervised Learning (SSL) aims to learn transferable feature representations for downstream applications without relying on labeled data. The Barlow Twins algorithm, renowned for its widespread adoption and straightforward implementation compared to its counterparts like contrastive learning methods, minimizes feature redundancy while maximizing invariance to common corruptions. Optimizing for the above objective forces the network to learn useful representations, while avoiding noisy or constant features, resulting in improved downstream task performance with limited adaptation. Despite Barlow Twins' proven effectiveness in pre-training, the underlying SSL objective can inadvertently cause feature overfitting due to the lack of strong interaction between the samples unlike the contrastive learning approaches. From our experiments, we observe that optimizing for the Barlow Twins objective doesn't necessarily guarantee sustained improvements in representation quality beyond a certain pre-training phase, and can potentially degrade downstream performance on some datasets. To address this challenge, we introduce Mixed Barlow Twins, which aims to improve sample interaction during Barlow Twins training via linearly interpolated samples. This results in an additional regularization term to the original Barlow Twins objective, assuming linear interpolation in the input space translates to linearly interpolated features in the feature space. Pre-training with this regularization effectively mitigates feature overfitting and further enhances the downstream performance on CIFAR-10, CIFAR-100, TinyImageNet, STL-10, and ImageNet datasets. The code and checkpoints are available at: https://github.com/wgcban/mix-bt.git
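A minimal sketch of the two ingredients described above, assuming a generic encoder: the standard Barlow Twins cross-correlation loss, plus a mixup-style regularizer that asks features of linearly interpolated inputs to match linearly interpolated features. The exact form of the paper's regularization term may differ; see the linked repository for the authors' implementation.

```python
# Schematic reconstruction, not the authors' code (see
# https://github.com/wgcban/mix-bt.git).
import torch

def barlow_twins_loss(z1, z2, lambd=5e-3):
    # Standardize embeddings per dimension, then compute the d x d
    # cross-correlation matrix between the two views.
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
    c = (z1.T @ z2) / z1.shape[0]
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()               # invariance
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # redundancy
    return on_diag + lambd * off_diag

def mixed_regularizer(encoder, x1, x2, z1, z2, alpha=1.0):
    # Linearly interpolate inputs and ask the encoder's output to match the
    # same interpolation in feature space (the paper's linearity assumption).
    lam = torch.distributions.Beta(alpha, alpha).sample()
    z_mix = encoder(lam * x1 + (1 - lam) * x2)
    z_target = lam * z1.detach() + (1 - lam) * z2.detach()
    return (z_mix - z_target).pow(2).mean()

# Toy usage with a linear encoder on random "images".
enc = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(12, 8))
x1, x2 = torch.randn(16, 3, 2, 2), torch.randn(16, 3, 2, 2)
z1, z2 = enc(x1), enc(x2)
print(float(barlow_twins_loss(z1, z2) + mixed_regularizer(enc, x1, x2, z1, z2)))
```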
paper_authors: Xiaojuan Wang, Janne Kontkanen, Brian Curless, Steve Seitz, Ira Kemelmacher, Ben Mildenhall, Pratul Srinivasan, Dor Verbin, Aleksander Holynski
results: Compared with traditional super-resolution and outpainting methods, ours is shown to be the most effective at generating consistent multi-scale content.Abstract
We present a method that uses a text-to-image model to generate consistent content across multiple image scales, enabling extreme semantic zooms into a scene, e.g., ranging from a wide-angle landscape view of a forest to a macro shot of an insect sitting on one of the tree branches. We achieve this through a joint multi-scale diffusion sampling approach that encourages consistency across different scales while preserving the integrity of each individual sampling process. Since each generated scale is guided by a different text prompt, our method enables deeper levels of zoom than traditional super-resolution methods that may struggle to create new contextual structure at vastly different scales. We compare our method qualitatively with alternative techniques in image super-resolution and outpainting, and show that our method is most effective at generating consistent multi-scale content.
Competition-Level Problems are Effective LLM Evaluators
results: GPT-4 exhibits a cliff-like decline on Codeforces problems released after September 2021, suggesting potential data contamination and highlighting the difficulty existing LLMs have in solving unseen complex reasoning problems.Abstract
Large language models (LLMs) have demonstrated impressive reasoning capabilities, yet there is ongoing debate about these abilities and the potential data contamination problem recently. This paper aims to evaluate the reasoning capacities of LLMs, specifically in solving recent competition-level programming problems in Codeforces, which are expert-crafted and unique, requiring deep understanding and robust reasoning skills. We first provide a comprehensive evaluation of GPT-4's perceived zero-shot performance on this task, considering various aspects such as problems' release time, difficulties, and types of errors encountered. Surprisingly, the perceived performance of GPT-4 has experienced a cliff-like decline on problems released after September 2021, consistently across all difficulties and types of problems, which shows the potential data contamination, as well as the challenges for any existing LLM to solve unseen complex reasoning problems. We further explore various approaches such as fine-tuning, Chain-of-Thought prompting and problem description simplification; unfortunately none of them is able to consistently mitigate the challenges. Through our work, we emphasize the importance of this excellent data source for assessing the genuine reasoning capabilities of LLMs, and foster the development of LLMs with stronger reasoning abilities and better generalization in the future.
EMDM: Efficient Motion Diffusion Model for Fast, High-Quality Motion Generation
methods: Uses a Conditional Denoising Diffusion GAN to model multimodal data distributions, enabling high-quality human motion generation in far fewer sampling steps.
results: Generates high-quality human motion quickly at inference time, and employs a motion geometric loss during training to improve motion quality and training efficiency.Abstract
We introduce Efficient Motion Diffusion Model (EMDM) for fast and high-quality human motion generation. Although previous motion diffusion models have shown impressive results, they struggle to achieve fast generation while maintaining high-quality human motions. Motion latent diffusion has been proposed for efficient motion generation. However, effectively learning a latent space can be non-trivial in such a two-stage manner. Meanwhile, accelerating motion sampling by increasing the step size, e.g., DDIM, typically leads to a decline in motion quality due to the inapproximation of complex data distributions when naively increasing the step size. In this paper, we propose EMDM that allows for much fewer sample steps for fast motion generation by modeling the complex denoising distribution during multiple sampling steps. Specifically, we develop a Conditional Denoising Diffusion GAN to capture multimodal data distributions conditioned on both control signals, i.e., textual description and denoising time step. By modeling the complex data distribution, a larger sampling step size and fewer steps are achieved during motion synthesis, significantly accelerating the generation process. To effectively capture the human dynamics and reduce undesired artifacts, we employ motion geometric loss during network training, which improves the motion quality and training efficiency. As a result, EMDM achieves a remarkable speed-up at the generation stage while maintaining high-quality motion generation in terms of fidelity and diversity.
DiffiT: Diffusion Vision Transformers for Image Generation
results: Experiments show that DiffiT is surprisingly effective at generating high-fidelity images, achieving state-of-the-art (SOTA) results on a variety of class-conditional and unconditional synthesis tasks; in latent space, DiffiT achieves a new SOTA FID score of 1.73 on the ImageNet-256 dataset.Abstract
Diffusion models with their powerful expressivity and high sample quality have enabled many new applications and use-cases in various domains. For sample generation, these models rely on a denoising neural network that generates images by iterative denoising. Yet, the role of denoising network architecture is not well-studied with most efforts relying on convolutional residual U-Nets. In this paper, we study the effectiveness of vision transformers in diffusion-based generative learning. Specifically, we propose a new model, denoted as Diffusion Vision Transformers (DiffiT), which consists of a hybrid hierarchical architecture with a U-shaped encoder and decoder. We introduce a novel time-dependent self-attention module that allows attention layers to adapt their behavior at different stages of the denoising process in an efficient manner. We also introduce latent DiffiT which consists of transformer model with the proposed self-attention layers, for high-resolution image generation. Our results show that DiffiT is surprisingly effective in generating high-fidelity images, and it achieves state-of-the-art (SOTA) benchmarks on a variety of class-conditional and unconditional synthesis tasks. In the latent space, DiffiT achieves a new SOTA FID score of 1.73 on ImageNet-256 dataset. Repository: https://github.com/NVlabs/DiffiT
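The paper's key architectural idea, attention layers whose behavior depends on the denoising step, can be sketched as projections that mix spatial tokens with a time embedding before attention. This is a schematic reconstruction with made-up layer sizes, not the paper's actual TMSA module:

```python
# Hedged sketch of time-dependent self-attention in the spirit of DiffiT.
import torch
import torch.nn as nn

class TimeDependentSelfAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_qkv_x = nn.Linear(dim, 3 * dim)   # spatial contribution
        self.to_qkv_t = nn.Linear(dim, 3 * dim)   # denoising-time contribution

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) tokens; t_emb: (B, dim) denoising-step embedding.
        # Queries/keys/values depend on both the tokens and the time step,
        # letting attention adapt its behavior across denoising stages.
        qkv = self.to_qkv_x(x) + self.to_qkv_t(t_emb).unsqueeze(1)
        q, k, v = qkv.chunk(3, dim=-1)
        out, _ = self.attn(q, k, v)
        return out

x, t = torch.randn(2, 16, 64), torch.randn(2, 64)
print(TimeDependentSelfAttention(64)(x, t).shape)  # torch.Size([2, 16, 64])
```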
Hot PATE: Private Aggregation of Distributions for Diverse Task
results: Across diverse tasks, hot PATE preserves both privacy and the diversity of responses while achieving privacy-utility tradeoffs comparable to, and in diverse settings significantly surpassing, the baseline "cold" PATE.Abstract
The Private Aggregation of Teacher Ensembles (PATE) framework~\cite{PapernotAEGT:ICLR2017} is a versatile approach to privacy-preserving machine learning. In PATE, teacher models are trained on distinct portions of sensitive data, and their predictions are privately aggregated to label new training examples for a student model. Until now, PATE has primarily been explored with classification-like tasks, where each example possesses a ground-truth label, and knowledge is transferred to the student by labeling public examples. Generative AI models, however, excel in open ended \emph{diverse} tasks with multiple valid responses and scenarios that may not align with traditional labeled examples. Furthermore, the knowledge of models is often encapsulated in the response distribution itself and may be transferred from teachers to student in a more fluid way. We propose \emph{hot PATE}, tailored for the diverse setting. In hot PATE, each teacher model produces a response distribution and the aggregation method must preserve both privacy and diversity of responses. We demonstrate, analytically and empirically, that hot PATE achieves privacy-utility tradeoffs that are comparable to, and in diverse settings, significantly surpass, the baseline ``cold'' PATE.
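To make the shift from vote aggregation to distribution aggregation concrete, here is a toy sketch: each teacher contributes a full response distribution, the distributions are averaged (preserving diversity), noised, and a response is sampled. The Gaussian noise here is purely illustrative; the paper's actual mechanism and its privacy accounting are more involved.

```python
# Toy illustration of aggregating teacher response *distributions*, in the
# spirit of hot PATE. Noise mechanism and sizes are illustrative only.
import numpy as np

def hot_aggregate(teacher_dists: np.ndarray, sigma: float, rng) -> int:
    """teacher_dists: (T, V) per-teacher probability vectors over V responses.
    Returns the index of a response sampled from the noisy aggregate."""
    agg = teacher_dists.mean(axis=0)          # averaging preserves diversity
    noisy = np.clip(agg + rng.normal(0.0, sigma, size=agg.shape), 1e-9, None)
    noisy /= noisy.sum()                      # renormalize to a distribution
    return int(rng.choice(len(noisy), p=noisy))

rng = np.random.default_rng(0)
dists = rng.dirichlet(np.ones(10), size=5)    # 5 teachers, 10 candidate responses
print(hot_aggregate(dists, sigma=0.02, rng=rng))
```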
SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM
paper_authors: Nikhil Keetha, Jay Karhade, Krishna Murthy Jatavallabhula, Gengshan Yang, Sebastian Scherer, Deva Ramanan, Jonathon Luiten
for: dense simultaneous localization and mapping (SLAM) in embodied scene understanding
methods: using 3D Gaussians for high-quality reconstruction and real-time rendering of scenes with a single unposed monocular RGB-D camera
results: achieves up to 2X state-of-the-art performance in camera pose estimation, map construction, and novel-view synthesis, while allowing real-time rendering of a high-resolution dense 3D map.Abstract
Dense simultaneous localization and mapping (SLAM) is pivotal for embodied scene understanding. Recent work has shown that 3D Gaussians enable high-quality reconstruction and real-time rendering of scenes using multiple posed cameras. In this light, we show for the first time that representing a scene by 3D Gaussians can enable dense SLAM using a single unposed monocular RGB-D camera. Our method, SplaTAM, addresses the limitations of prior radiance field-based representations, including fast rendering and optimization, the ability to determine if areas have been previously mapped, and structured map expansion by adding more Gaussians. We employ an online tracking and mapping pipeline while tailoring it to specifically use an underlying Gaussian representation and silhouette-guided optimization via differentiable rendering. Extensive experiments show that SplaTAM achieves up to 2X state-of-the-art performance in camera pose estimation, map construction, and novel-view synthesis, demonstrating its superiority over existing approaches, while allowing real-time rendering of a high-resolution dense 3D map.
TPPoet: Transformer-Based Persian Poem Generation using Minimal Data and Advanced Decoding Techniques
results: The study shows that the proposed decoding method generates more coherent and meaningful Persian classical poetry, with clear advantages in both automatic and human evaluations.Abstract
Recent advances in language models (LMs), have demonstrated significant efficacy in tasks related to the arts and humanities. While LMs have exhibited exceptional performance across a wide range of natural language processing tasks, there are notable challenges associated with their utilization on small datasets and their ability to replicate more creative human capacities. In this study, we aim to address these challenges by training a Persian classical poetry generation model using a transformer architecture on a specialized dataset with no pretraining. Additionally, we propose a novel decoding method to enhance coherence and meaningfulness in the generated poetry, effectively managing the tradeoff between diversity and quality. Furthermore, the results of our training approach and the proposed decoding method are evaluated through comprehensive set of automatic and human evaluations and showed its superior capability to generate coherent and meaningful poetry in compare to other decoding methods and an existing Persian large language model (LLM).
for: Improving the quality and diversity of instruction data for code LLMs, for better performance on code generation, multilingual coding, and data-science program completion.
methods: Introduces OSS-Instruct, a data generation method that seeds an LLM with open-source code snippets to produce high-quality instruction data and reduce the inherent bias of LLM-generated synthetic data; further combines it with other data generation methods such as Evol-Instruct to build the stronger MagicoderS.
results: Magicoder and MagicoderS perform strongly across a wide range of coding benchmarks, including Python text-to-code generation, multilingual coding, and data-science program completion; notably, MagicoderS-CL-7B based on CodeLlama even surpasses the prominent ChatGPT on HumanEval+ (66.5 vs. 65.9 in pass@1). Overall, OSS-Instruct opens a new direction for low-bias, high-quality instruction tuning using abundant open-source references.Abstract
We introduce Magicoder, a series of fully open-source (code, weights, and data) Large Language Models (LLMs) for code that significantly closes the gap with top code models while having no more than 7B parameters. Magicoder models are trained on 75K synthetic instruction data using OSS-Instruct, a novel approach to enlightening LLMs with open-source code snippets to generate high-quality instruction data for code. Our main motivation is to mitigate the inherent bias of the synthetic data generated by LLMs by empowering them with a wealth of open-source references for the production of more diverse, realistic, and controllable data. The orthogonality of OSS-Instruct and other data generation methods like Evol-Instruct further enables us to build an enhanced MagicoderS. Both Magicoder and MagicoderS substantially outperform state-of-the-art code models with similar or even larger sizes on a wide range of coding benchmarks, including Python text-to-code generation, multilingual coding, and data-science program completion. Notably, MagicoderS-CL-7B based on CodeLlama even surpasses the prominent ChatGPT on HumanEval+ (66.5 vs. 65.9 in pass@1). Overall, OSS-Instruct opens a new direction for low-bias and high-quality instruction tuning using abundant open-source references.
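The essence of OSS-Instruct, seeding a teacher model with a real open-source snippet so the generated problem inherits its realism and diversity, reduces to a simple prompting step. A schematic sketch in which the prompt wording and the `generate` callable are hypothetical stand-ins, not Magicoder's actual prompts or pipeline:

```python
# Hedged sketch of the OSS-Instruct idea. `generate` is any LLM completion
# call (hypothetical here); the prompt wording is illustrative.
PROMPT_TEMPLATE = (
    "Here is a code snippet extracted from an open-source project:\n\n"
    "{seed_snippet}\n\n"
    "Inspired by this snippet, write (1) a self-contained coding problem and "
    "(2) a correct, well-explained solution to it."
)

def make_instruction_pair(seed_snippet: str, generate) -> str:
    """Turn one open-source snippet into one synthetic instruction example."""
    return generate(PROMPT_TEMPLATE.format(seed_snippet=seed_snippet))
```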
Tree of Attacks: Jailbreaking Black-Box LLMs Automatically
results: Experiments show that the method jailbreaks the target LLM for more than 80% of prompts while using only a small number of queries, a substantial reduction compared with the previous black-box method.Abstract
While Large Language Models (LLMs) display versatile functionality, they continue to generate harmful, biased, and toxic content, as demonstrated by the prevalence of human-designed jailbreaks. In this work, we present Tree of Attacks with Pruning (TAP), an automated method for generating jailbreaks that only requires black-box access to the target LLM. TAP utilizes an LLM to iteratively refine candidate (attack) prompts using tree-of-thoughts reasoning until one of the generated prompts jailbreaks the target. Crucially, before sending prompts to the target, TAP assesses them and prunes the ones unlikely to result in jailbreaks. Using tree-of-thought reasoning allows TAP to navigate a large search space of prompts and pruning reduces the total number of queries sent to the target. In empirical evaluations, we observe that TAP generates prompts that jailbreak state-of-the-art LLMs (including GPT4 and GPT4-Turbo) for more than 80% of the prompts using only a small number of queries. This significantly improves upon the previous state-of-the-art black-box method for generating jailbreaks.
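The branch-then-prune loop described above can be summarized in a few lines. In this sketch `attacker`, `evaluator`, and `target` are hypothetical callables standing in for the attacker LLM, the judge, and the black-box target; the 1-10 scoring scale is illustrative.

```python
# High-level sketch of the Tree of Attacks with Pruning loop (not the
# authors' code); all callables and thresholds are illustrative.
def tap(goal, attacker, evaluator, target, branch=3, depth=5, width=4):
    frontier = [goal]  # root of the attack tree
    for _ in range(depth):
        # Branch: refine every surviving prompt into several children.
        children = [attacker(goal, p) for p in frontier for _ in range(branch)]
        # Prune *before* querying the target: drop unpromising prompts.
        scored = sorted(((evaluator(goal, p, None), p) for p in children),
                        reverse=True)
        frontier = [p for score, p in scored[:width] if score > 0]
        for prompt in frontier:
            response = target(prompt)
            if evaluator(goal, prompt, response) >= 10:  # judged a jailbreak
                return prompt
    return None  # no jailbreak found within the budget
```

Pruning before the target query is what keeps the total number of black-box queries small while tree-of-thought branching explores a large prompt space.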
Innovations in Agricultural Forecasting: A Multivariate Regression Study on Global Crop Yield Prediction
results: The Random Forest regression model achieved a coefficient of determination (r^2) of 0.94 with a margin of error (ME) of 0.03. The models were trained and tested on data from the Food and Agriculture Organization of the United Nations and the World Bank Climate Change Data Catalog, and each parameter was analyzed to understand how different factors affect overall yield.Abstract
The prediction of crop yields internationally is a crucial objective in agricultural research. Thus, this study implements 6 regression models (Linear, Tree, Gradient Descent, Gradient Boosting, K-Nearest Neighbors, and Random Forest) to predict crop yields in 196 countries. Given 4 key training parameters, pesticides (tonnes), rainfall (mm), temperature (Celsius), and yield (hg/ha), it was found that our Random Forest Regression model achieved a determination coefficient (r^2) of 0.94, with a margin of error (ME) of .03. The models were trained and tested using the Food and Agricultural Organization of the United Nations data, along with the World Bank Climate Change Data Catalog. Furthermore, each parameter was analyzed to understand how varying factors could impact overall yield. We used unconventional models, contrary to generally used Deep Learning (DL) and Machine Learning (ML) models, combined with recently collected data to implement a unique approach in our research. Existing scholarship would benefit from understanding the most optimal model for agricultural research, specifically using the United Nations data.
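The best-performing model is straightforward to reproduce in outline with scikit-learn. A minimal sketch, assuming the FAO and World Bank data have already been merged into a CSV; the file and column names are placeholders, not the paper's files:

```python
# Minimal sklearn sketch of the paper's strongest model: a Random Forest
# regressor on pesticides, rainfall, and temperature, predicting yield.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

df = pd.read_csv("crop_data.csv")  # merged FAO + World Bank data (placeholder)
X = df[["pesticides_tonnes", "rainfall_mm", "temperature_c"]]
y = df["yield_hg_per_ha"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestRegressor(n_estimators=300, random_state=42).fit(X_tr, y_tr)
print("r^2:", r2_score(y_te, model.predict(X_te)))  # paper reports 0.94
```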
TriDeNT: Triple Deep Network Training for Privileged Knowledge Distillation in Histopathology
results: Across a range of paired data, TriDeNT outperforms other state-of-the-art methods on downstream tasks, with observed improvements of up to 101%.Abstract
Computational pathology models rarely utilise data that will not be available for inference. This means most models cannot learn from highly informative data such as additional immunohistochemical (IHC) stains and spatial transcriptomics. We present TriDeNT, a novel self-supervised method for utilising privileged data that is not available during inference to improve performance. We demonstrate the efficacy of this method for a range of different paired data including immunohistochemistry, spatial transcriptomics and expert nuclei annotations. In all settings, TriDeNT outperforms other state-of-the-art methods in downstream tasks, with observed improvements of up to 101%. Furthermore, we provide qualitative and quantitative measurements of the features learned by these models and how they differ from baselines. TriDeNT offers a novel method to distil knowledge from scarce or costly data during training, to create significantly better models for routine inputs.
Diversify, Don’t Fine-Tune: Scaling Up Visual Recognition Training with Synthetic Images
paper_authors: Zhuoran Yu, Chenchen Zhu, Sean Culatana, Raghuraman Krishnamoorthi, Fanyi Xiao, Yong Jae Lee
for: Improving the performance and out-of-domain generalization of visual recognition models.
methods: Leverages off-the-shelf generative models to produce synthetic training images, improving the quality and diversity of the generated data to boost model performance.
results: Scaling up synthetic data yields consistent performance gains, up to 6x the original ImageNet size, showing that synthetic data can improve recognition models and strengthen out-of-domain generalization.Abstract
Recent advances in generative deep learning have enabled the creation of high-quality synthetic images in text-to-image generation. Prior work shows that fine-tuning a pretrained diffusion model on ImageNet and generating synthetic training images from the finetuned model can enhance an ImageNet classifier's performance. However, performance degrades as synthetic images outnumber real ones. In this paper, we explore whether generative fine-tuning is essential for this improvement and whether it is possible to further scale up training using more synthetic data. We present a new framework leveraging off-the-shelf generative models to generate synthetic training images, addressing multiple challenges: class name ambiguity, lack of diversity in naive prompts, and domain shifts. Specifically, we leverage large language models (LLMs) and CLIP to resolve class name ambiguity. To diversify images, we propose contextualized diversification (CD) and stylized diversification (SD) methods, also prompted by LLMs. Finally, to mitigate domain shifts, we leverage domain adaptation techniques with auxiliary batch normalization for synthetic images. Our framework consistently enhances recognition model performance with more synthetic data, up to 6x of original ImageNet size showcasing the potential of synthetic data for improved recognition models and strong out-of-domain generalization.
Authoring Worked Examples for Java Programming with Human-AI Collaboration
results: The study finds that this human-AI collaboration approach can produce high-quality code explanations while reducing instructor workload.Abstract
Worked examples (solutions to typical programming problems, presented as source code in a certain language and used to explain the topics from a programming class) are among the most popular types of learning content in programming classes. Most approaches and tools for presenting these examples to students are based on line-by-line explanations of the example code. However, instructors rarely have time to provide line-by-line explanations for the large number of examples typically used in a programming class. In this paper, we explore and assess a human-AI collaboration approach to authoring worked examples for Java programming. We introduce an authoring system for creating Java worked examples that generates a starting version of code explanations and presents it to the instructor to edit if necessary. We also present a study that assesses the quality of explanations created with this approach.
for: Evaluating the ability of state-of-the-art large language models (LLMs) to solve PhD-level to research-level computational physics problems.
methods: Uses well-documented and widely-used packages to elicit coding capabilities in the physics and astrophysics domains.
results: Current SOTA LLMs (GPT4) fail most of the problems, but about 40% of the solutions could plausibly receive a passing grade. The paper identifies several failure modes of GPT4 in the computational physics domain and provides a snapshot of current computational capabilities in classical physics.Abstract
[Abridged abstract] Large Language Models (LLMs) can solve some undergraduate-level to graduate-level physics textbook problems and are proficient at coding. Combining these two capabilities could one day enable AI systems to simulate and predict the physical world. We present an evaluation of state-of-the-art (SOTA) LLMs on PhD-level to research-level computational physics problems. We condition LLM generation on the use of well-documented and widely-used packages to elicit coding capabilities in the physics and astrophysics domains. We contribute $\sim 50$ original and challenging problems in celestial mechanics (with REBOUND), stellar physics (with MESA), 1D fluid dynamics (with Dedalus) and non-linear dynamics (with SciPy). Since our problems do not admit unique solutions, we evaluate LLM performance on several soft metrics: counts of lines that contain different types of errors (coding, physics, necessity and sufficiency) as well as a more "educational" Pass-Fail metric focused on capturing the salient physical ingredients of the problem at hand. As expected, today's SOTA LLM (GPT4) zero-shot fails most of our problems, although about 40\% of the solutions could plausibly get a passing grade. About $70-90 \%$ of the code lines produced are necessary, sufficient and correct (coding \& physics). Physics and coding errors are the most common, with some unnecessary or insufficient lines. We observe significant variations across problem class and difficulty. We identify several failure modes of GPT4 in the computational physics domain. Our reconnaissance work provides a snapshot of current computational capabilities in classical physics and points to obvious improvement targets if AI systems are ever to reach a basic level of autonomy in physics simulation capabilities.
Fine-Tuning Language Models for Context-Specific SQL Query Generation
results: 研究发现,通过 fine-tuning 三个开源 LLM(Starcoder Plus、Code-Llama 和 Mistral),可以在零容量设定下达到高度的查询准确率,其中 Code-Llama 在 Snowflake SQL 和 GoogleSQL 语言方言上的准确率分别为 81.58% 和 82.66%。这些结果表明,适应域pecific任务的 LLM fine-tuning 是一种有效的方法,并且可能为抽象数据库通过自然语言界面提供更好的访问方式。Abstract
The ability to generate SQL queries from natural language has significant implications for making data accessible to non-specialists. This paper presents a novel approach to fine-tuning open-source large language models (LLMs) for the task of transforming natural language into SQL queries within the retail domain. We introduce models specialized in generating SQL queries, trained on synthetic datasets tailored to the Snowflake SQL and GoogleSQL dialects. Our methodology involves generating a context-specific dataset using GPT-4, then fine-tuning three open-source LLMs(Starcoder Plus, Code-Llama, and Mistral) employing the LoRa technique to optimize for resource constraints. The fine-tuned models demonstrate superior performance in zero-shot settings compared to the baseline GPT-4, with Code-Llama achieving the highest accuracy rates, at 81.58% for Snowflake SQL and 82.66% for GoogleSQL. These results underscore the effectiveness of fine-tuning LLMs on domain-specific tasks and suggest a promising direction for enhancing the accessibility of relational databases through natural language interfaces.
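The fine-tuning recipe, LoRA adapters on an open-source code LLM, can be sketched with Hugging Face transformers and peft. The model id, rank, and target modules below are illustrative choices, not the paper's exact configuration:

```python
# Hedged sketch of LoRA fine-tuning an open-source code LLM for text-to-SQL.
# Requires substantial GPU memory; hyperparameters are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "codellama/CodeLlama-7b-hf"  # example base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adapt only attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only a small fraction of weights train
```

Training then proceeds on (natural-language question, SQL query) pairs in the target dialect, which is what allows a small adapter to specialize the model to Snowflake SQL or GoogleSQL.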
Integrating AI into CCTV Systems: A Comprehensive Evaluation of Smart Video Surveillance in Community Space
results: Our evaluation shows the system performed robustly at a community college, handling video streams from 16 CCTV cameras and sustaining a throughput of 16.5 frames per second over a 21-hour run, with an average end-to-end latency of 26.76 seconds for detecting anomalies and alerting users.Abstract
This article presents an AI-enabled Smart Video Surveillance (SVS) designed to enhance safety in community spaces such as educational and recreational areas, and small businesses. The proposed system innovatively integrates with existing CCTV and wired camera networks, simplifying its adoption across various community cases to leverage recent AI advancements. Our SVS system, focusing on privacy, uses metadata instead of pixel data for activity recognition, aligning with ethical standards. It features cloud-based infrastructure and a mobile app for real-time, privacy-conscious alerts in communities. This article notably pioneers a comprehensive real-world evaluation of the SVS system, covering AI-driven visual processing, statistical analysis, database management, cloud communication, and user notifications. It's also the first to assess an end-to-end anomaly detection system's performance, vital for identifying potential public safety incidents. For our evaluation, we implemented the system in a community college, serving as an ideal model to exemplify the proposed system's capabilities. Our findings in this setting demonstrate the system's robustness, with throughput, latency, and scalability effectively managing 16 CCTV cameras. The system maintained a consistent 16.5 frames per second (FPS) over a 21-hour operation. The average end-to-end latency for detecting behavioral anomalies and alerting users was 26.76 seconds.
Know Your Audience: Do LLMs Adapt to Different Age and Education Levels?
results: We find large variations in the readability of answers across LLMs, and the readability of LLM answers often does not match the comprehension level of the target audience. These results indicate that LLMs need to adapt better to their intended audiences to be more comprehensible, and point to the need for improved adaptability in educational settings to serve readers of different ages and education levels.Abstract
Large language models (LLMs) offer a range of new possibilities, including adapting the text to different audiences and their reading needs. But how well do they adapt? We evaluate the readability of answers generated by four state-of-the-art LLMs (commercial and open-source) to science questions when prompted to target different age groups and education levels. To assess the adaptability of LLMs to diverse audiences, we compare the readability scores of the generated responses against the recommended comprehension level of each age and education group. We find large variations in the readability of the answers by different LLMs. Our results suggest LLM answers need to be better adapted to the intended audience demographics to be more comprehensible. They underline the importance of enhancing the adaptability of LLMs in education settings to cater to diverse age and education levels. Overall, current LLMs have set readability ranges and do not adapt well to different audiences, even when prompted. That limits their potential for educational purposes.
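Comparing an answer's readability against a target audience is typically done with standard readability indices. A small sketch using the textstat package's Flesch-Kincaid grade level; the paper may use different metrics and thresholds, which are illustrative here:

```python
# Sketch of checking whether an LLM answer matches a target reading grade,
# using the textstat package. Tolerance is an illustrative choice.
import textstat

def matches_audience(answer: str, target_grade: float, tol: float = 2.0) -> bool:
    """True if the answer's Flesch-Kincaid grade is within tol of the target."""
    grade = textstat.flesch_kincaid_grade(answer)
    return abs(grade - target_grade) <= tol

answer = ("Plants use sunlight to turn water and air into food. "
          "This process is called photosynthesis.")
print(textstat.flesch_kincaid_grade(answer), matches_audience(answer, 5.0))
```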
Near-real-time Earthquake-induced Fatality Estimation using Crowdsourced Data and Large-Language Models
paper_authors: Chenguang Wang, Davis Engler, Xuechun Li, James Hou, David J. Wald, Kishor Jaiswal, Susu Xu
for: Improving the accuracy and timeliness of forecasting earthquake-induced human losses worldwide.
methods: Uses multilingual, crowdsourced social media data, combined with hierarchical casualty extraction models, physical-constraint-aware dynamic truth discovery models, and Bayesian updating loss projection models, to improve the accuracy and timeliness of earthquake-induced human loss forecasting.
results: Applied in real time to a series of global earthquake events, the framework produces human loss estimates with speed and accuracy comparable to the manual methods of the U.S. Geological Survey (USGS).Abstract
When a damaging earthquake occurs, immediate information about casualties is critical for time-sensitive decision-making by emergency response and aid agencies in the first hours and days. Systems such as Prompt Assessment of Global Earthquakes for Response (PAGER) by the U.S. Geological Survey (USGS) were developed to provide a forecast within about 30 minutes of any significant earthquake globally. Traditional systems for estimating human loss in disasters often depend on manually collected early casualty reports from global media, a process that's labor-intensive and slow with notable time delays. Recently, some systems have employed keyword matching and topic modeling to extract relevant information from social media. However, these methods struggle with the complex semantics in multilingual texts and the challenge of interpreting ever-changing, often conflicting reports of death and injury numbers from various unverified sources on social media platforms. In this work, we introduce an end-to-end framework to significantly improve the timeliness and accuracy of global earthquake-induced human loss forecasting using multi-lingual, crowdsourced social media. Our framework integrates (1) a hierarchical casualty extraction model built upon large language models, prompt design, and few-shot learning to retrieve quantitative human loss claims from social media, (2) a physical constraint-aware, dynamic-truth discovery model that discovers the truthful human loss from massive noisy and potentially conflicting human loss claims, and (3) a Bayesian updating loss projection model that dynamically updates the final loss estimation using discovered truths. We test the framework in real-time on a series of global earthquake events in 2021 and 2022 and show that our framework streamlines casualty data retrieval, achieving speed and accuracy comparable to manual methods by USGS.
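The Bayesian updating loss projection step can be illustrated with a toy conjugate-Gaussian model over log-fatalities: each newly discovered "truth" report tightens the posterior. All numbers and the observation noise model below are made up for illustration:

```python
# Toy sketch of Bayesian updating of a loss estimate (not the paper's model):
# maintain a Gaussian belief over log(fatalities) and update it as new
# discovered-truth reports arrive.
import math

def bayes_update(mu, var, obs_log, obs_var):
    """Conjugate Gaussian update of a belief N(mu, var) on log-fatalities."""
    post_var = 1.0 / (1.0 / var + 1.0 / obs_var)
    post_mu = post_var * (mu / var + obs_log / obs_var)
    return post_mu, post_var

mu, var = math.log(50.0), 1.0          # prior: ~50 deaths, wide uncertainty
for report in [80, 120, 150]:          # successive crowdsourced truth estimates
    mu, var = bayes_update(mu, var, math.log(report), obs_var=0.25)
    print(f"estimate ~{math.exp(mu):.0f} deaths (var {var:.3f})")
```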
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
results: Experiments show that TimeChat has strong zero-shot temporal localization and reasoning across video understanding tasks, e.g., +9.2 F1 score and +2.8 CIDEr on YouCook2, +5.8 HIT@1 on QVHighlights, and +27.5 R@1 (IoU=0.5) on Charades-STA. Compared with existing video large language models, TimeChat has the potential to serve as a versatile video assistant for long-form video understanding tasks and realistic user needs.Abstract
This work proposes TimeChat, a time-sensitive multimodal large language model specifically designed for long video understanding. Our model incorporates two key architectural contributions: (1) a timestamp-aware frame encoder that binds visual content with the timestamp of each frame, and (2) a sliding video Q-Former that produces a video token sequence of varying lengths to accommodate videos of various durations. Additionally, we construct an instruction-tuning dataset, encompassing 6 tasks and a total of 125K instances, to further enhance TimeChat's instruction-following performance. Experiment results across various video understanding tasks, such as dense captioning, temporal grounding, and highlight detection, demonstrate TimeChat's strong zero-shot temporal localization and reasoning capabilities. For example, it achieves +9.2 F1 score and +2.8 CIDEr on YouCook2, +5.8 HIT@1 on QVHighlights, and +27.5 R@1 (IoU=0.5) on Charades-STA, compared to state-of-the-art video large language models, holding the potential to serve as a versatile video assistant for long-form video comprehension tasks and satisfy realistic user requirements.
VLTSeg: Simple Transfer of CLIP-Based Vision-Language Representations for Domain Generalized Semantic Segmentation
results: Achieves a new domain-generalization SOTA with a 7.6% mIoU improvement when training on the synthetic GTA5 dataset, and reaches 76.48% mIoU on the Cityscapes-to-ACDC benchmark, surpassing the previous SOTA method by 6.9% mIoU; the method also shows strong in-domain generalization with 86.1% mIoU on the Cityscapes test set.Abstract
Domain generalization (DG) remains a significant challenge for perception based on deep neural networks (DNN), where domain shifts occur due to lighting, weather, or geolocation changes. In this work, we propose VLTSeg to enhance domain generalization in semantic segmentation, where the network is solely trained on the source domain and evaluated on unseen target domains. Our method leverages the inherent semantic robustness of vision-language models. First, by substituting traditional vision-only backbones with pre-trained encoders from CLIP and EVA-CLIP as transfer learning setting we find that in the field of DG, vision-language pre-training significantly outperforms supervised and self-supervised vision pre-training. We thus propose a new vision-language approach for domain generalized segmentation, which improves the domain generalization SOTA by 7.6% mIoU when training on the synthetic GTA5 dataset. We further show the superior generalization capabilities of vision-language segmentation models by reaching 76.48% mIoU on the popular Cityscapes-to-ACDC benchmark, outperforming the previous SOTA approach by 6.9% mIoU on the test set at the time of writing. Additionally, our approach shows strong in-domain generalization capabilities indicated by 86.1% mIoU on the Cityscapes test set, resulting in a shared first place with the previous SOTA on the current leaderboard at the time of submission.
Action Inference by Maximising Evidence: Zero-Shot Imitation from Observation with World Models
methods: The method has two phases. In the first, the agent learns a world model from its own past experience by maximizing the ELBO, thereby understanding its own embodiment. In the second, the agent receives observation-only demonstrations of an expert performing a novel task and imitates the expert's behavior by defining a policy as an inference model and maximizing the evidence of the demonstration under the policy and world model.
results: Validated on the Walker and Cheetah embodiments of the DeepMind Control Suite, the method outperforms state-of-the-art baselines on zero-shot imitation.Abstract
Unlike most reinforcement learning agents which require an unrealistic amount of environment interactions to learn a new behaviour, humans excel at learning quickly by merely observing and imitating others. This ability highly depends on the fact that humans have a model of their own embodiment that allows them to infer the most likely actions that led to the observed behaviour. In this paper, we propose Action Inference by Maximising Evidence (AIME) to replicate this behaviour using world models. AIME consists of two distinct phases. In the first phase, the agent learns a world model from its past experience to understand its own body by maximising the ELBO. While in the second phase, the agent is given some observation-only demonstrations of an expert performing a novel task and tries to imitate the expert's behaviour. AIME achieves this by defining a policy as an inference model and maximising the evidence of the demonstration under the policy and world model. Our method is "zero-shot" in the sense that it does not require further training for the world model or online interactions with the environment after given the demonstration. We empirically validate the zero-shot imitation performance of our method on the Walker and Cheetah embodiment of the DeepMind Control Suite and find it outperforms the state-of-the-art baselines. Code is available at: https://github.com/argmax-ai/aime.
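A simplified sketch of the second phase: given a frozen, differentiable world model, treat the unknown actions as free parameters and optimize them so the model assigns high evidence to the expert's observations. The paper amortizes this with a policy network and an ELBO objective; here the actions are optimized directly under a Gaussian observation model, and `world_model` is a hypothetical one-step predictor:

```python
# Hedged, simplified sketch of action inference by maximizing evidence.
import torch

def infer_actions(world_model, demo_obs, action_dim, steps=200, lr=1e-2):
    """demo_obs: (T+1, obs_dim) expert observations; returns inferred actions."""
    T = demo_obs.shape[0] - 1
    actions = torch.zeros(T, action_dim, requires_grad=True)
    opt = torch.optim.Adam([actions], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        pred = torch.stack([world_model(demo_obs[t], actions[t])
                            for t in range(T)])
        # Gaussian observation model => maximizing evidence ~ minimizing MSE.
        loss = ((pred - demo_obs[1:]) ** 2).mean()
        loss.backward()
        opt.step()
    return actions.detach()

# Toy usage with a linear "world model": o_{t+1} = o_t + 0.1 * a_t.
wm = lambda o, a: o + 0.1 * a
demo = torch.cumsum(torch.full((6, 2), 0.05), dim=0)
print(infer_actions(wm, demo, action_dim=2, steps=100))
```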
Towards Learning a Generalist Model for Embodied Navigation
results: Experiments show that our generalist model achieves state-of-the-art performance on CVDN, SOON, and ScanQA; specifically, it surpasses the previous best method on CVDN by a significant 29% margin in goal progress. The model also generalizes strongly to unseen tasks.Abstract
Building a generalist agent that can interact with the world is the intriguing target of AI systems, thus spurring the research for embodied navigation, where an agent is required to navigate according to instructions or respond to queries. Despite the major progress attained, previous works primarily focus on task-specific agents and lack generalizability to unseen scenarios. Recently, LLMs have presented remarkable capabilities across various fields, and provided a promising opportunity for embodied navigation. Drawing on this, we propose the first generalist model for embodied navigation, NaviLLM. It adapts LLMs to embodied navigation by introducing schema-based instruction. The schema-based instruction flexibly casts various tasks into generation problems, thereby unifying a wide range of tasks. This approach allows us to integrate diverse data sources from various datasets into the training, equipping NaviLLM with a wide range of capabilities required by embodied navigation. We conduct extensive experiments to evaluate the performance and generalizability of our model. The experimental results demonstrate that our unified model achieves state-of-the-art performance on CVDN, SOON, and ScanQA. Specifically, it surpasses the previous stats-of-the-art method by a significant margin of 29% in goal progress on CVDN. Moreover, our model also demonstrates strong generalizability and presents impressive results on unseen tasks, e.g., embodied question answering and 3D captioning.
A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly
paper_authors: Yifan Yao, Jinhao Duan, Kaidi Xu, Yuanfang Cai, Eric Sun, Yue Zhang
for: This paper explores the intersection of large language models (LLMs) with security and privacy, investigating how LLMs positively impact security and privacy, the potential risks and threats associated with their use, and inherent vulnerabilities within LLMs.
methods: The paper uses a comprehensive literature review to categorize findings into "The Good" (beneficial LLM applications), "The Bad" (offensive applications), and "The Ugly" (vulnerabilities and their defenses).
results: The paper finds that LLMs have proven to enhance code and data security, outperforming traditional methods, but they can also be harnessed for various attacks (particularly user-level attacks) due to their human-like reasoning abilities. It identifies areas that require further research, such as model and parameter extraction attacks and safe instruction tuning.Abstract
Large Language Models (LLMs), such as GPT-3 and BERT, have revolutionized natural language understanding and generation. They possess deep language comprehension, human-like text generation capabilities, contextual awareness, and robust problem-solving skills, making them invaluable in various domains (e.g., search engines, customer support, translation). In the meantime, LLMs have also gained traction in the security community, revealing security vulnerabilities and showcasing their potential in security-related tasks. This paper explores the intersection of LLMs with security and privacy. Specifically, we investigate how LLMs positively impact security and privacy, potential risks and threats associated with their use, and inherent vulnerabilities within LLMs. Through a comprehensive literature review, the paper categorizes findings into "The Good" (beneficial LLM applications), "The Bad" (offensive applications), and "The Ugly" (vulnerabilities and their defenses). We have some interesting findings. For example, LLMs have proven to enhance code and data security, outperforming traditional methods. However, they can also be harnessed for various attacks (particularly user-level attacks) due to their human-like reasoning abilities. We have identified areas that require further research efforts. For example, research on model and parameter extraction attacks is limited and often theoretical, hindered by LLM parameter scale and confidentiality. Safe instruction tuning, a recent development, requires more exploration. We hope that our work can shed light on the LLMs' potential to both bolster and jeopardize cybersecurity.
SARA-RT: Scaling up Robotics Transformers with Self-Adaptive Robust Attention
paper_authors: Isabel Leal, Krzysztof Choromanski, Deepali Jain, Avinava Dubey, Jake Varley, Michael Ryoo, Yao Lu, Frederick Liu, Vikas Sindhwani, Quan Vuong, Tamas Sarlos, Ken Oslund, Karol Hausman, Kanishka Rao
for: scales up Robotics Transformers (RT) for on-robot deployment
methods: up-training, converting pre-trained or fine-tuned Transformer-based robotic policies of quadratic time complexity into efficient linear-attention counterparts
results: speeds up RT-2 models and Point Cloud Transformer (PCT) robotic policies operating on large point clouds.Abstract
We present Self-Adaptive Robust Attention for Robotics Transformers (SARA-RT): a new paradigm for addressing the emerging challenge of scaling up Robotics Transformers (RT) for on-robot deployment. SARA-RT relies on the new method of fine-tuning proposed by us, called up-training. It converts pre-trained or already fine-tuned Transformer-based robotic policies of quadratic time complexity (including massive billion-parameter vision-language-action models or VLAs), into their efficient linear-attention counterparts maintaining high quality. We demonstrate the effectiveness of SARA-RT by speeding up: (a) the class of recently introduced RT-2 models, the first VLA robotic policies pre-trained on internet-scale data, as well as (b) Point Cloud Transformer (PCT) robotic policies operating on large point clouds. We complement our results with the rigorous mathematical analysis providing deeper insight into the phenomenon of SARA.
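The quadratic-to-linear conversion at the heart of SARA-RT can be illustrated with a generic kernelized linear-attention computation. The elu(x)+1 feature map below comes from the linear-transformer literature and is used purely for illustration; SARA's attention mechanism is its own construction:

```python
# Illustrative linear attention: O(N * d^2) instead of the O(N^2 * d) cost of
# softmax attention, by factoring through a feature map phi. Non-causal form.
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    # q, k, v: (B, N, d). phi(x) = elu(x) + 1 keeps features positive.
    q = F.elu(q) + 1
    k = F.elu(k) + 1
    kv = torch.einsum("bnd,bne->bde", k, v)            # sum_n phi(k_n) v_n^T
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(1)) + eps)  # normalizer
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)

q, k, v = (torch.randn(2, 128, 32) for _ in range(3))
print(linear_attention(q, k, v).shape)  # torch.Size([2, 128, 32])
```

Because the N x N attention matrix is never materialized, inference cost grows linearly with the number of tokens, which is what makes on-robot deployment of large policies tractable.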
Learning-Based Approaches to Predictive Monitoring with Conformal Statistical Guarantees
results: The paper presents a reliable prediction method that detects requirement violations of a system at runtime, together with a conformal-prediction-based uncertainty quantification method that identifies unreliable predictions.
Abstract
This tutorial focuses on efficient methods to predictive monitoring (PM), the problem of detecting at runtime future violations of a given requirement from the current state of a system. While performing model checking at runtime would offer a precise solution to the PM problem, it is generally computationally expensive. To address this scalability issue, several lightweight approaches based on machine learning have recently been proposed. These approaches work by learning an approximate yet efficient surrogate (deep learning) model of the expensive model checker. A key challenge remains to ensure reliable predictions, especially in safety-critical applications. We review our recent work on predictive monitoring, one of the first to propose learning-based approximations for CPS verification of temporal logic specifications and the first in this context to apply conformal prediction (CP) for rigorous uncertainty quantification. These CP-based uncertainty estimators offer statistical guarantees regarding the generalization error of the learning model, and they can be used to determine unreliable predictions that should be rejected. In this tutorial, we present a general and comprehensive framework summarizing our approach to the predictive monitoring of CPSs, examining in detail several variants determined by three main dimensions: system dynamics (deterministic, non-deterministic, stochastic), state observability, and semantics of requirements' satisfaction (Boolean or quantitative).
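A minimal split-conformal sketch of the rejection idea, assuming a scalar quantitative semantics where negative robustness means violation; the surrogate, calibration data, and threshold are all toy stand-ins for the tutorial's learned model checker.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy surrogate: a cheap model predicting a quantitative robustness value
# rho(state); here a noisy linear stand-in for the expensive model checker.
def surrogate(x):    return 2.0 * x - 1.0
def ground_truth(x): return 2.0 * x - 1.0 + 0.1 * rng.normal(size=x.shape)

# Split conformal prediction: calibrate nonconformity scores on held-out data.
x_cal = rng.uniform(size=500)
scores = np.abs(surrogate(x_cal) - ground_truth(x_cal))

alpha = 0.1  # target miscoverage rate
k = int(np.ceil((len(scores) + 1) * (1 - alpha)))
q = np.sort(scores)[k - 1]   # conformal quantile

# At runtime: [pred - q, pred + q] covers the truth w.p. >= 1 - alpha;
# predictions whose interval straddles the satisfaction threshold rho = 0
# are flagged as unreliable and rejected.
x_new = rng.uniform(size=5)
pred = surrogate(x_new)
unreliable = (pred - q < 0) & (pred + q > 0)
print(q, unreliable)
```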
Foundations for Transfer in Reinforcement Learning: A Taxonomy of Knowledge Modalities
results: The paper argues that generalisation and scalability in contemporary AI systems present substantial challenges and opportunities, and that a variety of methods and techniques are needed to improve them.
Abstract
Contemporary artificial intelligence systems exhibit rapidly growing abilities accompanied by the growth of required resources, expansive datasets and corresponding investments into computing infrastructure. Although earlier successes predominantly focus on constrained settings, recent strides in fundamental research and applications aspire to create increasingly general systems. This evolving landscape presents a dual panorama of opportunities and challenges in refining the generalisation and transfer of knowledge - the extraction from existing sources and adaptation as a comprehensive foundation for tackling new problems. Within the domain of reinforcement learning (RL), the representation of knowledge manifests through various modalities, including dynamics and reward models, value functions, policies, and the original data. This taxonomy systematically targets these modalities and frames its discussion based on their inherent properties and alignment with different objectives and mechanisms for transfer. Where possible, we aim to provide coarse guidance delineating approaches which address requirements such as limiting environment interactions, maximising computational efficiency, and enhancing generalisation across varying axes of change. Finally, we analyse reasons contributing to the prevalence or scarcity of specific forms of transfer, the inherent potential behind pushing these frontiers, and underscore the significance of transitioning from designed to learned transfer.
Federated Active Learning for Target Domain Generalisation
results: Experiments on FDG with and without AL, compared against conventional FDG and Federated Active Learning baselines, show that FEDALV outperforms contemporary methods in both accuracy and efficiency, reaching full-training target accuracy while sampling as little as 5% of the source clients' data.
Abstract
In this paper, we introduce an Active Learning framework in Federated Learning for Target Domain Generalisation, harnessing the strengths of both learning paradigms. Our framework, FEDALV, composed of Active Learning (AL) and Federated Domain Generalisation (FDG), enables generalisation of an image classification model trained from limited source-domain client data, without sharing images, to an unseen target domain. To this end, our FDG, FEDA, consists of two optimisation updates during training, one at the client and another at the server level. For the client, the introduced losses aim to reduce feature complexity and condition alignment, while in the server, the regularisation limits free energy biases between source and target obtained by the global model. The remaining component of FEDALV is AL with variable budgets, which queries the server to retrieve and sample the most informative local data for the targeted client. We performed multiple experiments on FDG w/ and w/o AL and compared with both conventional FDG baselines and Federated Active Learning baselines. Our extensive quantitative experiments demonstrate the superiority of our method in accuracy and efficiency compared to multiple contemporary methods. FEDALV manages to obtain the performance of the full training target accuracy while sampling as little as 5% of the source client's data.
paper_authors: Gabriel della Maggiora, Luis Alberto Croquevielle, Nikita Desphande, Harry Horsley, Thomas Heinis, Artur Yakimovich
for: This paper proposes learning the variance schedule of diffusion models to improve their performance on inverse problems.
methods: The variance schedule is learned as part of the training process, avoiding the difficulty and time cost of tuning it by hand.
results: Tested on two unrelated inverse problems, the approach matches or surpasses previous methods and fine-tuned diffusion models, showing that the schedule can be learned stably during training and adapts to different applications.
Abstract
Inverse problems aim to determine parameters from observations, a crucial task in engineering and science. Lately, generative models, especially diffusion models, have gained popularity in this area for their ability to produce realistic solutions and their good mathematical properties. Despite their success, an important drawback of diffusion models is their sensitivity to the choice of variance schedule, which controls the dynamics of the diffusion process. Fine-tuning this schedule for specific applications is crucial but time-costly and does not guarantee an optimal result. We propose a novel approach for learning the schedule as part of the training process. Our method supports probabilistic conditioning on data, provides high-quality solutions, and is flexible, proving able to adapt to different applications with minimum overhead. This approach is tested in two unrelated inverse problems: super-resolution microscopy and quantitative phase imaging, yielding comparable or superior results to previous methods and fine-tuned diffusion models. We conclude that fine-tuning the schedule by experimentation should be avoided because it can be learned during training in a stable way that yields better results.
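One way to make a noise schedule trainable, sketched under the assumption that monotonicity is the key constraint; the parameterization below (cumulative softplus increments) is an illustrative choice, not necessarily the paper's.

```python
import torch
import torch.nn as nn

class LearnableSchedule(nn.Module):
    """Monotone noise schedule gamma(t) on [0, 1], trained end to end.

    A minimal sketch: monotonicity is enforced by accumulating positive
    increments, so gamma can be optimized jointly with the denoiser
    instead of being hand-tuned. The paper's exact parameterization
    may differ.
    """
    def __init__(self, n_bins: int = 100):
        super().__init__()
        self.raw = nn.Parameter(torch.zeros(n_bins))  # unconstrained increments

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # Cumulative sum of softplus(raw) is nondecreasing; normalize to (0, 1].
        inc = torch.nn.functional.softplus(self.raw)
        cdf = torch.cumsum(inc, dim=0) / inc.sum()
        idx = (t.clamp(0, 1) * (len(cdf) - 1)).long()
        return cdf[idx]

schedule = LearnableSchedule()
t = torch.rand(8)
noise_var = schedule(t)          # noise level, differentiable in schedule.raw
alpha_bar = 1.0 - noise_var      # corresponding signal level at time t
print(alpha_bar, noise_var)
```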
Deep Reinforcement Learning for Community Battery Scheduling under Uncertainties of Load, PV Generation, and Energy Prices
For: This study addresses the scheduling of a community battery system with a deep reinforcement learning strategy (the soft actor-critic algorithm), in response to the growing uptake of distributed energy resources (DERs), to support renewable integration, reduce peak load, and enhance grid reliability.
* Methods: A deep reinforcement learning (RL) strategy, centered on the soft actor-critic algorithm, schedules the community battery under uncertainties such as solar photovoltaic (PV) generation, local demand, and real-time energy prices.
* Results: RL effectively solves the community battery scheduling problem, and among the compared algorithms the soft actor-critic achieves the best performance.
Abstract
In response to the growing uptake of distributed energy resources (DERs), community batteries have emerged as a promising solution to support renewable energy integration, reduce peak load, and enhance grid reliability. This paper presents a deep reinforcement learning (RL) strategy, centered around the soft actor-critic (SAC) algorithm, to schedule a community battery system in the presence of uncertainties, such as solar photovoltaic (PV) generation, local demand, and real-time energy prices. We position the community battery to play a versatile role, in integrating local PV energy, reducing peak load, and exploiting energy price fluctuations for arbitrage, thereby minimizing the system cost. To improve exploration and convergence during RL training, we utilize the noisy network technique. This paper conducts a comparative study of different RL algorithms, including proximal policy optimization (PPO) and deep deterministic policy gradient (DDPG) algorithms, to evaluate their effectiveness in the community battery scheduling problem. The results demonstrate the potential of RL in addressing community battery scheduling challenges and show that the SAC algorithm achieves the best performance compared to RL and optimization benchmarks.
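A toy environment illustrating the decision problem the SAC agent faces; the state variables, bounds, and the purchase-cost reward below are assumptions for illustration, and a real agent (e.g., SAC from an RL library) would be trained against this interface.

```python
import numpy as np

class CommunityBatteryEnv:
    """Minimal battery-scheduling sketch (not the paper's exact model):
    state = (PV, load, price, state of charge); action = battery
    charge/discharge power; cost = grid import * real-time price."""

    def __init__(self, capacity=100.0, max_power=25.0, dt=1.0):
        self.capacity, self.max_power, self.dt = capacity, max_power, dt
        self.rng = np.random.default_rng(0)
        self.soc = 0.5 * capacity

    def step(self, action):
        power = float(np.clip(action, -self.max_power, self.max_power))
        # Keep the state of charge within physical bounds.
        self.soc = float(np.clip(self.soc + power * self.dt, 0.0, self.capacity))
        pv = self.rng.uniform(0, 20)        # uncertain PV generation
        load = self.rng.uniform(5, 30)      # uncertain local demand
        price = self.rng.uniform(0.1, 0.5)  # uncertain real-time price
        grid = load - pv + power            # >0: import from the grid
        reward = -max(grid, 0.0) * price    # minimize energy purchase cost
        obs = np.array([pv, load, price, self.soc / self.capacity])
        return obs, reward, False, {}

env = CommunityBatteryEnv()
obs, r, done, _ = env.step(action=10.0)
print(obs, r)
```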
Correlation and Unintended Biases on Univariate and Multivariate Decision Trees
results: Although multivariate DTs are in principle more expressive, univariate DTs achieve comparable performance in practice, which the authors attribute to likely bias in existing benchmark datasets.
Abstract
Decision Trees are accessible, interpretable, and well-performing classification models. A plethora of variants with increasing expressiveness has been proposed in the last forty years. We contrast the two families of univariate DTs, whose split functions partition data through axis-parallel hyperplanes, and multivariate DTs, whose splits instead partition data through oblique hyperplanes. The latter include the former, hence multivariate DTs are in principle more powerful. Surprisingly enough, however, univariate DTs consistently show comparable performances in the literature. We analyze the reasons behind this, both with synthetic and real-world benchmark datasets. Our research questions test whether the pre-processing phase of removing correlation among features in datasets has an impact on the relative performances of univariate vs multivariate DTs. We find that existing benchmark datasets are likely biased towards favoring univariate DTs.
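The dataset-bias question can be probed cheaply on synthetic data: rotating features with PCA removes axis alignment, so comparing a univariate tree before and after rotation hints at how much axis-parallel splits benefit from correlated, axis-aligned structure. This is a proxy experiment, not the paper's protocol (scikit-learn has no oblique trees).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with deliberately correlated (redundant) features.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           n_redundant=3, random_state=0)

univariate = DecisionTreeClassifier(max_depth=6, random_state=0)
decorrelated = make_pipeline(PCA(whiten=True, random_state=0),
                             DecisionTreeClassifier(max_depth=6, random_state=0))

print("raw features:        ", cross_val_score(univariate, X, y, cv=5).mean())
print("PCA-rotated features:", cross_val_score(decorrelated, X, y, cv=5).mean())
```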
Unleashing the Potential of Large Language Model: Zero-shot VQA for Flood Disaster Scenario
methods: The model uses chain-of-thought (CoT) demonstrations to unlock the potential of large language models, and a new image question answering dataset is built to evaluate it.
results: Experiments show that CoT prompts greatly improve accuracy on complex questions, providing a basis for follow-up VQA research.
Abstract
Visual question answering (VQA) is a fundamental and essential AI task, and VQA-based disaster scenario understanding is a hot research topic. For instance, we can ask questions about a disaster image by the VQA model and the answer can help identify whether anyone or anything is affected by the disaster. However, previous VQA models for disaster damage assessment have some shortcomings, such as limited candidate answer space, monotonous question types, and limited answering capability of existing models. In this paper, we propose a zero-shot VQA model named Zero-shot VQA for Flood Disaster Damage Assessment (ZFDDA). It is a VQA model for damage assessment without pre-training. Also, with flood disaster as the main research object, we build a Freestyle Flood Disaster Image Question Answering dataset (FFD-IQA) to evaluate our VQA model. This new dataset expands the question types to include free-form, multiple-choice, and yes-no questions. At the same time, we expand the size of the previous dataset to contain a total of 2,058 images and 22,422 question-meta ground truth pairs. Most importantly, our model uses well-designed chain of thought (CoT) demonstrations to unlock the potential of the large language model, allowing zero-shot VQA to show better performance in disaster scenarios. The experimental results show that the accuracy in answering complex questions is greatly improved with CoT prompts. Our study provides a research basis for subsequent research of VQA for other disaster scenarios.
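A hedged sketch of what a CoT demonstration for flood-damage VQA might look like; the demonstration text and the `query_vlm` hook are invented placeholders, since the paper's actual prompts are its own design.

```python
# Sketch of a chain-of-thought prompt for zero-shot flood-damage VQA.
COT_DEMO = """Q: Is anyone in this image affected by the flood?
Reasoning: First locate people. Then check whether water reaches them or
blocks their path. If water surrounds a person or vehicle, they are affected.
A: Yes, two people standing in waist-deep water are affected."""

def build_prompt(question: str) -> str:
    return (f"{COT_DEMO}\n\n"
            f"Q: {question}\n"
            f"Reasoning: Let's think step by step.")

def query_vlm(image_path: str, prompt: str) -> str:
    # Placeholder: plug in whichever vision-language model is used.
    raise NotImplementedError

print(build_prompt("How many buildings are surrounded by flood water?"))
```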
Modular Control Architecture for Safe Marine Navigation: Reinforcement Learning and Predictive Safety Filters
results: The PSF maintains safety without hindering the RL agent's learning rate or performance; on a simulated Cybership II model for marine navigation, the RL agent is trained for path following and collision avoidance while the PSF monitors and modifies control actions to ensure safety.
Abstract
Many autonomous systems face safety challenges, requiring robust closed-loop control to handle physical limitations and safety constraints. Real-world systems, like autonomous ships, encounter nonlinear dynamics and environmental disturbances. Reinforcement learning is increasingly used to adapt to complex scenarios, but standard frameworks ensuring safety and stability are lacking. Predictive Safety Filters (PSF) offer a promising solution, ensuring constraint satisfaction in learning-based control without explicit constraint handling. This modular approach allows using arbitrary control policies, with the safety filter optimizing proposed actions to meet physical and safety constraints. We apply this approach to marine navigation, combining RL with PSF on a simulated Cybership II model. The RL agent is trained on path following and collision avoidance, while the PSF monitors and modifies control actions for safety. Results demonstrate the PSF's effectiveness in maintaining safety without hindering the RL agent's learning rate and performance, evaluated against a standard RL agent without PSF.
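The "minimally invasive" filtering idea in one step: project the policy's proposed control onto a safe set. A real PSF solves this over a prediction horizon with the vessel dynamics; the toy constraint below is an assumption for illustration.

```python
import numpy as np
from scipy.optimize import minimize

U_MAX = 1.0                        # actuator limit

def safety_margin(u, state):
    # Toy constraint: predicted distance to an obstacle must stay positive.
    dist, closing_speed = state
    return dist - 0.5 * (closing_speed + u)   # >= 0 required

def safety_filter(u_rl, state):
    # Find the control closest to the RL proposal that is safe.
    res = minimize(lambda u: (u[0] - u_rl) ** 2,
                   x0=[u_rl],
                   bounds=[(-U_MAX, U_MAX)],
                   constraints=[{"type": "ineq",
                                 "fun": lambda u: safety_margin(u[0], state)}])
    return float(res.x[0])

u_rl = 0.9                          # aggressive proposal from the policy
state = (0.6, 0.8)                  # close obstacle, high closing speed
print("filtered control:", safety_filter(u_rl, state))   # reduced to ~0.4
```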
Exploring the Viability of Synthetic Audio Data for Audio-Based Dialogue State Tracking
paper_authors: Jihyun Lee, Yejin Jeon, Wonjun Lee, Yunsu Kim, Gary Geunbae Lee
for: This paper studies dialogue state tracking (DST) in the audio modality.
methods: The authors develop cascading and end-to-end models and train them with their own synthetic audio data.
results: Models trained on synthetic audio data generalize to real human speech, a finding that helps address the practical deployment of audio-based DST. Data and code are available at https://github.com/JihyunLee1/E2E-DST.
Abstract
Dialogue state tracking plays a crucial role in extracting information in task-oriented dialogue systems. However, preceding research are limited to textual modalities, primarily due to the shortage of authentic human audio datasets. We address this by investigating synthetic audio data for audio-based DST. To this end, we develop cascading and end-to-end models, train them with our synthetic audio dataset, and test them on actual human speech data. To facilitate evaluation tailored to audio modalities, we introduce a novel PhonemeF1 to capture pronunciation similarity. Experimental results showed that models trained solely on synthetic datasets can generalize their performance to human voice data. By eliminating the dependency on human speech data collection, these insights pave the way for significant practical advancements in audio-based DST. Data and code are available at https://github.com/JihyunLee1/E2E-DST.
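A guess at the shape of a PhonemeF1-style metric: F1 overlap between the phoneme sequences of a predicted and a reference slot value. The exact definition is the paper's; the `g2p` function here is a trivial placeholder for a real grapheme-to-phoneme system.

```python
from collections import Counter

def g2p(word: str) -> list[str]:
    # Placeholder G2P: real systems would use a pronunciation lexicon
    # or a trained grapheme-to-phoneme model.
    return list(word.lower())

def phoneme_f1(pred: str, ref: str) -> float:
    p, r = Counter(g2p(pred)), Counter(g2p(ref))
    overlap = sum((p & r).values())      # multiset intersection size
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

print(phoneme_f1("seattle", "seatle"))   # near-homophones score high
print(phoneme_f1("seattle", "boston"))   # dissimilar strings score low
```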
Integrated Drill Boom Hole-Seeking Control via Reinforcement Learning
results: improves hole-seeking accuracy throughout the drilling process, along with drilling and time efficiency
Abstract
Intelligent drill boom hole-seeking is a promising technology for enhancing drilling efficiency, mitigating potential safety hazards, and relieving human operators. Most existing intelligent drill boom control methods rely on a hierarchical control framework based on inverse kinematics. However, these methods are generally time-consuming due to the computational complexity of inverse kinematics and the inefficiency of the sequential execution of multiple joints. To tackle these challenges, this study proposes an integrated drill boom control method based on Reinforcement Learning (RL). We develop an integrated drill boom control framework that utilizes a parameterized policy to directly generate control inputs for all joints at each time step, taking advantage of joint posture and target hole information. By formulating the hole-seeking task as a Markov decision process, contemporary mainstream RL algorithms can be directly employed to learn a hole-seeking policy, thus eliminating the need for inverse kinematics solutions and promoting cooperative multi-joint control. To enhance the drilling accuracy throughout the entire drilling process, we devise a state representation that combines Denavit-Hartenberg joint information and preview hole-seeking discrepancy data. Simulation results show that the proposed method significantly outperforms traditional methods in terms of hole-seeking accuracy and time efficiency.
Learning Machine Morality through Experience and Interaction
paper_authors: Elizaveta Tennant, Stephen Hailes, Mirco Musolesi
for: This work aims to embed morality into autonomous agents to ensure the safety of next-generation Artificial Intelligence (AI) systems.
methods: Learning from experience (Reinforcement Learning) is used to explicitly provide moral principles to learning agents.
results: Hybrid approaches can create more adaptable, controllable, and interpretable agents, and can represent classical moral frameworks.
Abstract
Increasing interest in ensuring safety of next-generation Artificial Intelligence (AI) systems calls for novel approaches to embedding morality into autonomous agents. Traditionally, this has been done by imposing explicit top-down rules or hard constraints on systems, for example by filtering system outputs through pre-defined ethical rules. Recently, instead, entirely bottom-up methods for learning implicit preferences from human behavior have become increasingly popular, such as those for training and fine-tuning Large Language Models. In this paper, we provide a systematization of existing approaches to the problem of introducing morality in machines - modeled as a continuum, and argue that the majority of popular techniques lie at the extremes - either being fully hard-coded, or entirely learned, where no explicit statement of any moral principle is required. Given the relative strengths and weaknesses of each type of methodology, we argue that more hybrid solutions are needed to create adaptable and robust, yet more controllable and interpretable agents. In particular, we present three case studies of recent works which use learning from experience (i.e., Reinforcement Learning) to explicitly provide moral principles to learning agents - either as intrinsic rewards, moral logical constraints or textual principles for language models. For example, using intrinsic rewards in Social Dilemma games, we demonstrate how it is possible to represent classical moral frameworks for agents. We also present an overview of the existing work in this area in order to provide empirical evidence for the potential of this hybrid approach. We then discuss strategies for evaluating the effectiveness of moral learning agents. Finally, we present open research questions and implications for the future of AI safety and ethics which are emerging from this framework.
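The intrinsic-reward route can be made concrete in a one-shot social dilemma: add a utilitarian shaping term to each agent's payoff. The specific term and weight below are illustrative choices, not the paper's exact formulation.

```python
# Sketch of "intrinsic moral rewards" in a social dilemma: a utilitarian
# shaping term (mean payoff of all agents) is added to the raw reward.

def moral_reward(env_reward: float, all_rewards: list[float],
                 weight: float = 5.0) -> float:
    collective = sum(all_rewards) / len(all_rewards)  # utilitarian term
    return env_reward + weight * collective

# One-shot prisoner's dilemma payoffs (row player, column player).
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

for actions, (r1, r2) in PAYOFF.items():
    # With this weight, mutual cooperation becomes the row player's best
    # outcome under the shaped reward, unlike under the raw payoff.
    print(actions, "raw:", r1, "shaped:", moral_reward(r1, [r1, r2]))
```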
Energy-based Potential Games for Joint Motion Forecasting and Control
results: The analysis shows that the game-theoretic layer improves the predictive performance of various neural network backbones and adds interpretability.
Abstract
This work uses game theory as a mathematical framework to address interaction modeling in multi-agent motion forecasting and control. Despite its interpretability, applying game theory to real-world robotics, like automated driving, faces challenges such as unknown game parameters. To tackle these, we establish a connection between differential games, optimal control, and energy-based models, demonstrating how existing approaches can be unified under our proposed Energy-based Potential Game formulation. Building upon this, we introduce a new end-to-end learning application that combines neural networks for game-parameter inference with a differentiable game-theoretic optimization layer, acting as an inductive bias. The analysis provides empirical evidence that the game-theoretic layer adds interpretability and improves the predictive performance of various neural network backbones using two simulations and two real-world driving datasets.
for: This work turns cone distribution functions from statistics into tools for multi-criteria decision making (MCDM).
methods: The procedure is shown to act as an upgrade of the weighted sum scalarization, absorbing a whole collection of weighted sum scalarizations at once instead of fixing a particular one in advance.
results: Situations in which different types of rank reversal occur are characterized, and it is explained why this can even be useful for analyzing the ranking procedure.
Abstract
Recently introduced cone distribution functions from statistics are turned into multi-criteria decision making (MCDM) tools. It is demonstrated that this procedure can be considered as an upgrade of the weighted sum scalarization insofar as it absorbs a whole collection of weighted sum scalarizations at once instead of fixing a particular one in advance. Moreover, situations are characterized in which different types of rank reversal occur, and it is explained why this might even be useful for analyzing the ranking procedure. A few examples will be discussed and a potential application in machine learning is outlined.
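A toy numeric hint of the "absorbing a collection of scalarizations" idea: score each alternative by its worst case over a family of weight vectors rather than by one fixed vector. This min-over-weights construction is a simplification of cone distribution functions, used only to show how rankings (and rank reversals) change.

```python
import numpy as np

alternatives = np.array([[0.9, 0.2],    # strong on criterion 1 only
                         [0.2, 0.9],    # strong on criterion 2 only
                         [0.6, 0.6]])   # balanced

single_w = np.array([0.8, 0.2])                          # one fixed scalarization
weight_family = np.array([[0.8, 0.2], [0.5, 0.5], [0.2, 0.8]])

single_scores = alternatives @ single_w
robust_scores = (alternatives @ weight_family.T).min(axis=1)

# The balanced alternative moves from last to first: a rank reversal.
print("single weighted sum ranking:   ", np.argsort(-single_scores))
print("worst-case-over-family ranking:", np.argsort(-robust_scores))
```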
LLM A*: Human in the Loop Large Language Models Enabled A* Search for Robotics
results: Compared with A* and reinforcement-learning-based path planning, LLM A* is more efficient in terms of search space, achieves paths on par with A* and better than RL, and its interactive nature makes it a promising tool for collaborative human-robot tasks.
Abstract
This research focuses on how Large Language Models (LLMs) can help with path planning for mobile embodied agents such as robots, in a human-in-the-loop and interactive manner. A novel framework named LLM A*, aims to leverage the commonsense of LLMs, and the utility-optimal A* is proposed to facilitate few-shot near-optimal path planning. Prompts are used to 1) provide LLMs with essential information like environment, cost, heuristics, etc.; 2) communicate human feedback to LLMs on intermediate planning results. This makes the whole path planning process a `white box' and human feedback guides LLM A* to converge quickly compared to other data-driven methods such as reinforcement learning-based (RL) path planning. In addition, it makes code-free path planning practical, henceforth promoting the inclusiveness of artificial intelligence techniques. Comparative analysis against A* and RL shows that LLM A* is more efficient in terms of search space and achieves an on-a-par path with A* and a better path than RL. The interactive nature of LLM A* also makes it a promising tool for deployment in collaborative human-robot tasks.
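Structurally, LLM A* is still A*; the sketch below marks where the LLM would plug in. The grid, costs, and the Manhattan-distance stand-in for `llm_heuristic` are assumptions, since a real system would parse the heuristic (and human feedback) from model responses.

```python
import heapq

def llm_heuristic(cell, goal):
    # Stand-in: a real system would obtain this guidance from an LLM
    # prompted with the environment, costs, and goal.
    return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

def a_star(start, goal, blocked, size=8):
    frontier = [(llm_heuristic(start, goal), 0, start, [start])]
    seen = set()
    while frontier:
        _, g, cell, path = heapq.heappop(frontier)
        if cell == goal:
            return path
        if cell in seen:
            continue
        seen.add(cell)
        x, y = cell
        for nxt in [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]:
            if (0 <= nxt[0] < size and 0 <= nxt[1] < size
                    and nxt not in blocked and nxt not in seen):
                f = g + 1 + llm_heuristic(nxt, goal)
                heapq.heappush(frontier, (f, g + 1, nxt, path + [nxt]))
    return None

print(a_star((0, 0), (5, 5), blocked={(1, 1), (2, 2), (3, 3)}))
```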
Contrastive Learning-Based Spectral Knowledge Distillation for Multi-Modality and Missing Modality Scenarios in Semantic Segmentation
results: CSK-Net surpasses state-of-the-art models on three public benchmark datasets and achieves performance gains under missing modalities without additional computational cost.
Abstract
Improving the performance of semantic segmentation models using multispectral information is crucial, especially for environments with low-light and adverse conditions. Multi-modal fusion techniques pursue either the learning of cross-modality features to generate a fused image or engage in knowledge distillation but address multimodal and missing modality scenarios as distinct issues, which is not an optimal approach for multi-sensor models. To address this, a novel multi-modal fusion approach called CSK-Net is proposed, which uses a contrastive learning-based spectral knowledge distillation technique along with an automatic mixed feature exchange mechanism for semantic segmentation in optical (EO) and infrared (IR) images. The distillation scheme extracts detailed textures from the optical images and distills them into the optical branch of CSK-Net. The model encoder consists of shared convolution weights with separate batch norm (BN) layers for both modalities, to capture the multi-spectral information from different modalities of the same objects. A Novel Gated Spectral Unit (GSU) and mixed feature exchange strategy are proposed to increase the correlation of modality-shared information and decrease the modality-specific information during the distillation process. Comprehensive experiments show that CSK-Net surpasses state-of-the-art models in multi-modal tasks and for missing modalities when exclusively utilizing IR data for inference across three public benchmarking datasets. For missing modality scenarios, the performance increase is achieved without additional computational costs compared to the baseline segmentation models.
Developing Linguistic Patterns to Mitigate Inherent Human Bias in Offensive Language Detection
results: The approach can improve the accuracy of offensive language classification tasks across multiple languages and reduce the prevalence of offensive content on social media.
Abstract
With the proliferation of social media, there has been a sharp increase in offensive content, particularly targeting vulnerable groups, exacerbating social problems such as hatred, racism, and sexism. Detecting offensive language use is crucial to prevent offensive language from being widely shared on social media. However, the accurate detection of irony, implication, and various forms of hate speech on social media remains a challenge. Natural language-based deep learning models require extensive training with large, comprehensive, and labeled datasets. Unfortunately, manually creating such datasets is both costly and error-prone. Additionally, the presence of human-bias in offensive language datasets is a major concern for deep learning models. In this paper, we propose a linguistic data augmentation approach to reduce bias in labeling processes, which aims to mitigate the influence of human bias by leveraging the power of machines to improve the accuracy and fairness of labeling processes. This approach has the potential to improve offensive language classification tasks across multiple languages and reduce the prevalence of offensive content on social media.
CZL-CIAE: CLIP-driven Zero-shot Learning for Correcting Inverse Age Estimation
paper_authors: Yuntao Shou, Wei Ai, Tao Meng, Keqin Li
for: zero-shot age estimation for improving efficiency and accuracy of various applications such as age verification and secure access control, and promoting research on multi-modal and zero-shot learning in the social media field.
methods: CLIP model for extracting image features and text semantic information, and a new Transformer architecture (FourierFormer) for fusing image and text semantic information, and reversible age estimation with end-to-end error feedback.
results: better age prediction results through extensive experiments on multiple data sets.
Abstract
Zero-shot age estimation aims to learn feature information about age from input images and make inferences about a given person's image or video frame without specific sample data. The development of zero-shot age estimation can improve the efficiency and accuracy of various applications (e.g., age verification and secure access control, etc.), while also promoting research on multi-modal and zero-shot learning in the social media field. For example, zero-sample age estimation can be used to create social networks focused on specific age groups. However, existing methods mainly focus on supervised, labeled age estimation learning, and the prediction effect of zero-shot learning is very poor. To tackle the above issues, we propose a novel CLIP-driven Zero-shot Learning for Correcting Inverse Age Estimation (CZL-CIAE). Specifically, we first introduce the CLIP model to extract image features and text semantic information respectively, and map them into a highly semantically aligned high-dimensional feature space. Next, we designed a new Transformer architecture (i.e., FourierFormer) to achieve channel evolution and spatial interaction of images, and to fuse image and text semantic information. Finally, we introduce reversible age estimation, which uses end-to-end error feedback to reduce the error rate of age predictions. Through extensive experiments on multiple data sets, CZL-CIAE has achieved better age prediction results.
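The plain CLIP zero-shot baseline that CZL-CIAE builds on can be sketched with the Hugging Face CLIP API; the prompt template, the expected-age readout, and the input file are illustrative, and the paper's FourierFormer fusion and reversible error feedback are not modeled here.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

ages = list(range(1, 101))
prompts = [f"a photo of a {a} year old person" for a in ages]
image = Image.open("face.jpg")  # hypothetical input image

inputs = processor(text=prompts, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]

# Read out age as the probability-weighted mean over the prompt grid.
expected_age = sum(p.item() * a for p, a in zip(probs, ages))
print(f"predicted age: {expected_age:.1f}")
```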
A Comprehensive Literature Review on Sweet Orange Leaf Diseases
paper_authors: Yousuf Rayhan Emon, Md Golam Rabbani, Dr. Md. Taimur Ahad, Faruk Ahmed
for: This study aims at automated systems for the early detection and diagnosis of sweet orange leaf diseases, to improve agricultural productivity.
methods: The reviewed work applies various image-processing techniques and machine learning models, including Vision Transformer (ViT), Convolutional Neural Networks (CNN), CNN with SoftMax and RBF SVM, Hybrid CNN-SVM, HLB-ConvMLP, EfficientNet-b0, YOLOv5, YOLOv7, and deep CNNs, tested on various datasets for disease detection.
results: The compared models reach solid accuracy, precision, and recall in detecting sweet orange leaf diseases, and show broad application prospects and potential commercial value.
Abstract
Sweet orange leaf diseases significantly affect agricultural productivity, and leaf diseases impact fruit quality in the citrus industry. The advent of machine learning has enabled the development of automated disease detectors. Early detection and diagnosis are necessary for leaf management. Automated systems for predicting sweet orange leaf disease have already been developed using different image-processing techniques. This comprehensive literature review systematically covers leaf diseases and the machine learning methodologies applied to detecting damaged leaves via image classification. It examines the benefits and limitations of different machine learning models, including Vision Transformer (ViT), Convolutional Neural Network (CNN), CNN with SoftMax and RBF SVM, Hybrid CNN-SVM, HLB-ConvMLP, EfficientNet-b0, YOLOv5, YOLOv7, and deep CNNs. These machine learning models were tested on various datasets and detected the diseases. The review compares the performance of the models using the accuracy, precision, recall, and related metrics reported in the existing studies.
Model-based Deep Learning for Beam Prediction based on a Channel Chart
paper_authors: Taha Yassine, Baptiste Chatelier, Vincent Corlay, Matthieu Crussière, Stephane Paquelet, Olav Tirkkonen, Luc Le Magoarou
for: Channel charting builds a map of the radio environment in an unsupervised way, which can be used for various applications such as beam prediction.
methods: Advanced model-based neural network architectures are proposed for both channel charting and beam prediction.
results: Promising results are yielded on realistic synthetic channels.
Abstract
Channel charting builds a map of the radio environment in an unsupervised way. The obtained chart locations can be seen as low-dimensional compressed versions of channel state information that can be used for a wide variety of applications, including beam prediction. In non-standalone or cell-free systems, chart locations computed at a given base station can be transmitted to several other base stations (possibly operating at different frequency bands) for them to predict which beams to use. This potentially yields a dramatic reduction of the overhead due to channel estimation or beam management, since only the base station performing charting requires channel state information, the others directly predicting the beam from the chart location. In this paper, advanced model-based neural network architectures are proposed for both channel charting and beam prediction. The proposed methods are assessed on realistic synthetic channels, yielding promising results.
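The interface between the two stages is simple to sketch: chart locations in, beam indices out. The plain MLP below is a placeholder for the paper's model-based architectures; the sizes and labels are invented.

```python
import torch
import torch.nn as nn

N_BEAMS = 64

beam_predictor = nn.Sequential(
    nn.Linear(2, 64), nn.ReLU(),     # 2-D chart location
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, N_BEAMS),          # logits over the beam codebook
)

chart_locations = torch.rand(32, 2)              # from the charting BS
best_beams = torch.randint(0, N_BEAMS, (32,))    # labels from beam sweeps

loss = nn.functional.cross_entropy(beam_predictor(chart_locations), best_beams)
loss.backward()
print("predicted beams:", beam_predictor(chart_locations).argmax(dim=-1)[:5])
```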
Cybersecurity threats in FinTech: A systematic review
results: The in-depth analysis offers invaluable insights for stakeholders ranging from banks and enterprises to global governmental bodies, highlighting current challenges and effective countermeasures as well as directions for future research.
Abstract
The rapid evolution of the Smart-everything movement and Artificial Intelligence (AI) advancements have given rise to sophisticated cyber threats that traditional methods cannot counteract. Cyber threats are extremely critical in financial technology (FinTech) as a data-centric sector expected to provide 24/7 services. This paper introduces a novel and refined taxonomy of security threats in FinTech and conducts a comprehensive systematic review of defensive strategies. Through PRISMA methodology applied to 74 selected studies and topic modeling, we identified 11 central cyber threats, with 43 papers detailing them, and pinpointed 9 corresponding defense strategies, as covered in 31 papers. This in-depth analysis offers invaluable insights for stakeholders ranging from banks and enterprises to global governmental bodies, highlighting both the current challenges in FinTech and effective countermeasures, as well as directions for future research.
X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model
paper_authors: Lingmin Ran, Xiaodong Cun, Jia-Wei Liu, Rui Zhao, Song Zijie, Xintao Wang, Jussi Keppo, Mike Zheng Shou
For: The paper aims to enable pretrained plug-and-play modules (e.g., ControlNet, LoRA) to work directly with an upgraded text-to-image diffusion model (e.g., SDXL) without further retraining.
* Methods: The proposed method, called X-Adapter, keeps a frozen copy of the old model to preserve the connectors of different plugins, and adds trainable mapping layers that bridge the decoders of models of different versions for feature remapping.
* Results: X-Adapter demonstrates universal compatibility with various plugins and enables plugins of different versions to work together, expanding the functionalities of the diffusion community; extensive experiments show its effectiveness in facilitating wider application in the upgraded foundational diffusion model.
Abstract
We introduce X-Adapter, a universal upgrader to enable the pretrained plug-and-play modules (e.g., ControlNet, LoRA) to work directly with the upgraded text-to-image diffusion model (e.g., SDXL) without further retraining. We achieve this goal by training an additional network to control the frozen upgraded model with the new text-image data pairs. In detail, X-Adapter keeps a frozen copy of the old model to preserve the connectors of different plugins. Additionally, X-Adapter adds trainable mapping layers that bridge the decoders from models of different versions for feature remapping. The remapped features will be used as guidance for the upgraded model. To enhance the guidance ability of X-Adapter, we employ a null-text training strategy for the upgraded model. After training, we also introduce a two-stage denoising strategy to align the initial latents of X-Adapter and the upgraded model. Thanks to our strategies, X-Adapter demonstrates universal compatibility with various plugins and also enables plugins of different versions to work together, thereby expanding the functionalities of diffusion community. To verify the effectiveness of the proposed method, we conduct extensive experiments and the results show that X-Adapter may facilitate wider application in the upgraded foundational diffusion model.
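The wiring can be caricatured with linear stand-ins: both base decoders stay frozen and only the mapping layers train. The dimensions are invented; the real X-Adapter operates on U-Net decoder features.

```python
import torch
import torch.nn as nn

class Mapper(nn.Module):
    # Trainable mapping layer bridging old-model features to the new model.
    def __init__(self, d_old, d_new):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(d_old, d_new), nn.SiLU(),
                                  nn.Linear(d_new, d_new))
    def forward(self, h):
        return self.proj(h)

old_decoder = nn.Linear(320, 320)    # stand-in for the old (e.g., SD1.5) decoder
new_decoder = nn.Linear(640, 640)    # stand-in for the upgraded (e.g., SDXL) decoder
for p in list(old_decoder.parameters()) + list(new_decoder.parameters()):
    p.requires_grad_(False)          # both base models stay frozen

mapper = Mapper(320, 640)            # the only trainable part

h_old = old_decoder(torch.randn(4, 320))   # features including plugin effects
guidance = mapper(h_old)                   # remapped to the new model's width
h_new = new_decoder(torch.randn(4, 640)) + guidance
print(h_new.shape, sum(p.requires_grad for p in mapper.parameters()))
```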
Divide-and-Conquer Strategy for Large-Scale Dynamic Bayesian Network Structure Learning
For: The paper is written for researchers and practitioners who work with large-scale Bayesian networks, particularly in the fields of gene expression analysis, healthcare, and traffic prediction.
* Methods: The paper introduces a novel divide-and-conquer strategy for large-scale structure learning of dynamic Bayesian networks (DBNs), originally developed for static Bayesian networks (BNs). The approach leverages prior knowledge of 2-time-sliced Bayesian networks (2-TBNs) to enhance performance.
* Results: The paper shows that the proposed approach significantly improves the scalability and accuracy of 2-TBN structure learning. Experimental results demonstrate substantial improvements over existing algorithms in both computational efficiency and structure learning accuracy, with an average runtime reduction of 93.65% and average improvements of 74.45% and 110.94% in two accuracy metrics, respectively.
Abstract
Dynamic Bayesian Networks (DBNs), renowned for their interpretability, have become increasingly vital in representing complex stochastic processes in various domains such as gene expression analysis, healthcare, and traffic prediction. Structure learning of DBNs from data is challenging, particularly for datasets with thousands of variables. Most current algorithms for DBN structure learning are adaptations from those used in static Bayesian Networks (BNs), and are typically focused on small-scale problems. In order to solve large-scale problems while taking full advantage of existing algorithms, this paper introduces a novel divide-and-conquer strategy, originally developed for static BNs, and adapts it for large-scale DBN structure learning. In this work, we specifically concentrate on 2 Time-sliced Bayesian Networks (2-TBNs), a special class of DBNs. Furthermore, we leverage the prior knowledge of 2-TBNs to enhance the performance of the strategy we introduce. Our approach significantly improves the scalability and accuracy of 2-TBN structure learning. Experimental results demonstrate the effectiveness of our method, showing substantial improvements over existing algorithms in both computational efficiency and structure learning accuracy. On problem instances with more than 1,000 variables, our approach improves two accuracy metrics by 74.45% and 110.94% on average, respectively, while reducing runtime by 93.65% on average.
Learning Multi-graph Structure for Temporal Knowledge Graph Reasoning
results: Experimental results on five event-based benchmark datasets show that LMS outperforms state-of-the-art extrapolation models, indicating the advantage of modeling a multi-graph perspective for TKG reasoning.
Abstract
Temporal Knowledge Graph (TKG) reasoning that forecasts future events based on historical snapshots distributed over timestamps is denoted as extrapolation and has gained significant attention. Owing to its extreme versatility and variation in spatial and temporal correlations, TKG reasoning presents a challenging task, demanding efficient capture of concurrent structures and evolutional interactions among facts. While existing methods have made strides in this direction, they still fall short of harnessing the diverse forms of intrinsic expressive semantics of TKGs, which encompass entity correlations across multiple timestamps and periodicity of temporal information. This limitation constrains their ability to thoroughly reflect historical dependencies and future trends. In response to these drawbacks, this paper proposes an innovative reasoning approach that focuses on Learning Multi-graph Structure (LMS). Concretely, it comprises three distinct modules concentrating on multiple aspects of graph structure knowledge within TKGs, including concurrent and evolutional patterns along timestamps, query-specific correlations across timestamps, and semantic dependencies of timestamps, which capture TKG features from various perspectives. Besides, LMS incorporates an adaptive gate for merging entity representations both along and across timestamps effectively. Moreover, it integrates timestamp semantics into graph attention calculations and time-aware decoders, in order to impose temporal constraints on events and narrow down prediction scopes with historical statistics. Extensive experimental results on five event-based benchmark datasets demonstrate that LMS outperforms state-of-the-art extrapolation models, indicating the superiority of modeling a multi-graph perspective for TKG reasoning.
Rethinking Adversarial Training with Neural Tangent Kernel
results: The study shows that leveraging observations of NTK dynamics can improve the effectiveness and stability of existing AT methods, findings that may offer new ideas and directions for deep learning security.
Abstract
Adversarial training (AT) is an important and attractive topic in deep learning security, exhibiting mysteries and odd properties. Recent studies of neural network training dynamics based on Neural Tangent Kernel (NTK) make it possible to reacquaint AT and deeply analyze its properties. In this paper, we perform an in-depth investigation of AT process and properties with NTK, such as NTK evolution. We uncover three new findings that are missed in previous works. First, we disclose the impact of data normalization on AT and the importance of unbiased estimators in batch normalization layers. Second, we experimentally explore the kernel dynamics and propose more time-saving AT methods. Third, we study the spectrum feature inside the kernel to address the catastrophic overfitting problem. To the best of our knowledge, it is the first work leveraging the observations of kernel dynamics to improve existing AT methods.
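The object the analysis revolves around, the empirical NTK, is easy to compute for a tiny network: it is the Gram matrix of per-example parameter gradients. This sketch only evaluates it at initialization; the paper studies how the kernel evolves under adversarial training.

```python
import torch
import torch.nn as nn

# Empirical NTK of a small network: K(x, x') = <df(x)/dtheta, df(x')/dtheta>.
torch.manual_seed(0)
net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))

def param_grad(x):
    net.zero_grad()
    net(x.unsqueeze(0)).squeeze().backward()
    return torch.cat([p.grad.flatten() for p in net.parameters()])

xs = torch.randn(4, 2)
grads = torch.stack([param_grad(x) for x in xs])   # (n, n_params)
ntk = grads @ grads.T                              # (n, n) empirical NTK
print(ntk)
```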
Data Management For Large Language Models: A Survey
for: This paper aims to provide a comprehensive overview of current research in data management for Large Language Models (LLMs), covering various aspects of data management strategy design, including data quantity, data quality, domain/task composition, etc.
methods: The paper reviews and discusses the existing research on data management for LLMs, including the rationale behind management strategy selection, the consequential effects of data management, and methodologies for evaluating curated datasets.
results: The paper provides a comprehensive overview of current research in data management for LLMs, highlighting the challenges and limitations of existing approaches and outlining promising directions for future development. The paper also provides a collection of the latest papers on data management for LLMs, which can serve as a guiding resource for practitioners aspiring to construct powerful LLMs through effective data management practices.
Abstract
Data plays a fundamental role in the training of Large Language Models (LLMs). Effective data management, particularly in the formulation of a well-suited training dataset, holds significance for enhancing model performance and improving training efficiency during pretraining and supervised fine-tuning phases. Despite the considerable importance of data management, the current research community still falls short in providing a systematic analysis of the rationale behind management strategy selection, its consequential effects, methodologies for evaluating curated datasets, and the ongoing pursuit of improved strategies. Consequently, the exploration of data management has attracted more and more attention among the research community. This survey provides a comprehensive overview of current research in data management within both the pretraining and supervised fine-tuning stages of LLMs, covering various noteworthy aspects of data management strategy design: data quantity, data quality, domain/task composition, etc. Looking toward the future, we extrapolate existing challenges and outline promising directions for development in this field. Therefore, this survey serves as a guiding resource for practitioners aspiring to construct powerful LLMs through effective data management practices. The collection of the latest papers is available at https://github.com/ZigeW/data_management_LLM.
Rethinking Urban Mobility Prediction: A Super-Multivariate Time Series Forecasting Approach
results: Compared with existing state-of-the-art methods on three real-world datasets, SUMformer achieves excellent performance in urban mobility pattern modeling and long-term forecasting, with clear improvements in computational cost and efficiency.
Abstract
Long-term urban mobility predictions play a crucial role in the effective management of urban facilities and services. Conventionally, urban mobility data has been structured as spatiotemporal videos, treating longitude and latitude grids as fundamental pixels. Consequently, video prediction methods, relying on Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), have been instrumental in this domain. In our research, we introduce a fresh perspective on urban mobility prediction. Instead of oversimplifying urban mobility data as traditional video data, we regard it as a complex multivariate time series. This perspective involves treating the time-varying values of each grid in each channel as individual time series, necessitating a thorough examination of temporal dynamics, cross-variable correlations, and frequency-domain insights for precise and reliable predictions. To address this challenge, we present the Super-Multivariate Urban Mobility Transformer (SUMformer), which utilizes a specially designed attention mechanism to calculate temporal and cross-variable correlations and reduce computational costs stemming from a large number of time series. SUMformer also employs low-frequency filters to extract essential information for long-term predictions. Furthermore, SUMformer is structured with a temporal patch merge mechanism, forming a hierarchical framework that enables the capture of multi-scale correlations. Consequently, it excels in urban mobility pattern modeling and long-term prediction, outperforming current state-of-the-art methods across three real-world datasets.
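To make the "super-multivariate" view concrete, the sketch below (not the authors' code; the tensor shapes and patch length are hypothetical) shows how a conventional (T, C, H, W) mobility video is reinterpreted as C*H*W individual time series and cut into temporal patches, the input format a SUMformer-style model would consume.

```python
# A minimal sketch of the "super-multivariate" reshaping described above.
import torch

T, C, H, W = 168, 2, 32, 32               # hypothetical: one week of hourly in/out flows
video = torch.randn(T, C, H, W)            # conventional "spatiotemporal video" view

series = video.reshape(T, C * H * W).T     # (N, T) with N = C*H*W individual time series

patch_len = 24                             # hypothetical patch length (one day)
patches = series.unfold(dimension=1, size=patch_len, step=patch_len)   # (N, 7, 24)
print(series.shape, patches.shape)         # torch.Size([2048, 168]) torch.Size([2048, 7, 24])
```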
Hulk: A Universal Knowledge Translator for Human-Centric Tasks
paper_authors: Yizhou Wang, Yixuan Wu, Shixiang Tang, Weizhen He, Xun Guo, Feng Zhu, Lei Bai, Rui Zhao, Jian Wu, Tong He, Wanli Ouyang
for: This paper proposes a multimodal human-centric generalist model called Hulk, which can address most mainstream human-centric tasks simultaneously without task-specific fine-tuning.
methods: The proposed method uses two general heads, one for discrete representations and the other for continuous representations, to integrate knowledge across a wide range of tasks, treating human-centric tasks as modality translation.
results: Experimental results on 11 benchmarks across 8 human-centric tasks demonstrate the superiority of the proposed method, surpassing previous methods substantially. The code will be available on GitHub.
Abstract
Human-centric perception tasks, e.g., human mesh recovery, pedestrian detection, skeleton-based action recognition, and pose estimation, have wide industrial applications, such as metaverse and sports analysis. There is a recent surge to develop human-centric foundation models that can benefit a broad range of human-centric perception tasks. While many human-centric foundation models have achieved success, most of them only excel in 2D vision tasks or require extensive fine-tuning for practical deployment in real-world scenarios. These limitations severely restrict their usability across various downstream tasks and situations. To tackle these problems, we present Hulk, the first multimodal human-centric generalist model, capable of addressing most of the mainstream tasks simultaneously without task-specific finetuning, covering 2D vision, 3D vision, skeleton-based, and vision-language tasks. The key to achieving this is condensing various task-specific heads into two general heads, one for discrete representations, e.g., languages, and the other for continuous representations, e.g., location coordinates. The outputs of two heads can be further stacked into four distinct input and output modalities. This uniform representation enables Hulk to treat human-centric tasks as modality translation, integrating knowledge across a wide range of tasks. To validate the effectiveness of our proposed method, we conduct comprehensive experiments on 11 benchmarks across 8 human-centric tasks. Experimental results surpass previous methods substantially, demonstrating the superiority of our proposed method. The code will be available on https://github.com/OpenGVLab/HumanBench.
Risk-Controlling Model Selection via Guided Bayesian Optimization
results: The approach is demonstrated to be effective on a range of tasks with multiple desiderata, including low error rates, equitable predictions, handling spurious correlations, managing rate and distortion in generative models, and reducing computational cost.
Abstract
Adjustable hyperparameters of machine learning models typically impact various key trade-offs such as accuracy, fairness, robustness, or inference cost. Our goal in this paper is to find a configuration that adheres to user-specified limits on certain risks while being useful with respect to other conflicting metrics. We solve this by combining Bayesian Optimization (BO) with rigorous risk-controlling procedures, where our core idea is to steer BO towards an efficient testing strategy. Our BO method identifies a set of Pareto optimal configurations residing in a designated region of interest. The resulting candidates are statistically verified and the best-performing configuration is selected with guaranteed risk levels. We demonstrate the effectiveness of our approach on a range of tasks with multiple desiderata, including low error rates, equitable predictions, handling spurious correlations, managing rate and distortion in generative models, and reducing computational costs.
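As a rough illustration of the verify-then-select step described above, the sketch below applies a Hoeffding upper confidence bound to each candidate's empirical risk and keeps only configurations whose bound stays below a user limit; the configuration names, risk values, and choice of bound are illustrative, not the paper's exact procedure.

```python
# A minimal sketch of statistically verifying risk-controlled candidates.
import math

def hoeffding_ucb(emp_risk: float, n: int, delta: float) -> float:
    """Upper confidence bound for a [0,1]-bounded risk from n i.i.d. samples."""
    return emp_risk + math.sqrt(math.log(1.0 / delta) / (2.0 * n))

candidates = {"cfg_a": 0.04, "cfg_b": 0.06, "cfg_c": 0.12}   # hypothetical empirical risks
n_val, alpha, delta = 2000, 0.10, 0.05                       # validation size, risk limit, confidence

valid = {c: r for c, r in candidates.items()
         if hoeffding_ucb(r, n_val, delta) <= alpha}
print(valid)   # cfg_c is rejected: its upper bound exceeds the 0.10 limit
```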
ResEnsemble-DDPM: Residual Denoising Diffusion Probabilistic Models for Ensemble Learning
results: Experimental results show that ResEnsemble-DDPM can further improve the performance of existing models, and that its ensemble learning strategy generalizes to other downstream image generation tasks with strong competitiveness.
Abstract
Nowadays, denoising diffusion probabilistic models have been adapted for many image segmentation tasks, while existing end-to-end models have already demonstrated remarkable capabilities. Rather than using denoising diffusion probabilistic models alone, integrating the abilities of both denoising diffusion probabilistic models and existing end-to-end models can better improve the performance of image segmentation. Based on this, we implicitly introduce a residual term into the diffusion process and propose ResEnsemble-DDPM, which seamlessly integrates the diffusion model and the end-to-end model through ensemble learning. The output distributions of these two models are strictly symmetric with respect to the ground truth distribution, allowing us to integrate the two models by reducing the residual term. Experimental results demonstrate that our ResEnsemble-DDPM can further improve the capabilities of existing models. Furthermore, its ensemble learning strategy can be generalized to other downstream tasks in image generation and achieves strong competitiveness.
Jellyfish: A Large Language Model for Data Preprocessing
results: The model runs on a local, single, low-cost GPU while retaining strong natural-language understanding, so users can hand-craft instructions for DP tasks. Unlike existing approaches, it acquires domain knowledge during tuning and can optionally inject task- and dataset-specific knowledge at inference, and it includes an interpreter that explains its output decisions.
Abstract
In this paper, we present Jellyfish, an open-source LLM as a universal task solver for DP. Built on the Llama 2 13B model, Jellyfish is instruction-tuned with the datasets of several typical DP tasks including error detection, data imputation, schema matching, and entity matching, and delivers generalizability to other tasks. Remarkably, Jellyfish can operate on a local, single, and low-priced GPU with its 13 billion parameters, ensuring data security and enabling further tuning. Its proficiency in understanding natural language allows users to manually craft instructions for DP tasks. Unlike many existing methods that heavily rely on prior knowledge, Jellyfish acquires domain knowledge during its tuning process and integrates optional knowledge injection during inference. A distinctive feature of Jellyfish is its interpreter, which elucidates its output decisions. To construct Jellyfish, we develop a series of pre-tuning and DP-tuning techniques. Jellyfish is equipped with an instance serializer, which automatically translates raw data into model prompts, and a knowledge injector, which optionally introduces task- and dataset-specific knowledge to enhance DP performance. Our evaluation of Jellyfish, using a range of real datasets, shows its competitiveness compared to state-of-the-art methods and its strong generalizability to unseen tasks. Jellyfish's performance rivals that of GPT series models, and its interpreter offers enhanced reasoning capabilities compared to GPT-3.5. Furthermore, our evaluation highlights the effectiveness of the techniques employed in constructing Jellyfish. Our model is available at Hugging Face: https://huggingface.co/NECOUDBFM/Jellyfish .
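As a rough illustration of the instance serializer idea mentioned above, the sketch below turns a pair of raw records into a natural-language entity-matching instruction; the template and field names are made up for illustration and are not Jellyfish's actual prompt format.

```python
# A minimal sketch of serializing raw data into a DP model prompt.
def serialize_entity_matching(record_a: dict, record_b: dict) -> str:
    """Turn a pair of raw records into a natural-language DP instruction."""
    def fmt(record: dict) -> str:
        return ", ".join(f"{k}: {v}" for k, v in record.items())
    return (
        "You are given two product records.\n"
        f"Record A: {fmt(record_a)}\n"
        f"Record B: {fmt(record_b)}\n"
        "Do Record A and Record B refer to the same entity? Answer yes or no."
    )

prompt = serialize_entity_matching(
    {"title": "iPhone 13 128GB", "brand": "Apple"},
    {"title": "Apple iPhone13 (128 GB)", "brand": "apple"},
)
print(prompt)
```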
STADEE: STAtistics-based DEEp Detection of Machine Generated Text
results: Tested across multiple datasets and scenarios (in-domain, out-of-domain, and in-the-wild), STADEE outperforms both traditional statistical methods and fine-tuned PLMs, reaching an 87.05% F1 score in-domain and showing particular strength in out-of-domain and in-the-wild settings, which highlights its effectiveness and generalizability.
Abstract
We present STADEE, a STAtistics-based DEEp detection method to identify machine-generated text, addressing the limitations of current methods that rely heavily on fine-tuning pre-trained language models (PLMs). STADEE integrates key statistical text features with a deep classifier, focusing on aspects like token probability and cumulative probability, crucial for handling nucleus sampling. Tested across diverse datasets and scenarios (in-domain, out-of-domain, and in-the-wild), STADEE demonstrates superior performance, achieving an 87.05% F1 score in-domain and outperforming both traditional statistical methods and fine-tuned PLMs, especially in out-of-domain and in-the-wild settings, highlighting its effectiveness and generalizability.
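To illustrate the two statistics STADEE builds on, the sketch below computes per-token probability and cumulative probability (the mass that nucleus sampling thresholds against) from per-step logits; random logits stand in for a real PLM's outputs, and the feature set is a simplification of the paper's.

```python
# A minimal sketch of token-probability and cumulative-probability features.
import torch

torch.manual_seed(0)
seq_len, vocab = 12, 50257
logits = torch.randn(seq_len, vocab)            # stand-in for LM logits at each step
token_ids = torch.randint(vocab, (seq_len,))    # the tokens actually observed

probs = logits.softmax(dim=-1)
token_prob = probs.gather(1, token_ids[:, None]).squeeze(1)   # p(observed token)

# Cumulative probability: total mass of tokens ranked at or above the observed
# one -- the quantity nucleus (top-p) sampling thresholds against.
sorted_probs, sorted_ids = probs.sort(dim=-1, descending=True)
ranks = (sorted_ids == token_ids[:, None]).float().argmax(dim=-1)
cum_prob = torch.stack([sorted_probs[i, : ranks[i] + 1].sum() for i in range(seq_len)])

features = torch.stack([token_prob, cum_prob], dim=-1)        # (seq_len, 2) per-token features
print(features.shape)
```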
Analyze Drivers’ Intervention Behavior During Autonomous Driving – A VR-incorporated Approach
results: The study finds that drivers' intervention behavior exhibits characteristic patterns that can be used to improve the performance of automated control under similar scenarios; the integrated, immersive tool is also valuable for research on human-to-automation trust.
Abstract
Given the rapid advances in ITS technologies, future mobility is pointing toward vehicular autonomy. However, there is still a long way to go before full automation, and human intervention is required. This work sheds light on human drivers' intervention behavior in the operation of autonomous vehicles (AVs) and uses this knowledge to improve the perception of critical driving scenarios. Experimental environments were implemented in which virtual reality (VR) and traffic micro-simulation are integrated, and tests were carried out under typical and diverse traffic scenes. Performance indicators such as the probability of intervention and accident rates are defined and used to quantify and compare risk levels. By offering novel insights into drivers' intervention behavior, this work will help improve the performance of automated control under similar scenarios. Furthermore, such an integrated and immersive tool for autonomous driving studies will be valuable for research on human-to-automation trust. To the best of the authors' knowledge, this work is among the pioneering efforts toward such tools.
Customize your NeRF: Adaptive Source Driven 3D Scene Editing via Local-Global Iterative Training
paper_authors: Runze He, Shaofei Huang, Xuecheng Nie, Tianrui Hui, Luoqi Liu, Jiao Dai, Jizhong Han, Guanbin Li, Si Liu
for: This paper targets the adaptive source-driven 3D scene editing task, proposing a CustomNeRF model that unifies a text description or a reference image as the editing prompt.
results: Experimental results show that CustomNeRF produces precise editing results in real scenes under both text-driven and image-driven settings.
Abstract
In this paper, we target the adaptive source driven 3D scene editing task by proposing a CustomNeRF model that unifies a text description or a reference image as the editing prompt. However, obtaining desired editing results conformed with the editing prompt is nontrivial since there exist two significant challenges, including accurate editing of only foreground regions and multi-view consistency given a single-view reference image. To tackle the first challenge, we propose a Local-Global Iterative Editing (LGIE) training scheme that alternates between foreground region editing and full-image editing, aimed at foreground-only manipulation while preserving the background. For the second challenge, we also design a class-guided regularization that exploits class priors within the generation model to alleviate the inconsistency problem among different views in image-driven editing. Extensive experiments show that our CustomNeRF produces precise editing results under various real scenes for both text- and image-driven settings.
ChatGPT as a Math Questioner? Evaluating ChatGPT on Generating Pre-university Math Questions
for: This paper uses Neural-ODE models to address the problem of constant memory cost and provides a Nesterov-accelerated-gradient-based ODE solver to improve the stability and performance of Neural-ODE models.
methods: The paper parameterizes a differential equation with a continuous-depth neural network solved by a numerical ODE integrator, and proposes an NAG-based ODE solver to improve the stability and performance of Neural-ODE models.
results: The paper demonstrates the effectiveness of the method on three different tasks, including supervised classification, density estimation, and time-series modelling, showing faster training and better or comparable performance against other fixed-step explicit ODE solvers and discrete-depth models.
Abstract
Neural ODEs parameterize a differential equation using a continuous-depth neural network and solve it using a numerical ODE integrator. These models offer a constant memory cost, in contrast to models with a discrete sequence of hidden layers, whose memory cost grows linearly with the number of layers. Beyond memory efficiency, other benefits of Neural ODEs include the ability to adapt the evaluation approach to the input and the flexibility to choose numerical precision or fast training. Despite these benefits, they still have limitations. We identify the ODE integrator (also called the ODE solver) as the weakest link in the chain, as it may have stability, consistency, and convergence (CCS) issues and may converge slowly or not at all. We propose a first-order Nesterov's accelerated gradient (NAG) based ODE solver which is provably tuned with respect to the CCS conditions. We empirically demonstrate the efficacy of our approach by training faster, while achieving better or comparable performance against Neural ODEs employing other fixed-step explicit ODE solvers as well as discrete-depth models such as ResNet, on three different tasks: supervised classification, density estimation, and time-series modelling.
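As one plausible reading of an NAG-style fixed-step solver (a hedged sketch, not the paper's exact scheme), the code below integrates dy/dt = f(t, y) using a Nesterov look-ahead evaluation with a damped velocity update, so that the scheme stays consistent with the underlying ODE.

```python
# A minimal sketch of a Nesterov-accelerated fixed-step integrator (assumed form).
import numpy as np

def nag_odeint(f, y0, t0, t1, n_steps=100, mu=0.5):
    """Integrate dy/dt = f(t, y) with momentum-accelerated fixed steps."""
    h = (t1 - t0) / n_steps
    y = np.asarray(y0, dtype=float)
    v = np.zeros_like(y)
    t = t0
    for _ in range(n_steps):
        y_look = y + mu * v                                      # Nesterov look-ahead point
        v = mu * v + (1.0 - mu) * h * np.asarray(f(t, y_look))   # damped velocity update
        y = y + v
        t += h
    return y

# Usage: dy/dt = -y from y(0)=1 gives ~0.368 (= exp(-1)) at t=1,
# up to discretization error.
print(nag_odeint(lambda t, y: -y, [1.0], 0.0, 1.0))
```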
The Contemporary Art of Image Search: Iterative User Intent Expansion via Vision-Language Model
paper_authors: Yilin Ye, Qian Zhu, Shishi Xiao, Kang Zhang, Wei Zeng
for: Improving the user search experience by helping users express their search intents more accurately.
methods: Uses vision-language models to parse and compose multi-modal user inputs, improving the accuracy of search results and user satisfaction.
results: Implemented in an NFT search system, the framework delivers a better search experience and lets users iteratively refine and adjust their detailed search intents through contextualized interactions.
Abstract
Image search is an essential and user-friendly method to explore vast galleries of digital images. However, existing image search methods heavily rely on proximity measurements like tag matching or image similarity, requiring precise user inputs for satisfactory results. To meet the growing demand for a contemporary image search engine that enables accurate comprehension of users' search intentions, we introduce an innovative user intent expansion framework. Our framework leverages visual-language models to parse and compose multi-modal user inputs to provide more accurate and satisfying results. It comprises a two-stage process: 1) a parsing stage that incorporates a language parsing module with large language models to enhance the comprehension of textual inputs, along with a visual parsing module that integrates an interactive segmentation module to swiftly identify detailed visual elements within images; and 2) a logic composition stage that combines multiple user search intents into a unified logic expression for more sophisticated operations in complex searching scenarios. Moreover, the intent expansion framework enables users to perform flexible contextualized interactions with the search results to further specify or adjust their detailed search intents iteratively. We implemented the framework in an image search system for NFT (non-fungible token) search and conducted a user study to evaluate its usability and novel properties. The results indicate that the proposed framework significantly improves users' image search experience. In particular, the parsing and contextualized interactions prove useful in allowing users to express their search intents more accurately and engage in a more enjoyable iterative search experience.
results: Compared with QMeL approaches, the method achieves 3X better multi-class separation while using only half the number of gates and half the depth, and also outperforms classical networks with similar configurations.
Abstract
Deep metric learning has recently shown extremely promising results in the classical data domain, creating well-separated feature spaces. This idea was also adapted to quantum computers via Quantum Metric Learning (QMeL). QMeL consists of a two-step process: a classical model compresses the data to fit into the limited number of qubits, and a Parameterized Quantum Circuit (PQC) is then trained to create better separation in Hilbert space. However, on Noisy Intermediate-Scale Quantum (NISQ) devices, QMeL solutions result in high circuit width and depth, both of which limit scalability. We propose Quantum Polar Metric Learning (QPMeL), which uses a classical model to learn the parameters of the polar form of a qubit. We then utilize a shallow PQC with $R_y$ and $R_z$ gates to create the state and a trainable layer of $ZZ(\theta)$-gates to learn entanglement. The circuit also computes fidelity via a SWAP test for our proposed Fidelity Triplet Loss function, used to train both classical and quantum components. When compared to QMeL approaches, QPMeL achieves 3X better multi-class separation, while using only half the number of gates and depth. We also demonstrate that QPMeL outperforms classical networks with similar configurations, presenting a promising avenue for future research on fully classical models with quantum loss functions.
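A minimal Qiskit sketch of the circuit shape described above is given below: Ry/Rz state preparation from classically learned polar angles, a trainable layer of ZZ rotations for entanglement, and a SWAP test whose ancilla measurement yields fidelity. The qubit count, parameter names, and wiring are illustrative, not the paper's exact circuit.

```python
# A minimal sketch of an Ry/Rz + ZZ embedding circuit with a SWAP test.
from qiskit import QuantumCircuit
from qiskit.circuit import Parameter

n = 2  # qubits per embedding (hypothetical)

def polar_embed(prefix: str) -> QuantumCircuit:
    qc = QuantumCircuit(n)
    for q in range(n):
        qc.ry(Parameter(f"{prefix}_theta{q}"), q)        # polar angle from the classical model
        qc.rz(Parameter(f"{prefix}_phi{q}"), q)          # azimuthal angle
    for q in range(n - 1):
        qc.rzz(Parameter(f"{prefix}_zz{q}"), q, q + 1)   # trainable ZZ entanglement
    return qc

# SWAP test: ancilla on qubit 0, embedding A on 1..n, embedding B on n+1..2n.
swap_test = QuantumCircuit(2 * n + 1, 1)
swap_test.compose(polar_embed("a"), qubits=range(1, n + 1), inplace=True)
swap_test.compose(polar_embed("b"), qubits=range(n + 1, 2 * n + 1), inplace=True)
swap_test.h(0)
for q in range(n):
    swap_test.cswap(0, 1 + q, 1 + n + q)
swap_test.h(0)
swap_test.measure(0, 0)   # P(measuring 0) = (1 + fidelity) / 2
print(swap_test.draw())
```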
Explore, Select, Derive, and Recall: Augmenting LLM with Human-like Memory for Mobile Task Automation
results: Evaluated on 50 distinct mobile tasks across 5 widely used apps, MemoDroid adapts learned tasks to varying contexts with 100% accuracy and reduces task latency and cost by 69.22% and 77.36%, respectively, compared with a GPT-4-powered baseline.
Abstract
The advent of large language models (LLMs) has opened up new opportunities in the field of mobile task automation. Their superior language understanding and reasoning capabilities allow users to automate complex and repetitive tasks. However, due to the inherent unreliability and high operational cost of LLMs, their practical applicability is quite limited. To address these issues, this paper introduces MemoDroid, an innovative LLM-based mobile task automator enhanced with a unique app memory. MemoDroid emulates the cognitive process of humans interacting with a mobile app -- explore, select, derive, and recall. This approach allows for a more precise and efficient learning of a task's procedure by breaking it down into smaller, modular components that can be re-used, re-arranged, and adapted for various objectives. We implement MemoDroid using online LLMs services (GPT-3.5 and GPT-4) and evaluate its performance on 50 unique mobile tasks across 5 widely used mobile apps. The results indicate that MemoDroid can adapt learned tasks to varying contexts with 100% accuracy and reduces their latency and cost by 69.22% and 77.36% compared to a GPT-4 powered baseline.
Characterizing Large Language Model Geometry Solves Toxicity Detection and Generation
results: The results show that, guided by the geometric interpretation, the intrinsic dimension of Llama$2$'s embeddings can be controlled, and that 7 interpretable spline features offering a rich abstract representation can be extracted per layer and used to solve downstream tasks such as toxicity detection, inferring the domain of a prompt, and the Jigsaw challenge.
Abstract
Large Language Models~(LLMs) drive current AI breakthroughs despite very little being known about their internal representations, e.g., how to extract a few informative features to solve various downstream tasks. To provide a practical and principled answer, we propose to characterize LLMs from a geometric perspective. We obtain in closed form (i) the intrinsic dimension in which the Multi-Head Attention embeddings are constrained to exist and (ii) the partition and per-region affine mappings of the per-layer feedforward networks. Our results are informative, do not rely on approximations, and are actionable. First, we show that, motivated by our geometric interpretation, we can bypass Llama$2$'s RLHF by controlling its embedding's intrinsic dimension through informed prompt manipulation. Second, we derive $7$ interpretable spline features that can be extracted from any (pre-trained) LLM layer, providing a rich abstract representation of their inputs. Those features alone ($224$ for Mistral-7B and Llama$2$-7B) are sufficient to help solve toxicity detection, infer the domain of the prompt, and even tackle the Jigsaw challenge, which aims at characterizing the type of toxicity of various prompts. Our results demonstrate how, even in large-scale regimes, exact theoretical results can answer practical questions in language models. Code: \url{https://github.com/RandallBalestriero/SplineLLM}.
Quality Diversity in the Amorphous Fortress (QD-AF): Evolving for Complexity in 0-Player Games
results: The approach generates diverse environments exhibiting both competitive and cooperative multi-agent, multi-species survival dynamics, and the generated worlds can collectively serve as training and testing grounds for learning algorithms.
Abstract
We explore the generation of diverse environments using the Amorphous Fortress (AF) simulation framework. AF defines a set of Finite State Machine (FSM) nodes and edges that can be recombined to control the behavior of agents in the `fortress' grid-world. The behaviors and conditions of the agents within the framework are designed to capture the common building blocks of multi-agent artificial life and reinforcement learning environments. Using quality diversity evolutionary search, we generate diverse sets of environments. These environments exhibit certain types of complexity according to measures of agents' FSM architectures and activations, and collective behaviors. Our approach, Quality Diversity in Amorphous Fortress (QD-AF) generates families of 0-player games akin to simplistic ecological models, and we identify the emergence of both competitive and co-operative multi-agent and multi-species survival dynamics. We argue that these generated worlds can collectively serve as training and testing grounds for learning algorithms.
GVFs in the Real World: Making Predictions Online for Water Treatment
results: The paper finds that the GVF prediction agent attains a lower normalized mean-squared error than n-step prediction, and that learning online lets the predictive model adapt to the continually changing plant.
Abstract
In this paper we investigate the use of reinforcement-learning based prediction approaches for a real drinking-water treatment plant. Developing such a prediction system is a critical step on the path to optimizing and automating water treatment. Before that, there are many questions to answer about the predictability of the data, suitable neural network architectures, how to overcome partial observability and more. We first describe this dataset, and highlight challenges with seasonality, nonstationarity, partial observability, and heterogeneity across sensors and operation modes of the plant. We then describe General Value Function (GVF) predictions -- discounted cumulative sums of observations -- and highlight why they might be preferable to classical n-step predictions common in time series prediction. We discuss how to use offline data to appropriately pre-train our temporal difference learning (TD) agents that learn these GVF predictions, including how to select hyperparameters for online fine-tuning in deployment. We find that the TD-prediction agent obtains an overall lower normalized mean-squared error than the n-step prediction agent. Finally, we show the importance of learning in deployment, by comparing a TD agent trained purely offline with no online updating to a TD agent that learns online. This final result is one of the first to motivate the importance of adapting predictions in real-time, for non-stationary high-volume systems in the real world.
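To make the contrast concrete, the toy sketch below (with made-up sensor values) compares a classical n-step target with a GVF target, i.e., a discounted cumulative sum of future observations as described above.

```python
# A minimal sketch of n-step vs. GVF-style prediction targets.
import numpy as np

obs = np.array([3.1, 3.0, 3.4, 4.2, 5.0, 5.1, 4.8, 4.9])   # hypothetical sensor readings

def n_step_target(obs, t, n):
    return obs[t + n]                            # classical "value n steps ahead"

def gvf_target(obs, t, gamma=0.9):
    future = obs[t + 1:]
    discounts = gamma ** np.arange(len(future))
    return float(np.sum(discounts * future))     # discounted cumulative sum of observations

print(n_step_target(obs, 0, 4), gvf_target(obs, 0))
```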
paper_authors: Duc Q. Nguyen, Thanh Toan Nguyen, Tho quan
for: This paper is written for researchers and practitioners working with graph-based data and subgraph matching, particularly in the fields of database systems, biochemistry, and cognitive science.
methods: The paper proposes a new method called xNeuSM, which uses Graph Learnable Multi-hop Attention Networks (GLeMA) to adaptively learn the attention factor decay for each node across hops, rather than relying on fixed hyperparameters.
results: The paper reports substantial improvements in prediction accuracy of up to 34% compared to approximate baselines, and at least a seven-fold faster query time than exact algorithms, in empirical evaluations on real-world datasets.
Abstract
Subgraph matching is a challenging problem with a wide range of applications in database systems, biochemistry, and cognitive science. It involves determining whether a given query graph is present within a larger target graph. Traditional graph-matching algorithms provide precise results but face challenges in large graph instances due to the NP-complete problem, limiting their practical applicability. In contrast, recent neural network-based approximations offer more scalable solutions, but often lack interpretable node correspondences. To address these limitations, this article presents xNeuSM: Explainable Neural Subgraph Matching which introduces Graph Learnable Multi-hop Attention Networks (GLeMA) that adaptively learns the parameters governing the attention factor decay for each node across hops rather than relying on fixed hyperparameters. We provide a theoretical analysis establishing error bounds for GLeMA's approximation of multi-hop attention as a function of the number of hops. Additionally, we prove that learning distinct attention decay factors for each node leads to a correct approximation of multi-hop attention. Empirical evaluation on real-world datasets shows that xNeuSM achieves substantial improvements in prediction accuracy of up to 34% compared to approximate baselines and, notably, at least a seven-fold faster query time than exact algorithms. The source code of our implementation is available at https://github.com/martinakaduc/xNeuSM.
A Simple and Scalable Representation for Graph Generation
methods: The method uses a new graph representation named gap encoded edge list (GEEL), whose representation size is small and scales with the number of edges; GEEL further reduces the vocabulary size through gap encoding and a bandwidth-restriction scheme.
results: The study finds that adopting this compact representation improves not only scalability but also performance by simplifying the graph generation process; GEEL's effectiveness is demonstrated on ten non-attributed and two molecular graph generation tasks.
Abstract
Recently, there has been a surge of interest in employing neural networks for graph generation, a fundamental statistical learning problem with critical applications like molecule design and community analysis. However, most approaches encounter significant limitations when generating large-scale graphs. This is due to their requirement to output the full adjacency matrices whose size grows quadratically with the number of nodes. In response to this challenge, we introduce a new, simple, and scalable graph representation named gap encoded edge list (GEEL) that has a small representation size that aligns with the number of edges. In addition, GEEL significantly reduces the vocabulary size by incorporating the gap encoding and bandwidth restriction schemes. GEEL can be autoregressively generated with the incorporation of node positional encoding, and we further extend GEEL to deal with attributed graphs by designing a new grammar. Our findings reveal that the adoption of this compact representation not only enhances scalability but also bolsters performance by simplifying the graph generation process. We conduct a comprehensive evaluation across ten non-attributed and two molecular graph generation tasks, demonstrating the effectiveness of GEEL.
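As a rough illustration of the gap-encoding idea, the toy sketch below stores each edge as small gaps rather than absolute node indices; the exact GEEL grammar additionally handles bandwidth restriction and node positional encoding, which this sketch omits.

```python
# A minimal sketch of gap encoding an edge list.
def gap_encode(edges):
    """Sort edges, then store each source as a gap from the previous source
    and each target as a gap from its own source -- keeping tokens small."""
    edges = sorted(edges)
    encoded, prev_src = [], 0
    for src, dst in edges:
        encoded.append((src - prev_src, dst - src))
        prev_src = src
    return encoded

edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
print(gap_encode(edges))   # [(0, 1), (0, 2), (1, 1), (1, 1), (1, 1)]
```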
Local-Global History-aware Contrastive Learning for Temporal Knowledge Graph Reasoning
results: Experimental results show that LogCL delivers better and more robust performance than baseline models on four benchmark datasets.
Abstract
Temporal knowledge graphs (TKGs) have been identified as a promising approach to represent the dynamics of facts along the timeline. The extrapolation of TKG is to predict unknowable facts happening in the future, holding significant practical value across diverse fields. Most extrapolation studies in TKGs focus on modeling global historical fact repeating and cyclic patterns, as well as local historical adjacent fact evolution patterns, showing promising performance in predicting future unknown facts. Yet, existing methods still face two major challenges: (1) They usually neglect the importance of historical information in KG snapshots related to the queries when encoding the local and global historical information; (2) They exhibit weak anti-noise capabilities, which hinders their performance when the inputs are contaminated with noise. To this end, we propose a novel Local-global history-aware Contrastive Learning model (LogCL) for TKG reasoning, which adopts contrastive learning to better guide the fusion of local and global historical information and enhance the ability to resist interference. Specifically, for the first challenge, LogCL proposes an entity-aware attention mechanism applied to the local and global historical facts encoder, which captures the key historical information related to queries. For the latter issue, LogCL designs four historical query contrast patterns, effectively improving the robustness of the model. The experimental results on four benchmark datasets demonstrate that LogCL delivers better and more robust performance than the state-of-the-art baselines.
Synthetic Data Generation Techniques for Developing AI-based Speech Assessments for Parkinson’s Disease (A Comparative Study)
results: The study finds that deep-learning-based data generation techniques can raise the accuracy of machine learning classifiers, thereby improving the diagnostic accuracy of AI-based speech assessment systems.
Abstract
Changes in speech and language are among the first signs of Parkinson's disease (PD). Thus, clinicians have tried to identify individuals with PD from their voices for years. Doctors can leverage AI-based speech assessments to spot PD thanks to advancements in artificial intelligence (AI). Such AI systems can be developed using machine learning classifiers that have been trained using individuals' voices. Although several studies have shown reasonable results in developing such AI systems, these systems would need more data samples to achieve promising performance. This paper explores using deep learning-based data generation techniques on the accuracy of machine learning classifiers that are the core of such systems.
OCGEC: One-class Graph Embedding Classification for DNN Backdoor Detection
results: Evaluated across multiple tasks, OCGEC achieves AUC scores above 98%, exceeding existing methods even though those rely on large numbers of positive and negative training samples; the approach also offers a new perspective that can benefit other backdoor defense tasks.
Abstract
Deep neural networks (DNNs) have been found vulnerable to backdoor attacks, raising security concerns about their deployment in mission-critical applications. There are various approaches to detect backdoor attacks; however, they all make certain assumptions about the target attack to be detected and require equal and large numbers of clean and backdoor samples for training, which renders these detection methods quite limited in real-world circumstances. This study proposes a novel one-class classification framework called One-class Graph Embedding Classification (OCGEC) that uses GNNs for model-level backdoor detection with only a small amount of clean data. First, we train thousands of tiny models as raw datasets from a small number of clean datasets. Following that, we design an ingenious model-to-graph method for converting each model's structural details and weight features into graph data. We then pre-train a generative self-supervised graph autoencoder (GAE) to better learn the features of benign models in order to detect backdoor models without knowing the attack strategy. After that, we dynamically combine the GAE and one-class classifier optimization goals to form classification boundaries that distinguish backdoor models from benign models. Our OCGEC combines the powerful representation capabilities of graph neural networks with the utility of one-class classification techniques in the field of anomaly detection. In comparison to other baselines, it achieves AUC scores of more than 98% on a number of tasks, which far exceeds existing methods for detection even when they rely on a huge number of positive and negative samples. Our pioneering application of graph-based scenarios for generic backdoor detection can provide new insights that can be used to improve other backdoor defense tasks. Code is available at https://github.com/jhy549/OCGEC.
Signed Binarization: Unlocking Efficiency Through Repetition-Sparsity Trade-Off
results: The results show that the method is more accurate than binarization with the same number of non-zero weights, while achieving a 26% speedup on real hardware, doubling energy efficiency, and reducing density by 2.8x.
Abstract
Efficient inference of Deep Neural Networks (DNNs) on resource-constrained edge devices is essential. Quantization and sparsity are key algorithmic techniques that translate to repetition and sparsity within tensors at the hardware-software interface. This paper introduces the concept of repetition-sparsity trade-off that helps explain computational efficiency during inference. We propose Signed Binarization, a unified co-design framework that synergistically integrates hardware-software systems, quantization functions, and representation learning techniques to address this trade-off. Our results demonstrate that Signed Binarization is more accurate than binarization with the same number of non-zero weights. Detailed analysis indicates that signed binarization generates a smaller distribution of effectual (non-zero) parameters nested within a larger distribution of total parameters, both of the same type, for a DNN block. Finally, our approach achieves a 26% speedup on real hardware, doubles energy efficiency, and reduces density by 2.8x compared to binary methods for ResNet 18, presenting an alternative solution for deploying efficient models in resource-limited environments.
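As a rough illustration of trading repetition against sparsity, the sketch below quantizes a weight tensor to +/- alpha while zeroing a chosen fraction of the smallest-magnitude entries; the thresholding rule and scale estimate are illustrative, not the paper's exact quantization function.

```python
# A minimal sketch of a signed, sparsity-aware binarizer (assumed form).
import torch

def signed_binarize(w: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Keep the largest-magnitude weights as +/- alpha, zero out the rest."""
    k = int(w.numel() * (1.0 - sparsity))                   # number of effectual weights
    thresh = w.abs().flatten().kthvalue(w.numel() - k + 1).values
    mask = (w.abs() >= thresh).float()
    alpha = (w.abs() * mask).sum() / mask.sum()             # scale from surviving weights
    return alpha * torch.sign(w) * mask

w = torch.randn(4, 4)
wq = signed_binarize(w)
print((wq != 0).float().mean())   # ~0.5 density: each nonzero is +alpha or -alpha
```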
How to Configure Good In-Context Sequence for Visual Question Answering
results: Through extensive experiments on three VQA datasets (VQAv2, VizWiz, and OK-VQA), the study uncovers three important inner properties of the applied LVLM and demonstrates which strategies consistently improve ICL VQA performance.
Abstract
Inspired by the success of Large Language Models in dealing with new tasks via In-Context Learning (ICL) in NLP, researchers have also developed Large Vision-Language Models (LVLMs) with ICL capabilities. However, when implementing ICL using these LVLMs, researchers usually resort to the simplest way like random sampling to configure the in-context sequence, thus leading to sub-optimal results. To enhance the ICL performance, in this study, we use Visual Question Answering (VQA) as case study to explore diverse in-context configurations to find the powerful ones. Additionally, through observing the changes of the LVLM outputs by altering the in-context sequence, we gain insights into the inner properties of LVLMs, improving our understanding of them. Specifically, to explore in-context configurations, we design diverse retrieval methods and employ different strategies to manipulate the retrieved demonstrations. Through exhaustive experiments on three VQA datasets: VQAv2, VizWiz, and OK-VQA, we uncover three important inner properties of the applied LVLM and demonstrate which strategies can consistently improve the ICL VQA performance. Our code is provided in: https://github.com/GaryJiajia/OFv2_ICL_VQA.
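One retrieval strategy of the kind explored here can be sketched as follows: choose in-context demonstrations by embedding similarity to the query rather than at random. The embeddings below are random stand-ins for real image/question features, and top_k_demos is a hypothetical helper.

```python
# A minimal sketch of similarity-based in-context demonstration retrieval.
import numpy as np

rng = np.random.default_rng(0)
demo_embeds = rng.normal(size=(100, 512))     # hypothetical embeddings of candidate demos
query_embed = rng.normal(size=512)

def top_k_demos(query, demos, k=4):
    demos_n = demos / np.linalg.norm(demos, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    sims = demos_n @ query_n                  # cosine similarity to the query
    return np.argsort(-sims)[:k]              # indices of the k most similar demos

print(top_k_demos(query_embed, demo_embeds))  # vs. random sampling: rng.choice(100, 4)
```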
APoLLo: Unified Adapter and Prompt Learning for Vision Language Models
results: On three representative tasks, APoLLo achieves a relative gain of up to 6.03% over MaPLe (the prior state of the art) on novel classes across 10 diverse image recognition datasets.
Abstract
The choice of input text prompt plays a critical role in the performance of Vision-Language Pretrained (VLP) models such as CLIP. We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models. Our method is designed to substantially improve the generalization capabilities of VLP models when they are fine-tuned in a few-shot setting. We introduce trainable cross-attention-based adapter layers in conjunction with vision and language encoders to strengthen the alignment between the two modalities. We enforce consistency between the respective encoder branches (receiving augmented inputs) to prevent overfitting in downstream tasks. Our method is evaluated on three representative tasks: generalization to novel classes, cross-dataset evaluation, and unseen domain shifts. In practice, APoLLo achieves a relative gain up to 6.03% over MaPLe (SOTA) on novel classes for 10 diverse image recognition datasets.
Explainable AI is Responsible AI: How Explainability Creates Trustworthy and Socially Responsible Artificial Intelligence
methods: The paper analyzes state-of-the-art literature on responsible AI (RAI) and explainable AI (XAI) technologies, demonstrating that XAI can be used to ensure fairness, robustness, privacy, security, and transparency in a wide range of contexts.
results: The study concludes that XAI is an essential foundation for every pillar of RAI, providing a critical tool for developing trustworthy AI systems that minimize bias, protect privacy, support security, and enhance transparency and accountability.
Abstract
Artificial intelligence (AI) has been clearly established as a technology with the potential to revolutionize fields from healthcare to finance - if developed and deployed responsibly. This is the topic of responsible AI, which emphasizes the need to develop trustworthy AI systems that minimize bias, protect privacy, support security, and enhance transparency and accountability. Explainable AI (XAI) has been broadly considered as a building block for responsible AI (RAI), with most of the literature considering it as a solution for improved transparency. This work proposes that XAI and responsible AI are significantly more deeply entwined. In this work, we explore state-of-the-art literature on RAI and XAI technologies. Based on our findings, we demonstrate that XAI can be utilized to ensure fairness, robustness, privacy, security, and transparency in a wide range of contexts. Our findings lead us to conclude that XAI is an essential foundation for every pillar of RAI.
The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning
results: The study finds that alignment via supervised fine-tuning or human feedback can substantially change LLM behavior, yet with strategic prompting and ICL, a tuning-free alignment method can reach comparable performance.
Abstract
The alignment tuning process of large language models (LLMs) typically involves instruction learning through supervised fine-tuning (SFT) and preference tuning via reinforcement learning from human feedback (RLHF). A recent study, LIMA (Zhou et al. 2023), shows that using merely 1K examples for SFT can achieve significant alignment performance as well, suggesting that the effect of alignment tuning might be "superficial." This raises questions about how exactly the alignment tuning transforms a base LLM. We analyze the effect of alignment tuning by examining the token distribution shift between base LLMs and their aligned counterpart. Our findings reveal that base LLMs and their alignment-tuned versions perform nearly identically in decoding on the majority of token positions. Most distribution shifts occur with stylistic tokens. These direct evidence strongly supports the Superficial Alignment Hypothesis suggested by LIMA. Based on these findings, we rethink the alignment of LLMs by posing the research question: how effectively can we align base LLMs without SFT or RLHF? To address this, we introduce a simple, tuning-free alignment method, URIAL. URIAL achieves effective alignment purely through in-context learning (ICL) with base LLMs, requiring as few as three constant stylistic examples and a system prompt. We conduct a fine-grained and interpretable evaluation on a diverse set of examples, named JUST-EVAL-INSTRUCT. Results demonstrate that base LLMs with URIAL can match or even surpass the performance of LLMs aligned with SFT or SFT+RLHF. We show that the gap between tuning-free and tuning-based alignment methods can be significantly reduced through strategic prompting and ICL. Our findings on the superficial nature of alignment tuning and results with URIAL suggest that deeper analysis and theoretical understanding of alignment is crucial to future LLM research.
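A minimal sketch of the tuning-free recipe described above appears below: a system prompt plus a few constant stylistic in-context examples prepended to the user query. The prompt wording and examples are illustrative, not URIAL's actual ones.

```python
# A minimal sketch of tuning-free alignment via in-context learning.
SYSTEM = ("You are a helpful, honest assistant. Answer thoroughly, "
          "acknowledge uncertainty, and decline unsafe requests politely.")

STYLISTIC_EXAMPLES = [
    ("What is the boiling point of water?",
     "At standard atmospheric pressure, water boils at 100 °C (212 °F). "
     "The boiling point drops at higher altitudes because pressure is lower."),
    ("Can you help me pick a lock?",
     "I can't help with bypassing locks you don't own. If you're locked out "
     "of your own home, a licensed locksmith is the safe option."),
    ("Summarize photosynthesis in one sentence.",
     "Photosynthesis is the process by which plants use sunlight to convert "
     "water and carbon dioxide into glucose and oxygen."),
]

def build_urial_style_prompt(user_query: str) -> str:
    parts = [SYSTEM, ""]
    for q, a in STYLISTIC_EXAMPLES:          # three constant stylistic examples
        parts += [f"# Query:\n{q}", f"# Answer:\n{a}", ""]
    parts += [f"# Query:\n{user_query}", "# Answer:"]
    return "\n".join(parts)

print(build_urial_style_prompt("Explain why the sky is blue."))
```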
methods: The study proposes Koopman Embed to Equivariant Control (KEEC), which, drawing on Lie theory, learns a non-linear dynamical system defined on a manifold and performs optimal control on the corresponding equivariant geometry.
results: The study finds that isometric and isomorphic loss functions, which guarantee the compactness and smoothness of the geometry, outperform loss functions without these properties and enable quadratic convergence.
Abstract
This paper investigates how representation learning can enable optimal control in unknown and complex dynamics, such as chaotic and non-linear systems, without relying on prior domain knowledge of the dynamics. The core idea is to establish an equivariant geometry that is diffeomorphic to the manifold defined by a dynamical system and to perform optimal control within this corresponding geometry, which is a non-trivial task. To address this challenge, Koopman Embed to Equivariant Control (KEEC) is introduced for model learning and control. Inspired by Lie theory, KEEC begins by learning a non-linear dynamical system defined on a manifold and embedding trajectories into a Lie group. Subsequently, KEEC formulates an equivariant value function equation in reinforcement learning on the equivariant geometry, ensuring an invariant effect as the value function on the original manifold. By deriving analytical-form optimal actions on the equivariant value function, KEEC theoretically achieves quadratic convergence for the optimal equivariant value function by leveraging the differential information on the equivariant geometry. The effectiveness of KEEC is demonstrated in challenging dynamical systems, including chaotic ones like Lorenz-63. Notably, our findings indicate that isometric and isomorphic loss functions, ensuring the compactness and smoothness of geometry, outperform loss functions without these properties.
paper_authors: Karanpartap Singh, James Zou
for: Evaluating the quality of watermarks for large language models (LLMs).
methods: Evaluation by an LLM judger following specific guidelines, and binary classification on text embeddings to distinguish watermarked from unwatermarked text.
results: Existing watermarking methods are easily detectable; watermarking degrades text quality, especially the coherence and depth of responses; and richer metrics are needed to capture the various shortcomings of watermarks.
Abstract
With the increasing use of large-language models (LLMs) like ChatGPT, watermarking has emerged as a promising approach for tracing machine-generated content. However, research on LLM watermarking often relies on simple perplexity or diversity-based measures to assess the quality of watermarked text, which can mask important limitations in watermarking. Here we introduce two new easy-to-use methods for evaluating watermarking algorithms for LLMs: 1) evaluation by LLM-judger with specific guidelines; and 2) binary classification on text embeddings to distinguish between watermarked and unwatermarked text. We apply these methods to characterize the effectiveness of current watermarking techniques. Our experiments, conducted across various datasets, reveal that current watermarking methods are detectable by even simple classifiers, challenging the notion of watermarking subtlety. We also found, through the LLM judger, that watermarking impacts text quality, especially in degrading the coherence and depth of the response. Our findings underscore the trade-off between watermark robustness and text quality and highlight the importance of having more informative metrics to assess watermarking quality.
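The second evaluation method can be sketched in a few lines: train a binary classifier on text embeddings and check whether it separates watermarked from unwatermarked text. The embeddings below are random stand-ins (with an artificial mean shift playing the role of the watermark) rather than outputs of a real encoder.

```python
# A minimal sketch of watermark detection via embedding classification.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=(500, 256))    # embeddings of unwatermarked text
marked = rng.normal(0.2, 1.0, size=(500, 256))   # watermark leaves a detectable shift

X = np.vstack([clean, marked])
y = np.array([0] * 500 + [1] * 500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"detection accuracy: {clf.score(X_te, y_te):.2f}")   # well above chance -> detectable
```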
Measuring Distributional Shifts in Text: The Advantage of Language Model-Based Embeddings
results: Experiments show that general-purpose LLM-based embeddings are highly sensitive to data drift compared with other embedding algorithms; the authors also propose "drift sensitivity" as an important metric when evaluating language models.
Abstract
An essential part of monitoring machine learning models in production is measuring input and output data drift. In this paper, we present a system for measuring distributional shifts in natural language data and highlight and investigate the potential advantage of using large language models (LLMs) for this problem. Recent advancements in LLMs and their successful adoption in different domains indicate their effectiveness in capturing semantic relationships for solving various natural language processing problems. The power of LLMs comes largely from the encodings (embeddings) generated in the hidden layers of the corresponding neural network. First we propose a clustering-based algorithm for measuring distributional shifts in text data by exploiting such embeddings. Then we study the effectiveness of our approach when applied to text embeddings generated by both LLMs and classical embedding algorithms. Our experiments show that general-purpose LLM-based embeddings provide a high sensitivity to data drift compared to other embedding methods. We propose drift sensitivity as an important evaluation metric to consider when comparing language models. Finally, we present insights and lessons learned from deploying our framework as part of the Fiddler ML Monitoring platform over a period of 18 months.
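As a rough sketch of a clustering-based drift score of the kind proposed here (not the paper's exact algorithm), the code below clusters a reference window of embeddings and compares cluster-assignment histograms of reference versus production text with the Jensen-Shannon distance.

```python
# A minimal sketch of a clustering-based drift score over text embeddings.
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=(1000, 64))    # stand-ins for LLM embeddings
production = rng.normal(0.5, 1.0, size=(1000, 64))   # a shifted production window

km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(reference)

def cluster_histogram(embeds):
    counts = np.bincount(km.predict(embeds), minlength=km.n_clusters)
    return counts / counts.sum()

drift = jensenshannon(cluster_histogram(reference), cluster_histogram(production))
print(f"drift score: {drift:.3f}")   # near 0 when distributions match, larger under shift
```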
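As a rough illustration of the clustering-based drift measure described above, the sketch below clusters reference embeddings and compares how reference and production batches distribute over those clusters. The `drift_score` helper, the choice of KMeans, and the Jensen-Shannon comparison are assumptions for illustration; the production system described in the paper may differ.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from sklearn.cluster import KMeans

def drift_score(ref_emb, prod_emb, k=10, seed=0):
    """Cluster the reference embeddings, then compare how reference and
    production batches distribute over those clusters."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(ref_emb)

    def hist(emb):
        counts = np.bincount(km.predict(emb), minlength=k).astype(float)
        return counts / counts.sum()

    return jensenshannon(hist(ref_emb), hist(prod_emb))

rng = np.random.default_rng(1)
ref = rng.normal(0, 1, (1000, 32))
prod = rng.normal(0.5, 1, (500, 32))   # a shifted production batch
print(f"JS drift score: {drift_score(ref, prod):.3f}")
```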
paper_authors: Carolina Zheng, Keyon Vafa, David M. Blei
for: This paper compares the effectiveness of methods that combine language models and topic models.
methods: The paper studies four topic-guided language models and two baselines, evaluating each model on four corpora.
results: None of these methods outperform a standard LSTM language model baseline, and most fail to learn good topics. A probe trained on the neural language model further shows that the baseline's hidden states already encode topic information.
Abstract
A recent line of work in natural language processing has aimed to combine language models and topic models. These topic-guided language models augment neural language models with topic models, unsupervised learning methods that can discover document-level patterns of word use. This paper compares the effectiveness of these methods in a standardized setting. We study four topic-guided language models and two baselines, evaluating the held-out predictive performance of each model on four corpora. Surprisingly, we find that none of these methods outperform a standard LSTM language model baseline, and most fail to learn good topics. Further, we train a probe of the neural language model that shows that the baseline's hidden states already encode topic information. We make public all code used for this study.
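The probing experiment behind the last finding can be sketched as a linear classifier trained on pooled hidden states. Everything below is synthetic stand-in data; the point is only the shape of the experiment: if the probe beats chance, the hidden states already encode topic information.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def topic_probe_accuracy(hidden_states, topic_labels):
    """Cross-validated accuracy of a linear probe from states to topics."""
    return cross_val_score(LogisticRegression(max_iter=2000),
                           hidden_states, topic_labels, cv=5).mean()

rng = np.random.default_rng(0)
H = rng.normal(size=(600, 128))    # stand-in for mean-pooled LSTM states
y = rng.integers(0, 5, size=600)   # stand-in for document topic labels
print(f"probe accuracy: {topic_probe_accuracy(H, y):.2f}  (chance is 0.20)")
```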
When it Rains, it Pours: Modeling Media Storms and the News Ecosystem
results: Validates claims about media storm evolution and topical distribution, and provides empirical support for the influence of storms on media coverage and intermedia agenda setting.
Abstract
Most events in the world receive at most brief coverage by the news media. Occasionally, however, an event will trigger a media storm, with voluminous and widespread coverage lasting for weeks instead of days. In this work, we develop and apply a pairwise article similarity model, allowing us to identify story clusters in corpora covering local and national online news, and thereby create a comprehensive corpus of media storms over a nearly two year period. Using this corpus, we investigate media storms at a new level of granularity, allowing us to validate claims about storm evolution and topical distribution, and provide empirical support for previously hypothesized patterns of influence of storms on media coverage and intermedia agenda setting.
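A toy version of the clustering step might look like the following. The paper trains a dedicated pairwise article-similarity model; TF-IDF cosine similarity and the fixed threshold below are substituted purely for illustration.

```python
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def story_clusters(articles, threshold=0.3):
    """Link articles whose pairwise similarity exceeds the threshold and
    treat each connected component as one story cluster."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(articles)
    adj = csr_matrix(cosine_similarity(tfidf) >= threshold)
    _, labels = connected_components(adj, directed=False)
    return labels

docs = ["storm hits coast, thousands evacuated",
        "coastal storm forces mass evacuations",
        "local team wins championship game"]
print(story_clusters(docs))   # e.g. [0 0 1]
```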
A Glitch in the Matrix? Locating and Detecting Language Model Grounding with Fakepedia
paper_authors: Giovanni Monea, Maxime Peyrard, Martin Josifoski, Vishrav Chaudhary, Jason Eisner, Emre Kıcıman, Hamid Palangi, Barun Patra, Robert West
results: GPT-4-turbo shows a strong preference for its parametric knowledge, while Mistral-7B most robustly chooses the grounded answer. Inspection of the computational graph alone predicts LLM grounding with 92.8% accuracy, largely because a few MLPs in the Transformer can predict non-grounded behavior.
Abstract
Large language models (LLMs) have demonstrated impressive capabilities in storing and recalling factual knowledge, but also in adapting to novel in-context information. Yet, the mechanisms underlying their in-context grounding remain unknown, especially in situations where in-context information contradicts factual knowledge embedded in the parameters. This is critical for retrieval-augmented generation methods, which enrich the context with up-to-date information, hoping that grounding can rectify the outdated parametric knowledge. In this study, we introduce Fakepedia, a counterfactual dataset designed to evaluate grounding abilities when the parametric knowledge clashes with the in-context information. We benchmark various LLMs with Fakepedia and discover that GPT-4-turbo has a strong preference for its parametric knowledge. Mistral-7B, on the contrary, is the model that most robustly chooses the grounded answer. Then, we conduct causal mediation analysis on LLM components when answering Fakepedia queries. We demonstrate that inspection of the computational graph alone can predict LLM grounding with 92.8% accuracy, especially because few MLPs in the Transformer can predict non-grounded behavior. Our results, together with existing findings about factual recall mechanisms, provide a coherent narrative of how grounding and factual recall mechanisms interact within LLMs.
results: Extensive experiments on multiple benchmark datasets demonstrate that RVP solves VQA tasks better and handles complex data structures more effectively.
Abstract
Visual Programming (VP) has emerged as a powerful framework for Visual Question Answering (VQA). By generating and executing bespoke code for each question, these methods demonstrate impressive compositional and reasoning capabilities, especially in few-shot and zero-shot scenarios. However, existing VP methods generate all code in a single function, resulting in code that is suboptimal in terms of both accuracy and interpretability. Inspired by human coding practices, we propose Recursive Visual Programming (RVP), which simplifies generated routines, provides more efficient problem solving, and can manage more complex data structures. RVP is inspired by human coding practices and approaches VQA tasks with an iterative recursive code generation approach, allowing decomposition of complicated problems into smaller parts. Notably, RVP is capable of dynamic type assignment, i.e., as the system recursively generates a new piece of code, it autonomously determines the appropriate return type and crafts the requisite code to generate that output. We show RVP's efficacy through extensive experiments on benchmarks including VSR, COVR, GQA, and NextQA, underscoring the value of adopting human-like recursive and modular programming techniques for solving VQA tasks through coding.
Distilled Self-Critique of LLMs with Synthetic Data: a Bayesian Perspective
results: Experiments on safety, sentiment, and privacy control show that dSC can be a viable and cheap alternative for aligning LLMs.
Abstract
This paper proposes an interpretation of RLAIF as Bayesian inference by introducing distilled Self-Critique (dSC), which refines the outputs of a LLM through a Gibbs sampler that is later distilled into a fine-tuned model. Only requiring synthetic data, dSC is exercised in experiments regarding safety, sentiment, and privacy control, showing it can be a viable and cheap alternative to align LLMs. Code released at \url{https://github.com/vicgalle/distilled-self-critique}.
Zero- and Few-Shots Knowledge Graph Triplet Extraction with Large Language Models
results: Enriching the prompt with KB context markedly improves the TE capabilities of LLMs, in some cases matching fully trained baselines built on the BiLSTM network architecture. Model size, by contrast, improves TE capabilities only logarithmically.
Abstract
In this work, we tested the Triplet Extraction (TE) capabilities of a variety of Large Language Models (LLMs) of different sizes in the Zero- and Few-Shots settings. In detail, we proposed a pipeline that dynamically gathers contextual information from a Knowledge Base (KB), both in the form of context triplets and of (sentence, triplets) pairs as examples, and provides it to the LLM through a prompt. The additional context allowed the LLMs to be competitive with all the older fully trained baselines based on the Bidirectional Long Short-Term Memory (BiLSTM) Network architecture. We further conducted a detailed analysis of the quality of the gathered KB context, finding it to be strongly correlated with the final TE performance of the model. In contrast, the size of the model appeared to only logarithmically improve the TE capabilities of the LLMs.
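The prompt-assembly step of the pipeline is straightforward to sketch. The helper below is hypothetical (the exact prompt format used in the paper is not given here); it packs KB context triplets and (sentence, triplets) demonstrations into a single prompt for the LLM.

```python
def build_te_prompt(sentence, context_triplets, examples):
    """Assemble a few-shot triplet-extraction prompt from KB context."""
    fmt = lambda ts: "; ".join(f"({s}, {r}, {o})" for s, r, o in ts)
    lines = ["Extract (subject, relation, object) triplets.",
             "Context: " + fmt(context_triplets)]
    for ex_sentence, ex_triplets in examples:
        lines += [f"Sentence: {ex_sentence}", "Triplets: " + fmt(ex_triplets)]
    lines += [f"Sentence: {sentence}", "Triplets:"]
    return "\n".join(lines)

prompt = build_te_prompt(
    "Marie Curie was born in Warsaw.",
    [("Warsaw", "capital_of", "Poland")],
    [("Einstein was born in Ulm.", [("Einstein", "born_in", "Ulm")])],
)
print(prompt)   # send to the LLM of choice
```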
A Machine Learning Approach Towards SKILL Code Autocompletion
results: Proposes a data-efficient methodology covering the creation of a high-quality SKILL dataset, fine-tuning T5 models with unsupervised and supervised learning, and evaluating the synthesized SKILL code. Models trained this way outperform baselines on human-judgment and BLEU scores, although the very small amount of available SKILL code data still limits how reliably the model can autocomplete SKILL code.
Abstract
As Moore's Law continues to increase the complexity of electronic systems, Electronic Design Automation (EDA) must advance to meet global demand. An important example of an EDA technology is SKILL, a scripting language used to customize and extend EDA software. Recently, code generation models using the transformer architecture have achieved impressive results in academic settings and have even been used in commercial developer tools to improve developer productivity. To the best of our knowledge, this study is the first to apply transformers to SKILL code autocompletion towards improving the productivity of hardware design engineers. In this study, a novel, data-efficient methodology for generating SKILL code is proposed and experimentally validated. More specifically, we propose a novel methodology for (i) creating a high-quality SKILL dataset with both unlabeled and labeled data, (ii) a training strategy where T5 models pre-trained on general programming language code are fine-tuned on our custom SKILL dataset using unsupervised and supervised learning, and (iii) evaluating synthesized SKILL code. We show that models trained using the proposed methodology outperform baselines in terms of human-judgment score and BLEU score. A major challenge faced was the extremely small amount of available SKILL code data that can be used to train a transformer model to generate SKILL code. Despite our validated improvements, the extremely small dataset available to us was still not enough to train a model that can reliably autocomplete SKILL code. We discuss this and other limitations as well as future work that could address these limitations.
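A minimal sketch of the supervised fine-tuning stage, assuming a HuggingFace T5 checkpoint and a toy (prefix, completion) pair standing in for the SKILL dataset; the hyperparameters and data format are illustrative only.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

pairs = [("procedure( myAdd(a b)", "plus(a b) )")]   # hypothetical SKILL pair

tok = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

model.train()
for epoch in range(3):
    for prefix, completion in pairs:
        batch = tok(prefix, return_tensors="pt")
        labels = tok(completion, return_tensors="pt").input_ids
        loss = model(**batch, labels=labels).loss   # teacher-forced seq2seq loss
        loss.backward()
        opt.step()
        opt.zero_grad()
```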
Evaluating Dependencies in Fact Editing for Language Models: Specificity and Implication Awareness
results: Experiments show that existing knowledge-editing methods are sensitive to the surface form of knowledge and have limited ability to infer the implications of edited facts.
Abstract
The potential of using a large language model (LLM) as a knowledge base (KB) has sparked significant interest. To manage the knowledge acquired by LLMs, we need to ensure that the editing of learned facts respects internal logical constraints, which are known as dependency of knowledge. Existing work on editing LLMs has partially addressed the issue of dependency, when the editing of a fact should apply to its lexical variations without disrupting irrelevant ones. However, they neglect the dependency between a fact and its logical implications. We propose an evaluation protocol with an accompanying question-answering dataset, DepEdit, that provides a comprehensive assessment of the editing process considering the above notions of dependency. Our protocol involves setting up a controlled environment in which we edit facts and monitor their impact on LLMs, along with their implications based on If-Then rules. Extensive experiments on DepEdit show that existing knowledge editing methods are sensitive to the surface form of knowledge, and that they have limited performance in inferring the implications of edited facts.
Prompting Disentangled Embeddings for Knowledge Graph Completion with Pre-trained Language Model
paper_authors: Yuxia Geng, Jiaoyan Chen, Yuhang Zeng, Zhuo Chen, Wen Zhang, Jeff Z. Pan, Yuxiang Wang, Xiaoliang Xu
for: The application of pre-trained language models (PLMs) to knowledge graph completion (KGC).
methods: Proposes a new KGC method named PDKGC that uses two prompts: a hard task prompt and a disentangled structure prompt. The prompts are trained on a frozen PLM to perform the KGC task and are combined with textual information for more comprehensive entity prediction.
results: Solid evaluation on two widely used KGC datasets shows that PDKGC often outperforms the baselines and that each of its components is effective. Code and data are available at https://github.com/genggengcss/PDKGC.
Abstract
Both graph structures and textual information play a critical role in Knowledge Graph Completion (KGC). With the success of Pre-trained Language Models (PLMs) such as BERT, they have been applied for text encoding for KGC. However, the current methods mostly prefer to fine-tune PLMs, leading to huge training costs and limited scalability to larger PLMs. In contrast, we propose to utilize prompts and perform KGC on a frozen PLM with only the prompts trained. Accordingly, we propose a new KGC method named PDKGC with two prompts -- a hard task prompt which is to adapt the KGC task to the PLM pre-training task of token prediction, and a disentangled structure prompt which learns disentangled graph representation so as to enable the PLM to combine more relevant structure knowledge with the text information. With the two prompts, PDKGC builds a textual predictor and a structural predictor, respectively, and their combination leads to more comprehensive entity prediction. Solid evaluation on two widely used KGC datasets has shown that PDKGC often outperforms the baselines including the state-of-the-art, and its components are all effective. Our codes and data are available at https://github.com/genggengcss/PDKGC.
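The frozen-PLM-plus-trained-prompt idea can be sketched as generic soft prompt tuning: only a handful of prompt vectors (and a small head) receive gradients while the backbone stays frozen. This is a sketch of the mechanism under assumed settings, not PDKGC's actual architecture; the model name, prompt length, and classification head are illustrative.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class SoftPromptClassifier(nn.Module):
    """Soft prompt tuning over a frozen PLM (a sketch, not PDKGC itself)."""
    def __init__(self, name="bert-base-uncased", n_prompt=8, n_labels=2):
        super().__init__()
        self.plm = AutoModel.from_pretrained(name)
        for p in self.plm.parameters():
            p.requires_grad = False                         # frozen backbone
        d = self.plm.config.hidden_size
        self.prompt = nn.Parameter(torch.randn(n_prompt, d) * 0.02)
        self.head = nn.Linear(d, n_labels)

    def forward(self, input_ids, attention_mask):
        words = self.plm.get_input_embeddings()(input_ids)  # (B, T, d)
        B = input_ids.size(0)
        prompt = self.prompt.unsqueeze(0).expand(B, -1, -1) # (B, P, d)
        embeds = torch.cat([prompt, words], dim=1)
        mask = torch.cat([torch.ones(B, prompt.size(1), dtype=attention_mask.dtype),
                          attention_mask], dim=1)
        hidden = self.plm(inputs_embeds=embeds, attention_mask=mask).last_hidden_state
        return self.head(hidden[:, 0])                      # predict from the first position
```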
Exchange-of-Thought: Enhancing Large Language Model Capabilities through Cross-Model Communication
results: Experiments show that EoT excels across a variety of complex reasoning tasks, surpassing established baselines while remaining cost-effective.
Abstract
Large Language Models (LLMs) have recently made significant strides in complex reasoning tasks through the Chain-of-Thought technique. Despite this progress, their reasoning is often constrained by their intrinsic understanding, lacking external insights. To address this, we propose Exchange-of-Thought (EoT), a novel framework that enables cross-model communication during problem-solving. Drawing inspiration from network topology, EoT integrates four unique communication paradigms: Memory, Report, Relay, and Debate. This paper delves into the communication dynamics and volume associated with each paradigm. To counterbalance the risks of incorrect reasoning chains, we implement a robust confidence evaluation mechanism within these communications. Our experiments across diverse complex reasoning tasks demonstrate that EoT significantly surpasses established baselines, underscoring the value of external insights in enhancing LLM performance. Furthermore, we show that EoT achieves these superior results in a cost-effective manner, marking a promising advancement for efficient and collaborative AI problem-solving.
Retrieval-augmented Multi-modal Chain-of-Thoughts Reasoning for Large Language Models
paper_authors: Bingshuai Liu, Chenyang Lyu, Zijun Min, Zhanyu Wang, Jinsong Su, Longyue Wang
for: This paper aims to improve the performance of large language models (LLMs) in multi-modal question answering tasks by addressing the challenge of selecting optimal chain of thought (CoT) demonstration examples.
methods: The proposed approach uses retrieval mechanisms to dynamically and automatically select demonstration examples based on cross-modal similarities, and employs a stratified sampling method to promote the diversity of demonstration examples.
results: The proposed approach significantly improves the performance of LLMs in multi-modal reasoning tasks, achieving state-of-the-art results on the ScienceQA dataset. Specifically, the ChatGPT-based approach outperforms Chameleon (ChatGPT) by 2.74% and the GPT-4-based approach surpasses Chameleon (GPT-4) by 0.89%. The best-performing model shows a 6.05% increase over Chameleon for ChatGPT-based models and a 4.57% increase for GPT-4-based models.
Abstract
The advancement of Large Language Models(LLMs) has brought substantial attention to the Chain of Thought(CoT) approach, primarily due to its ability to enhance the capability of LLMs on tasks requiring complex reasoning. Moreover, the significance of CoT approaches extends to the application of LLMs for multi-modal tasks, such as multi-modal question answering. However, the selection of optimal CoT demonstration examples in multi-modal reasoning for LLMs remains less explored for LLMs due to the inherent complexity of multi-modal examples. In this paper, we introduce a novel approach that addresses this challenge by using retrieval mechanisms to dynamically and automatically select demonstration examples based on cross-modal similarities. This method aims to refine the CoT reasoning process in multi-modal scenarios via informing LLMs with more relevant and informative examples. Furthermore, we employ a stratified sampling method categorising demonstration examples into groups based on their types and retrieving examples from different groups respectively to promote the diversity of demonstration examples. Through a series of experiments, we demonstrate that our approach significantly improves the performance of LLMs, achieving state-of-the-art results in multi-modal reasoning tasks. Specifically, our methods demonstrate significant advancements on the ScienceQA dataset. While our method based on ChatGPT outperforms the Chameleon(ChatGPT) by 2.74% with an accuracy of 82.67%, the GPT4-based approach surpasses the Chameleon(GPT-4) by 0.89%, achieving 87.43% on accuracy under the same setting. Moreover, our best performing show a 6.05% increase over Chameleon for ChatGPT-based models and a 4.57% increase for GPT-4-based models.
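The retrieval-plus-stratification step might be sketched as below: examples are grouped by type and the top-k most similar ones are retrieved within each group, so the selected demonstrations stay diverse. The `retrieve_demonstrations` helper and the dict-based example format are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def retrieve_demonstrations(query_emb, pool, k_per_group=2):
    """pool: list of dicts with 'embedding' and 'type' keys. Retrieve the
    top-k most similar examples within each type group."""
    groups = {}
    for ex in pool:
        groups.setdefault(ex["type"], []).append(ex)
    selected = []
    for _, members in sorted(groups.items()):
        sims = [float(np.dot(query_emb, m["embedding"]) /
                      (np.linalg.norm(query_emb) * np.linalg.norm(m["embedding"])))
                for m in members]
        order = np.argsort(sims)[::-1][:k_per_group]
        selected += [members[i] for i in order]
    return selected

pool = [{"type": "counting", "embedding": np.array([1.0, 0.0])},
        {"type": "counting", "embedding": np.array([0.9, 0.1])},
        {"type": "spatial", "embedding": np.array([0.0, 1.0])}]
print([ex["type"] for ex in retrieve_demonstrations(np.array([1.0, 0.2]), pool, k_per_group=1)])
```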
Mitigating Fine-Grained Hallucination by Fine-Tuning Large Vision-Language Models with Caption Rewrites
paper_authors: Lei Wang, Jiabang He, Shenshen Li, Ning Liu, Ee-Peng Lim
for: Reducing fine-grained object hallucination in large vision-language models (LVLMs).
methods: Proposes a framework, \textit{ReCaption}, with two components: rewriting captions using ChatGPT and fine-tuning instruction-tuned LVLMs on the rewritten captions.
results: Experiments show that ReCaption effectively reduces fine-grained object hallucination across different LVLM options and improves their text generation quality.
Abstract
Large language models (LLMs) have shown remarkable performance in natural language processing (NLP) tasks. To comprehend and execute diverse human instructions over image data, instruction-tuned large vision-language models (LVLMs) have been introduced. However, LVLMs may suffer from different types of object hallucinations. Nevertheless, LVLMs are evaluated for coarse-grained object hallucinations only (i.e., generated objects non-existent in the input image). The fine-grained object attributes and behaviors non-existent in the image may still be generated but not measured by the current evaluation methods. In this paper, we thus focus on reducing fine-grained hallucinations of LVLMs. We propose \textit{ReCaption}, a framework that consists of two components: rewriting captions using ChatGPT and fine-tuning the instruction-tuned LVLMs on the rewritten captions. We also propose a fine-grained probing-based evaluation method named \textit{Fine-Grained Object Hallucination Evaluation} (\textit{FGHE}). Our experiment results demonstrate that ReCaption effectively reduces fine-grained object hallucination for different LVLM options and improves their text generation quality. The code can be found at https://github.com/Anonymousanoy/FOHE.
Voice-Based Smart Assistant System for Vehicles using RASA
results: Experiments indicate that the voice-assistant application improves in-vehicle safety and the driving experience while reducing driver distraction.
Abstract
Conversational AIs, or chatbots, mimic human speech when conversing. Smart assistants facilitate the automation of several tasks that needed human intervention earlier. Because of their accuracy, absence of dependence on human resources, and accessibility around the clock, chatbots can be employed in vehicles too. Due to people's propensity to divert their attention away from the task of driving while engaging in other activities like calling, playing music, navigation, and getting updates on the weather forecast and latest news, road safety has declined and accidents have increased as a result. It would be advantageous to automate these tasks using voice commands rather than carrying them out manually. This paper focuses on the development of a voice-based smart assistance application for vehicles based on the RASA framework. The smart assistant provides functionalities like navigation, communication via calls, getting weather forecasts and the latest news updates, and music that are completely voice-based in nature.
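In the RASA framework, functionality like weather lookups is typically wired in through custom actions. The sketch below shows what such an action could look like with `rasa_sdk`; the action name, slot name, and placeholder response are hypothetical, and a deployed assistant would call a real weather API.

```python
from rasa_sdk import Action, Tracker
from rasa_sdk.executor import CollectingDispatcher

class ActionGetWeather(Action):
    """Hypothetical custom action behind a weather-forecast intent."""

    def name(self) -> str:
        return "action_get_weather"

    def run(self, dispatcher: CollectingDispatcher, tracker: Tracker, domain: dict) -> list:
        city = tracker.get_slot("city") or "your location"
        # A real assistant would query a weather API here.
        dispatcher.utter_message(text=f"Fetching the forecast for {city}.")
        return []
```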
Expand BERT Representation with Visual Information via Grounded Language Learning with Multimodal Partial Alignment
methods: We propose a grounded language learning method named GroundedBERT that combines contextual representations learned from language corpora with visual information learned from visually grounded datasets. Optimal Transport (OT), specifically its partial variant, is used to solve the fractional alignment problem between the two modalities.
results: The proposed method significantly outperforms baseline language models on various language tasks of the GLUE and SQuAD datasets.
Abstract
Language models have been supervised with both language-only objective and visual grounding in existing studies of visual-grounded language learning. However, due to differences in the distribution and scale of visual-grounded datasets and language corpora, the language model tends to mix up the context of the tokens that occurred in the grounded data with those that do not. As a result, during representation learning, there is a mismatch between the visual information and the contextual meaning of the sentence. To overcome this limitation, we propose GroundedBERT - a grounded language learning method that enhances the BERT representation with visually grounded information. GroundedBERT comprises two components: (i) the original BERT which captures the contextual representation of words learned from the language corpora, and (ii) a visual grounding module which captures visual information learned from visual-grounded datasets. Moreover, we employ Optimal Transport (OT), specifically its partial variant, to solve the fractional alignment problem between the two modalities. Our proposed method significantly outperforms the baseline language models on various language tasks of the GLUE and SQuAD datasets.
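The partial-OT alignment can be sketched with entropic Sinkhorn iterations plus the standard dummy-point reduction, in which a virtual bin on each side absorbs the unmatched mass. This is a generic numerical sketch under illustrative settings (cosine-style cost, `mass=0.7`, toy unit-norm features), not GroundedBERT's implementation.

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.05, iters=500):
    """Entropic OT: transport plan between histograms a and b given cost C."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def partial_alignment(text_feats, vis_feats, mass=0.7, eps=0.05):
    """Partial OT via dummy points: each side gets a zero-cost bin that
    absorbs the 1 - mass fraction allowed to stay unmatched."""
    n, m = len(text_feats), len(vis_feats)
    C = 1 - text_feats @ vis_feats.T               # cosine-style cost (unit-norm rows)
    C_ext = np.zeros((n + 1, m + 1))
    C_ext[:n, :m] = C
    a_ext = np.append(np.full(n, 1 / n), 1 - mass)
    b_ext = np.append(np.full(m, 1 / m), 1 - mass)
    P = sinkhorn(a_ext, b_ext, C_ext, eps)
    return P[:n, :m]                               # alignment between real tokens/regions

rng = np.random.default_rng(0)
T = rng.normal(size=(5, 16)); T /= np.linalg.norm(T, axis=1, keepdims=True)
V = rng.normal(size=(3, 16)); V /= np.linalg.norm(V, axis=1, keepdims=True)
print(partial_alignment(T, V).round(3))
```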
paper_authors: Cong-Duy Nguyen, Thong Nguyen, Duc Anh Vu, Luu Anh Tuan
for: The paper addresses multimodal sentiment analysis, specifically the limitations of previous methods in capturing the variation in sentiment scores within the same class and the significance of unimodal representations in the fusion vector.
methods: The paper proposes a framework called Supervised Angular-based Contrastive Learning for Multimodal Sentiment Analysis, which enhances the discrimination and generalizability of the multimodal representation and overcomes biases in the fusion vector's modality.
results: The experimental results, along with visualizations on two widely used datasets, demonstrate the effectiveness of the proposed approach.
Abstract
The effectiveness of a model is heavily reliant on the quality of the fusion representation of multiple modalities in multimodal sentiment analysis. Moreover, each modality is extracted from raw input and integrated with the rest to construct a multimodal representation. Although previous methods have proposed multimodal representations and achieved promising results, most of them focus on forming positive and negative pairs, neglecting the variation in sentiment scores within the same class. Additionally, they fail to capture the significance of unimodal representations in the fusion vector. To address these limitations, we introduce a framework called Supervised Angular-based Contrastive Learning for Multimodal Sentiment Analysis. This framework aims to enhance discrimination and generalizability of the multimodal representation and overcome biases in the fusion vector's modality. Our experimental results, along with visualizations on two widely used datasets, demonstrate the effectiveness of our approach.
Explaining with Contrastive Phrasal Highlighting: A Case Study in Assisting Humans to Detect Translation Differences
for: The purpose of this study is to explain how NLP models predict semantic divergence between two input texts.
methods: The study uses a new technique called phrase-alignment-guided erasure to generate contrastive highlights that explain the model's predictions.
results: The resulting highlights match human rationales better than popular post-hoc saliency techniques and help people detect fine-grained meaning differences in human translations as well as critical machine translation errors.
Abstract
Explainable NLP techniques primarily explain by answering "Which tokens in the input are responsible for this prediction?''. We argue that for NLP models that make predictions by comparing two input texts, it is more useful to explain by answering "What differences between the two inputs explain this prediction?''. We introduce a technique to generate contrastive highlights that explain the predictions of a semantic divergence model via phrase-alignment-guided erasure. We show that the resulting highlights match human rationales of cross-lingual semantic differences better than popular post-hoc saliency techniques and that they successfully help people detect fine-grained meaning differences in human translations and critical machine translation errors.
A Challenging Multimodal Video Summary: Simultaneously Extracting and Generating Keyframe-Caption Pairs from Video
methods: Proposes a practical multimodal video summarization task setting with an accompanying dataset and evaluation framework. The task requires jointly optimizing keyframe selection and caption quality, accounting for the mutual dependence between preceding and subsequent keyframes and captions.
results: Two baseline systems are developed and their respective performance is reported.
Abstract
This paper proposes a practical multimodal video summarization task setting and a dataset to train and evaluate the task. The target task involves summarizing a given video into a predefined number of keyframe-caption pairs and displaying them in a listable format to grasp the video content quickly. This task aims to extract crucial scenes from the video in the form of images (keyframes) and generate corresponding captions explaining each keyframe's situation. This task is useful as a practical application and presents a highly challenging problem worthy of study. Specifically, achieving simultaneous optimization of the keyframe selection performance and caption quality necessitates careful consideration of the mutual dependence on both preceding and subsequent keyframes and captions. To facilitate subsequent research in this field, we also construct a dataset by expanding upon existing datasets and propose an evaluation framework. Furthermore, we develop two baseline systems and report their respective performance.
results: The analysis shows that doctors' professional social connections can predict medical referrals, potentially enhancing collaboration within organizations and improving healthcare services. By dissecting the underlying mechanisms in primary-specialty referrals, the study provides valuable insights for enhancing patient care and effective healthcare management.
Abstract
Medical referrals between primary care physicians (PC) and specialist care (SC) physicians profoundly impact patient care regarding quality, satisfaction, and cost. This paper investigates the influence of professional networks among medical doctors on referring patients from PC to SC. Using five-year consultation data from a Portuguese private health provider, we conducted exploratory data analysis and constructed both professional and referral networks among physicians. We then apply Graph Neural Network (GNN) models to learn latent representations of the referral network. Our analysis supports the hypothesis that doctors' professional social connections can predict medical referrals, potentially enhancing collaboration within organizations and improving healthcare services. This research contributes to dissecting the underlying mechanisms in primary-specialty referrals, thereby providing valuable insights for enhancing patient care and effective healthcare management.
FaultFormer: Transformer-based Prediction of Bearing Faults
results: The Transformer automatically extracts features from the signals and learns both global and local relationships for classification. Two pretraining strategies are also proposed so that large, generalizable models can adapt to new data, situations, or machinery on the production floor.
Abstract
The growth of deep learning in the past decade has motivated important applications to smart manufacturing and machine health monitoring. In particular, vibration data offers a rich and reliable source to provide meaningful insights into machine health and predictive maintenance. In this work, we present a Transformer based framework for analyzing vibration signals to predict different types of bearing faults (FaultFormer). In particular, we process signal data using data augmentations and extract their Fourier modes to train a transformer encoder to achieve state of the art accuracies. The attention mechanism as well as model outputs were analyzed to confirm the transformer's ability to automatically extract features within signals and learn both global and local relationships to make classifications. Lastly, two pretraining strategies were proposed to pave the way for large, generalizable transformers that could adapt to new data, situations, or machinery on the production floor.
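The signal-to-Fourier-modes-to-transformer pipeline can be sketched in a few lines of PyTorch. The layer sizes, the number of retained modes, and the mean-pooling over modes are illustrative choices, not the paper's configuration.

```python
import torch
import torch.nn as nn

class FaultFormerSketch(nn.Module):
    """Sketch: Fourier modes of a vibration signal -> transformer encoder
    -> bearing-fault class."""
    def __init__(self, n_modes=64, d_model=64, n_classes=4):
        super().__init__()
        self.n_modes = n_modes
        self.embed = nn.Linear(2, d_model)   # (real, imag) per mode -> token
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, signal):                                # signal: (B, T)
        spec = torch.fft.rfft(signal)[:, : self.n_modes]      # leading Fourier modes
        tokens = torch.stack([spec.real, spec.imag], dim=-1)  # (B, n_modes, 2)
        h = self.encoder(self.embed(tokens))
        return self.head(h.mean(dim=1))                       # pool over modes

logits = FaultFormerSketch()(torch.randn(8, 2048))
print(logits.shape)   # torch.Size([8, 4])
```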
On the Trade-Off between Stability and Representational Capacity in Graph Neural Networks
results: The stability of GNNs within the EdgeNet framework depends on the EdgeNet category: GNNs with fewer degrees of freedom in their parameter space (at the price of lower representational capacity) are more stable. The key factor behind this trade-off is the eigenvector misalignment between the EdgeNet parameter matrices and the graph shift operator; for example, graph convolutional neural networks that assign a single scalar per signal shift (a perfect alignment) are more stable than their node- or edge-varying counterparts.
Abstract
Analyzing the stability of graph neural networks (GNNs) under topological perturbations is key to understanding their transferability and the role of each architecture component. However, stability has been investigated only for particular architectures, questioning whether it holds for a broader spectrum of GNNs or only for a few instances. To answer this question, we study the stability of EdgeNet: a general GNN framework that unifies more than twenty solutions including the convolutional and attention-based classes, as well as graph isomorphism networks and hybrid architectures. We prove that all GNNs within the EdgeNet framework are stable to topological perturbations. By studying the effect of different EdgeNet categories on the stability, we show that GNNs with fewer degrees of freedom in their parameter space, linked to a lower representational capacity, are more stable. The key factor yielding this trade-off is the eigenvector misalignment between the EdgeNet parameter matrices and the graph shift operator. For example, graph convolutional neural networks that assign a single scalar per signal shift (hence, with a perfect alignment) are more stable than the more involved node or edge-varying counterparts. Extensive numerical results corroborate our theoretical findings and highlight the role of different architecture components in the trade-off.
RINAS: Training with Dataset Shuffling Can Be General and Fast
results: Experiments show that RINAS improves the throughput of general language model training and vision model training by up to 59% and 89%, respectively, particularly on large datasets.
Abstract
Deep learning datasets are expanding at an unprecedented pace, creating new challenges for data processing in model training pipelines. A crucial aspect of these pipelines is dataset shuffling, which significantly improves unbiased learning and convergence accuracy by adhering to the principles of random sampling. However, loading shuffled data for large datasets incurs significant overhead in the deep learning pipeline and severely impacts the end-to-end training throughput. To mitigate this, current deep learning systems often resort to partial dataset shuffling, sacrificing global randomness to maintain acceptable training throughput on large datasets, still leaving global shuffling efficiency issues not fully explored. In this work, we present RINAS, a data loading framework that systematically addresses the performance bottleneck of loading global shuffled datasets. Our key contribution is to offer an intra-batch unordered data fetching approach, which unleashes unexplored parallelism of data loading. We implement RINAS under the PyTorch framework for common dataset libraries HuggingFace and TorchVision. Our experimental results show that RINAS improves the throughput of general language model training and vision model training by up to 59% and 89%, respectively.
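The intra-batch unordered fetching idea can be illustrated with a thread pool: a global shuffle fixes which indices belong to each batch, but samples are yielded in whatever order their fetches complete, so slow reads overlap while the randomness of sampling is preserved. This toy generator is a stand-in, not RINAS's actual loader.

```python
import random
from concurrent.futures import ThreadPoolExecutor, as_completed

def unordered_batches(dataset, batch_size, num_workers=8, seed=0):
    """Globally shuffled batches whose samples arrive in completion order."""
    indices = list(range(len(dataset)))
    random.Random(seed).shuffle(indices)
    with ThreadPoolExecutor(num_workers) as pool:
        for start in range(0, len(indices), batch_size):
            futures = [pool.submit(dataset.__getitem__, i)
                       for i in indices[start:start + batch_size]]
            yield [f.result() for f in as_completed(futures)]

class ToyDataset:   # stand-in for a real dataset with slow I/O
    def __len__(self):
        return 10
    def __getitem__(self, i):
        return i

for batch in unordered_batches(ToyDataset(), batch_size=4):
    print(batch)
```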
A Waddington landscape for prototype learning in generalized Hopfield networks
results: As the strength of the network nonlinearity increases, learning undergoes a 'feature-to-prototype' transition in which the learned states of internal memories move from mixed to pure states. The authors also identify reproducible splits and saddle points in the landscape, through which the differentiation of memories can be controlled and predicted.
Abstract
Networks in machine learning offer examples of complex high-dimensional dynamical systems reminiscent of biological systems. Here, we study the learning dynamics of Generalized Hopfield networks, which permit a visualization of internal memories. These networks have been shown to proceed through a 'feature-to-prototype' transition, as the strength of network nonlinearity is increased, wherein the learned, or terminal, states of internal memories transition from mixed to pure states. Focusing on the prototype learning dynamics of the internal memories we observe a strong resemblance to the canalized, or low-dimensional, dynamics of cells as they differentiate within a Waddingtonian landscape. Dynamically, we demonstrate that learning in a Generalized Hopfield Network proceeds through sequential 'splits' in memory space. Furthermore, order of splitting is interpretable and reproducible. The dynamics between the splits are canalized in the Waddington sense -- robust to variations in detailed aspects of the system. In attempting to make the analogy a rigorous equivalence, we study smaller subsystems that exhibit similar properties to the full system. We combine analytical calculations with numerical simulations to study the dynamical emergence of the feature-to-prototype transition, and the behaviour of splits in the landscape, saddles points, visited during learning. We exhibit regimes where saddles appear and disappear through saddle-node bifurcations, qualitatively changing the distribution of learned memories as the strength of the nonlinearity is varied -- allowing us to systematically investigate the mechanisms that underlie the emergence of Waddingtonian dynamics. Memories can thus differentiate in a predictive and controlled way, revealing new bridges between experimental biology, dynamical systems theory, and machine learning.
FLea: Improving federated learning on scarce and label-skewed data via privacy-preserving feature augmentation
results: In extensive experiments, FLea outperforms state-of-the-art FL methods that share only model parameters by up to $17.6\%$ and FL methods that share data augmentations by up to $6.3\%$, while reducing the privacy vulnerability associated with shared data augmentations.
Abstract
Learning a global model by abstracting the knowledge, distributed across multiple clients, without aggregating the raw data is the primary goal of Federated Learning (FL). Typically, this works in rounds alternating between parallel local training at several clients, followed by model aggregation at a server. We found that existing FL methods under-perform when local datasets are small and present severe label skew as these lead to over-fitting and local model bias. This is a realistic setting in many real-world applications. To address the problem, we propose \textit{FLea}, a unified framework that tackles over-fitting and local bias by encouraging clients to exchange privacy-protected features to aid local training. The features refer to activations from an intermediate layer of the model, which are obfuscated before being shared with other clients to protect sensitive information in the data. \textit{FLea} leverages a novel way of combining local and shared features as augmentations to enhance local model learning. Our extensive experiments demonstrate that \textit{FLea} outperforms the start-of-the-art FL methods, sharing only model parameters, by up to $17.6\%$, and FL methods that share data augmentations by up to $6.3\%$, while reducing the privacy vulnerability associated with shared data augmentations.
Reconsideration on evaluation of machine learning models in continuous monitoring using wearables
results: The study shows that real-world variability, disease dynamics, user-specific characteristics, and the prevalence of false notifications complicate the evaluation of ML models, necessitating novel evaluation strategies.
Abstract
This paper explores the challenges in evaluating machine learning (ML) models for continuous health monitoring using wearable devices beyond conventional metrics. We state the complexities posed by real-world variability, disease dynamics, user-specific characteristics, and the prevalence of false notifications, necessitating novel evaluation strategies. Drawing insights from large-scale heart studies, the paper offers a comprehensive guideline for robust ML model evaluation on continuous health monitoring.
paper_authors: Alakananda Mitra, Sahila Beegum, David Fleisher, Vangimalla R. Reddy, Wenguang Sun, Chittaranjan Ray, Dennis Timlin, Arindam Malakar
for: This paper aims to develop a machine learning model to predict cotton yield based on climate change, soil diversity, cultivar, and inorganic nitrogen levels.
methods: The authors combine field data from the 1980s to the 1990s with process-based crop modeling to develop a machine learning model that can accurately predict cotton yield.
results: The Random Forest Regressor achieved a 97.75% accuracy rate, with a root mean square error of 55.05 kg/ha and an R2 of around 0.98, demonstrating the potential of machine learning techniques for supporting the cotton industry's climate-smart initiatives.
Abstract
The cotton industry in the United States is committed to sustainable production practices that minimize water, land, and energy use while improving soil health and cotton output. Climate-smart agricultural technologies are being developed to boost yields while decreasing operating expenses. Crop yield prediction, on the other hand, is difficult because of the complex and nonlinear impacts of cultivar, soil type, management, pest and disease, climate, and weather patterns on crops. To solve this issue, we employ machine learning (ML) to forecast production while considering climate change, soil diversity, cultivar, and inorganic nitrogen levels. From the 1980s to the 1990s, field data were gathered across the southern cotton belt of the United States. To capture the most current effects of climate change over the previous six years, a second data source was produced using the process-based crop model, GOSSYM. We concentrated our efforts on three distinct areas inside each of the three southern states: Texas, Mississippi, and Georgia. To simplify the amount of computations, accumulated heat units (AHU) for each set of experimental data were employed as an analogy to use time-series weather data. The Random Forest Regressor yielded a 97.75% accuracy rate, with a root mean square error of 55.05 kg/ha and an R2 of around 0.98. These findings demonstrate how an ML technique may be developed and applied as a reliable and easy-to-use model to support the cotton climate-smart initiative.
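The modeling step reduces to a standard Random Forest regression. The sketch below uses synthetic features mirroring the paper's inputs (accumulated heat units, soil type, cultivar, inorganic nitrogen); the data, coefficients, and hyperparameters are fabricated for illustration and will not reproduce the reported 97.75% accuracy.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.uniform(1500, 2600, n),   # accumulated heat units (AHU)
    rng.integers(0, 3, n),        # soil-type code
    rng.integers(0, 4, n),        # cultivar code
    rng.uniform(0, 200, n),       # inorganic N (kg/ha)
])
y = 900 + 0.4 * X[:, 0] + 2.5 * X[:, 3] + rng.normal(0, 60, n)  # synthetic yield

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)
pred = model.predict(X_te)
print(f"RMSE: {mean_squared_error(y_te, pred) ** 0.5:.1f} kg/ha, "
      f"R2: {r2_score(y_te, pred):.3f}")
```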
for: solves a class of convex Finite-Sum Coupled Compositional Stochastic Optimization (cFCCO) problems with applications in group distributionally robust optimization (GDRO), reinforcement learning, and learning to rank.
methods: introduces a unified family of efficient single-loop primal-dual block-coordinate proximal algorithms called ALEXR, which leverages block-coordinate stochastic mirror ascent updates for the dual variable and stochastic proximal gradient descent updates for the primal variable.
results: establishes the convergence rates of ALEXR in both convex and strongly convex cases under smoothness and non-smoothness conditions of involved functions, which improve the best rates in previous works on smooth cFCCO problems and expand the realm of cFCCO for solving more challenging non-smooth problems such as the dual form of GDRO.
Abstract
This paper revisits a class of convex Finite-Sum Coupled Compositional Stochastic Optimization (cFCCO) problems with many applications, including group distributionally robust optimization (GDRO), reinforcement learning, and learning to rank. To better solve these problems, we introduce a unified family of efficient single-loop primal-dual block-coordinate proximal algorithms, dubbed ALEXR. This algorithm leverages block-coordinate stochastic mirror ascent updates for the dual variable and stochastic proximal gradient descent updates for the primal variable. We establish the convergence rates of ALEXR in both convex and strongly convex cases under smoothness and non-smoothness conditions of involved functions, which not only improve the best rates in previous works on smooth cFCCO problems but also expand the realm of cFCCO for solving more challenging non-smooth problems such as the dual form of GDRO. Finally, we present lower complexity bounds to demonstrate that the convergence rates of ALEXR are optimal among first-order block-coordinate stochastic algorithms for the considered class of cFCCO problems.
results: As the training set grows, the optimal classifier can change considerably, and the observed power-law behavior may connect to the scaling laws previously observed in natural language and image datasets.
Abstract
We demonstrate the emergence of scaling laws in the benchmark top versus QCD jet classification problem in collider physics. Six distinct physically-motivated classifiers exhibit power-law scaling of the binary cross-entropy test loss as a function of training set size, with distinct power law indices. This result highlights the importance of comparing classifiers as a function of dataset size rather than for a fixed training set, as the optimal classifier may change considerably as the dataset is scaled up. We speculate on the interpretation of our results in terms of previous models of scaling laws observed in natural language and image datasets.
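Fitting such a power law is a one-liner with `scipy.optimize.curve_fit`; the loss values below are illustrative, not the paper's measurements.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    """Test loss vs training-set size: L(n) = a * n^(-b) + c."""
    return a * np.power(n, -b) + c

sizes = np.array([1e3, 3e3, 1e4, 3e4, 1e5, 3e5])
losses = np.array([0.52, 0.44, 0.38, 0.34, 0.31, 0.295])   # illustrative numbers

(a, b, c), _ = curve_fit(power_law, sizes, losses, p0=(1.0, 0.3, 0.25))
print(f"scaling exponent b = {b:.3f}, irreducible loss c = {c:.3f}")
```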
Learning Polynomial Problems with $SL(2,\mathbb{R})$ Equivariance
results: Neural networks solve these problems efficiently in a data-driven fashion, achieving tenfold speedups while retaining high accuracy, without the poor scaling in dimension and degree. The paper also shows that these polynomial learning problems are equivariant to the non-compact group $SL(2,\mathbb{R})$.
Abstract
Optimizing and certifying the positivity of polynomials are fundamental primitives across mathematics and engineering applications, from dynamical systems to operations research. However, solving these problems in practice requires large semidefinite programs, with poor scaling in dimension and degree. In this work, we demonstrate for the first time that neural networks can effectively solve such problems in a data-driven fashion, achieving tenfold speedups while retaining high accuracy. Moreover, we observe that these polynomial learning problems are equivariant to the non-compact group $SL(2,\mathbb{R})$, which consists of area-preserving linear transformations. We therefore adapt our learning pipelines to accommodate this structure, including data augmentation, a new $SL(2,\mathbb{R})$-equivariant architecture, and an architecture equivariant with respect to its maximal compact subgroup, $SO(2, \mathbb{R})$. Surprisingly, the most successful approaches in practice do not enforce equivariance to the entire group, which we prove arises from an unusual lack of architecture universality for $SL(2,\mathbb{R})$ in particular. A consequence of this result, which is of independent interest, is that there exists an equivariant function for which there is no sequence of equivariant polynomials multiplied by arbitrary invariants that approximates the original function. This is a rare example of a symmetric problem where data augmentation outperforms a fully equivariant architecture, and provides interesting lessons in both theory and practice for other problems with non-compact symmetries.
Mitigating Data Injection Attacks on Federated Learning
results: Simulations show that once the coordinating node detects and isolates all attackers, the model recovers and converges to the truthful model.
Abstract
Federated learning is a technique that allows multiple entities to collaboratively train models using their data without compromising data privacy. However, despite its advantages, federated learning can be susceptible to false data injection attacks. In these scenarios, a malicious entity with control over specific agents in the network can manipulate the learning process, leading to a suboptimal model. Consequently, addressing these data injection attacks presents a significant research challenge in federated learning systems. In this paper, we propose a novel technique to detect and mitigate data injection attacks on federated learning systems. Our mitigation method is a local scheme, performed during a single instance of training by the coordinating node, allowing the mitigation during the convergence of the algorithm. Whenever an agent is suspected to be an attacker, its data will be ignored for a certain period, this decision will often be re-evaluated. We prove that with probability 1, after a finite time, all attackers will be ignored while the probability of ignoring a trustful agent becomes 0, provided that there is a majority of truthful agents. Simulations show that when the coordinating node detects and isolates all the attackers, the model recovers and converges to the truthful model.
摘要
联邦学习是一种允许多个实体在不泄露数据隐私的前提下,利用各自数据协同训练模型的技术。然而,尽管具有诸多优势,联邦学习仍可能受到虚假数据注入攻击。在这类场景中,控制了网络中特定代理的恶意实体可以操纵学习过程,导致模型次优。因此,应对这类数据注入攻击是联邦学习系统中的一项重要研究挑战。在本文中,我们提出了一种检测并缓解联邦学习系统数据注入攻击的新技术。我们的缓解方法是一种本地方案,由协调节点在单次训练过程中执行,从而可以在算法收敛过程中完成缓解。当某个代理被怀疑为攻击者时,其数据将被忽略一段时间,且该决定会被不断重新评估。我们证明,只要诚实代理占多数,则以概率 1,在有限时间后,所有攻击者都将被忽略,而诚实代理被忽略的概率趋于 0。仿真实验表明,当协调节点检测并隔离所有攻击者时,模型会恢复并收敛到真实模型。
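For intuition, here is a minimal sketch of the coordinator-side logic described above: updates from agents flagged as suspicious are ignored for a cooldown period, and the decision is re-evaluated in later rounds. The deviation test (distance from the coordinate-wise median) and the cooldown length are illustrative assumptions, not the paper's exact detection statistic.

```python
import numpy as np

def aggregate_with_mitigation(updates, suspended, step, cooldown=10, tau=6.0):
    """One aggregation round at the coordinating node.

    updates   : (n_agents, dim) array with each agent's model update
    suspended : dict mapping agent_id -> step until which it is ignored
    step      : current training step
    """
    n_agents = updates.shape[0]
    # Agents whose suspension has expired are re-evaluated this round.
    active = [i for i in range(n_agents) if suspended.get(i, -1) <= step]

    # Illustrative detection rule: flag agents whose update lies far from
    # the coordinate-wise median of the currently active updates.
    med = np.median(updates[active], axis=0)
    scale = np.median(np.linalg.norm(updates[active] - med, axis=1)) + 1e-12
    for i in list(active):
        if np.linalg.norm(updates[i] - med) > tau * scale:
            suspended[i] = step + cooldown  # ignore for a cooldown period
            active.remove(i)

    # Aggregate only over agents currently considered truthful.
    return updates[active].mean(axis=0), suspended
```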
Single-sample versus case-control sampling scheme for Positive Unlabeled data: the story of two scenarios
results: 研究发现,在绝大多数情况下,为病例-对照抽样方案设计的基于ERM的正例未标注数据分类器,在单样本情景下的性能会显著下降。此外,研究还引入了流行的非负风险分类器在单样本情景下的对应版本,并将其与原始方法进行了比较,发现两者在某些情况下(尤其当半数或更多正例被标注时)存在显著差异。Abstract
In the paper we argue that the performance of classifiers based on Empirical Risk Minimization (ERM) for positive unlabeled data, which are designed for the case-control sampling scheme, may significantly deteriorate when applied to a single-sample scenario. We reveal why their behavior depends, in all but very specific cases, on the scenario. Also, we introduce a single-sample analogue of the popular non-negative risk classifier designed for case-control data and compare its performance with the original proposal. We show that significant differences occur between them, especially when half or more of the positive observations are labeled. The opposite case, when the ERM minimizer designed for the case-control scheme is applied to single-sample data, is also considered, and similar conclusions are drawn. Taking the difference of scenarios into account requires a sole, but crucial, change in the definition of the Empirical Risk.
摘要
在本文中,我们论证了为病例-对照抽样方案设计的、基于经验风险最小化(ERM)的正例未标注数据分类器,在应用于单样本情景时性能可能显著下降。我们揭示了为什么除极个别情况外,它们的行为取决于抽样情景。此外,我们引入了为病例-对照数据设计的流行非负风险分类器在单样本情形下的对应版本,并将其性能与原始方法进行比较。我们表明两者之间存在显著差异,尤其是当半数或更多的正例观测被标注时。我们还考虑了相反的情形,即将为病例-对照情形设计的 ERM 最小化器应用于单样本数据,并得出了类似的结论。要考虑情景之间的差异,只需对经验风险的定义做一处微小而关键的修改。
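For context, the case-control non-negative risk estimator that the paper takes as its starting point (due to Kiryo et al., 2017) can be written in a few lines; the paper's contribution is the single-sample analogue, which changes how the Empirical Risk is defined and is not reproduced here. The class prior `pi` is assumed known, and the sigmoid surrogate loss is an illustrative choice.

```python
import numpy as np

def sigmoid_loss(scores, y):
    # surrogate loss l(g(x), y) = 1 / (1 + exp(y * g(x)))
    return 1.0 / (1.0 + np.exp(y * scores))

def nn_pu_risk(scores_p, scores_u, pi, loss=sigmoid_loss):
    """Non-negative PU risk, case-control formulation (Kiryo et al., 2017):
    R = pi * R_p^+ + max(0, R_u^- - pi * R_p^-).

    scores_p : classifier scores g(x) on labeled positive examples
    scores_u : classifier scores g(x) on unlabeled examples
    pi       : class prior P(Y = +1), assumed known
    """
    r_p_plus = loss(scores_p, +1).mean()   # positives treated as positive
    r_p_minus = loss(scores_p, -1).mean()  # positives treated as negative
    r_u_minus = loss(scores_u, -1).mean()  # unlabeled treated as negative
    return pi * r_p_plus + max(0.0, r_u_minus - pi * r_p_minus)
```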
Deep Set Neural Networks for forecasting asynchronous bioprocess timeseries
results: 在这篇论文中,作者使用该方法完成了若干预测任务,并与传统的拟合方法以及基于填充和对齐的方法进行了比较。结果表明,该方法能够更好地处理缺失数据和不规则的时间序列,并提高预测的准确性。Abstract
Cultivation experiments often produce sparse and irregular time series. Classical approaches based on mechanistic models, like Maximum Likelihood fitting or Markov chain Monte Carlo sampling, can easily account for sparsity and time-grid irregularities, but most statistical and Machine Learning tools are not designed for handling sparse data out-of-the-box. Among popular approaches there are various schemes for filling missing values (imputation) and interpolation into a regular grid (alignment). However, such methods transfer the biases of the interpolation or imputation models to the target model. We show that Deep Set Neural Networks equipped with triplet encoding of the input data can successfully handle bio-process data without any need for imputation or alignment procedures. The method is agnostic to the particular nature of the time series and can be adapted for any task, for example, online monitoring, predictive control, design of experiments, etc. In this work, we focus on forecasting. We argue that such an approach is especially suitable for typical cultivation processes, demonstrate the performance of the method on several forecasting tasks using data generated from macrokinetic growth models under realistic conditions, and compare the method to a conventional fitting procedure and methods based on imputation and alignment.
摘要
培养实验经常产生稀疏且不规则的时间序列。基于机理模型的经典方法,如最大似然拟合或马尔可夫链蒙特卡洛采样,可以轻松应对数据稀疏和时间网格不规则,但大多数统计和机器学习工具并非为直接处理稀疏数据而设计。流行的做法包括各种填充缺失值(imputation)和插值到规则网格(alignment)的方案,但这些方法会将插值或填充模型的偏差传递给目标模型。我们表明,配备输入数据三元组编码的深度集合神经网络(Deep Set Neural Networks)可以成功处理生物过程数据,而无需任何填充或对齐步骤。该方法不依赖于时间序列的具体性质,可适用于任何任务,例如在线监测、预测控制、实验设计等。在本工作中,我们聚焦于预测任务。我们认为这种方法特别适合典型的培养过程,并在真实条件下由宏观动力学生长模型生成的数据上展示了该方法在若干预测任务中的性能,同时将其与传统拟合流程以及基于填充和对齐的方法进行了比较。
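A minimal numpy sketch of the idea described above: each observation becomes a (time, value, channel) triplet, every triplet is encoded independently by a network phi, the encodings are sum-pooled (so the model is indifferent to the number and spacing of observations), and a decoder rho produces the forecast. The layer sizes and random weights are placeholders, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d_hid, d_lat = 32, 16

def mlp(x, w1, b1, w2, b2):
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

# Placeholder weights: phi encodes single triplets, rho decodes the pooled set.
phi = (0.1 * rng.normal(size=(3, d_hid)), np.zeros(d_hid),
       0.1 * rng.normal(size=(d_hid, d_lat)), np.zeros(d_lat))
rho = (0.1 * rng.normal(size=(d_lat, d_hid)), np.zeros(d_hid),
       0.1 * rng.normal(size=(d_hid, 1)), np.zeros(1))

def deep_set_forecast(triplets):
    """triplets: (n_obs, 3) rows of (time, value, channel-id).
    n_obs may differ between samples -- no imputation or alignment needed."""
    h = mlp(triplets, *phi)             # encode each observation independently
    z = h.sum(axis=0)                   # permutation-invariant pooling
    return mlp(z[None, :], *rho)[0, 0]  # decode the pooled representation

# Sparse, irregular measurements: (t, value, channel) with gaps and no grid.
obs = np.array([[0.0, 1.2, 0.0], [0.7, 1.5, 0.0], [0.7, 0.3, 1.0], [2.4, 2.1, 0.0]])
print(deep_set_forecast(obs))
```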
Federated Learning is Better with Non-Homomorphic Encryption
paper_authors: Konstantin Burlachenko, Abdulmajeed Alrowithi, Fahad Ali Albalawi, Peter Richtarik
for: 本研究旨在提供一种可靠、安全、可扩展的联邦学习(Federated Learning,FL)框架,以帮助解决传统AI方法所面临的集中式数据收集和隐私问题。
methods: 本研究使用经典密码学原语替代同态加密,以提供安全可靠的训练过程。此外,本研究还提出了一种基于置换(permutation)的压缩算法,以减少计算和存储成本。
results: 本研究的实验结果表明,使用所提出的框架和算法可以降低计算和存储成本,同时保持较高的训练效果。此外,本研究还指出了一些可能的应用场景,如医疗领域和智能家居等。Abstract
Traditional AI methodologies necessitate centralized data collection, which becomes impractical when facing problems with network communication, data privacy, or storage capacity. Federated Learning (FL) offers a paradigm that empowers distributed AI model training without collecting raw data. There are different choices for providing privacy during FL training. One of the popular methodologies is employing Homomorphic Encryption (HE) - a breakthrough in privacy-preserving computation from Cryptography. However, these methods have a price in the form of extra computation and memory footprint. To resolve these issues, we propose an innovative framework that synergizes permutation-based compressors with Classical Cryptography, even though employing Classical Cryptography was assumed to be impossible in the past in the context of FL. Our framework offers a way to replace HE with cheaper Classical Cryptography primitives which provides security for the training process. It fosters asynchronous communication and provides flexible deployment options in various communication topologies.
摘要
传统的人工智能方法需要集中收集数据,这在面临网络通信、数据隐私和存储容量问题时变得不切实际。联邦学习(FL)提供了一种无需收集原始数据即可分布式训练人工智能模型的范式。在FL训练中,有多种保护隐私的方式,其中一种流行的方法是使用同态加密(HE),这是密码学中隐私保护计算的一项突破。然而,这些方法的代价是额外的计算和内存开销。为解决这些问题,我们提出了一个创新框架,将基于置换的压缩器与经典密码学相结合——尽管过去人们认为在FL场景中使用经典密码学是不可行的。我们的框架可以用更廉价的经典密码学原语替代HE,为训练过程提供安全性,并支持异步通信以及多种通信拓扑下的灵活部署。
The GPU Phase Folding and Deep Learning Method for Detecting Exoplanet Transits
results: 比传统的 Box-fitting Least Squares(BLS)方法快三个数量级,并且在相同的假阳性率下具有更高的真阳性率,在相同的召回率下具有更高的精确率。Abstract
This paper presents GPFC, a novel Graphics Processing Unit (GPU) Phase Folding and Convolutional Neural Network (CNN) system to detect exoplanets using the transit method. We devise a fast folding algorithm parallelized on a GPU to amplify low signal-to-noise ratio transit signals, allowing a search at high precision and speed. A CNN trained on two million synthetic light curves reports a score indicating the likelihood of a planetary signal at each period. GPFC improves on speed by three orders of magnitude over the predominant Box-fitting Least Squares (BLS) method. Our simulation results show GPFC achieves 97% training accuracy, higher true positive rate at the same false positive rate of detection, and higher precision at the same recall rate when compared to BLS. GPFC recovers 100% of known ultra-short-period planets in Kepler light curves from a blind search. These results highlight the promise of GPFC as an alternative approach to the traditional BLS algorithm for finding new transiting exoplanets in data taken with Kepler and other space transit missions such as K2, TESS and future PLATO and Earth 2.0.
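The GPU kernel itself is not shown in the abstract; the numpy sketch below illustrates the folding operation that GPFC parallelizes: timestamps are reduced modulo a trial period and binned, so that a weak periodic transit stacks up above the noise. The bin count and the synthetic light curve are illustrative.

```python
import numpy as np

def phase_fold(time, flux, period, n_bins=256):
    """Fold a light curve at a trial period and average flux per phase bin."""
    phase = np.mod(time, period) / period               # phase in [0, 1)
    bins = np.minimum((phase * n_bins).astype(int), n_bins - 1)
    total = np.bincount(bins, weights=flux, minlength=n_bins)
    counts = np.maximum(np.bincount(bins, minlength=n_bins), 1)
    return total / counts                               # mean flux per bin

# A shallow box-shaped transit buried in noise becomes visible after folding.
rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0, 30, 20000))                  # observation times (days)
true_period, depth = 0.62, 2e-3
flux = 1.0 + rng.normal(0, 2e-3, t.size)
flux -= depth * ((np.mod(t, true_period) / true_period) < 0.05)

profile = phase_fold(t, flux, true_period)
print("recovered transit depth ~", 1.0 - profile.min())
```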
GFS: Graph-based Feature Synthesis for Prediction over Relational Databases
paper_authors: Han Zhang, Quan Gan, David Wipf, Weinan Zhang
for: 这篇论文主要是为了解决关系数据库中的数据挖掘和机器学习问题。
methods: 该论文提出了一种名为图特征合成(Graph-based Feature Synthesis,GFS)的新框架,它将关系数据库表述为异质图,以保留数据中固有的关系结构。
results: 在四个真实的多表关系数据库上进行了广泛的实验,GFS的性能超过了之前为关系数据库设计的方法,证明了它的优越性。Abstract
Relational databases are extensively utilized in a variety of modern information system applications, and they always carry valuable data patterns. There are a huge number of data mining or machine learning tasks conducted on relational databases. However, it is worth noting that there are limited machine learning models specifically designed for relational databases, as most models are primarily tailored for single table settings. Consequently, the prevalent approach for training machine learning models on data stored in relational databases involves performing feature engineering to merge the data from multiple tables into a single table and subsequently applying single table models. This approach not only requires significant effort in feature engineering but also destroys the inherent relational structure present in the data. To address these challenges, we propose a novel framework called Graph-based Feature Synthesis (GFS). GFS formulates the relational database as a heterogeneous graph, thereby preserving the relational structure within the data. By leveraging the inductive bias from single table models, GFS effectively captures the intricate relationships inherent in each table. Additionally, the whole framework eliminates the need for manual feature engineering. In extensive experiments over four real-world multi-table relational databases, GFS outperforms previous methods designed for relational databases, demonstrating its superior performance.
摘要
关系数据库在现代信息系统中应用广泛,其中总是蕴含着有价值的数据模式。大量数据挖掘或机器学习任务都是在关系数据库上进行的。然而,值得注意的是,专门为关系数据库设计的机器学习模型十分有限,大多数模型主要面向单表场景。因此,在关系数据库上训练机器学习模型的通行做法是通过特征工程将多个表的数据合并为单个表,再应用单表模型。这种做法不仅需要大量的特征工程工作,还会破坏数据中固有的关系结构。为解决这些挑战,我们提出了一种名为图特征合成(GFS)的新框架。GFS将关系数据库表述为异质图,从而保留数据中的关系结构。借助单表模型的归纳偏置,GFS能够有效捕捉每个表中蕴含的复杂关系。此外,整个框架无需手动特征工程。在四个真实的多表关系数据库上进行的大量实验中,GFS的表现优于以往为关系数据库设计的方法,证明了其优越性。
Stochastic Optimal Control Matching
methods: 这篇论文提出了随机最优控制匹配(Stochastic Optimal Control Matching,SOCM),一种用于随机最优控制的迭代扩散优化(IDO)技术,其思想与扩散模型的条件得分匹配损失一脉相承:通过最小二乘问题拟合一个匹配向量场来学习控制。
results: 实验结果显示,该论文的算法在四个不同的控制设定下均取得了比现有所有IDO技术更低的误差。Abstract
Stochastic optimal control, which has the goal of driving the behavior of noisy systems, is broadly applicable in science, engineering and artificial intelligence. Our work introduces Stochastic Optimal Control Matching (SOCM), a novel Iterative Diffusion Optimization (IDO) technique for stochastic optimal control that stems from the same philosophy as the conditional score matching loss for diffusion models. That is, the control is learned via a least squares problem by trying to fit a matching vector field. The training loss, which is closely connected to the cross-entropy loss, is optimized with respect to both the control function and a family of reparameterization matrices which appear in the matching vector field. The optimization with respect to the reparameterization matrices aims at minimizing the variance of the matching vector field. Experimentally, our algorithm achieves lower error than all the existing IDO techniques for stochastic optimal control for four different control settings. The key idea underlying SOCM is the path-wise reparameterization trick, a novel technique that is of independent interest, e.g., for generative modeling.
Optimal Data Generation in Multi-Dimensional Parameter Spaces, using Bayesian Optimization
methods: 我们提出了一种新方法,利用高斯过程回归(GPR)来模拟输入与输出参数之间的底层关系,从而构建最小但信息量极高的训练数据库。基于一组已知数据,GPR对未知数据给出预测均值和标准差;利用GPR预测的标准差,通过贝叶斯优化选择数据点,从而获得用于训练高精度ML模型的高效数据库。
results: 我们的结果表明,在使用贝叶斯优化方法选择数据点构建的数据库上训练的ML模型,能以显著更少的数据点达到高精度,明显优于在传统方法构建的数据库上训练的模型。该方法有助于在高维复杂参数空间中以节省资源的方式收集数据,实现高精度的机器学习预测。Abstract
Acquiring a substantial number of data points for training accurate machine learning (ML) models is a big challenge in scientific fields where data collection is resource-intensive. Here, we propose a novel approach for constructing a minimal yet highly informative database for training ML models in complex multi-dimensional parameter spaces. To achieve this, we mimic the underlying relation between the output and input parameters using Gaussian process regression (GPR). Using a set of known data, GPR provides predictive means and standard deviation for the unknown data. Given the predicted standard deviation by GPR, we select data points using Bayesian optimization to obtain an efficient database for training ML models. We compare the performance of ML models trained on databases obtained through this method, with databases obtained using traditional approaches. Our results demonstrate that the ML models trained on the database obtained using Bayesian optimization approach consistently outperform the other two databases, achieving high accuracy with a significantly smaller number of data points. Our work contributes to the resource-efficient collection of data in high-dimensional complex parameter spaces, to achieve high precision machine learning predictions.
摘要
在数据收集耗费大量资源的科学领域,获取足够数量的数据点以训练高精度机器学习(ML)模型是一大挑战。在此,我们提出了一种新方法,用于在复杂的多维参数空间中构建最小但信息量极高的数据库,以训练ML模型。为此,我们使用高斯过程回归(GPR)来模拟输出与输入参数之间的底层关系。基于一组已知数据,GPR对未知数据给出预测均值和标准差。根据预测的标准差,我们使用贝叶斯优化来选择数据点,从而获得用于训练ML模型的高效数据库。我们将在该方法构建的数据库上训练的ML模型与在传统方法构建的数据库上训练的模型进行了比较。结果表明,在贝叶斯优化方法构建的数据库上训练的ML模型始终优于其他两种数据库,能以显著更少的数据点达到高精度。我们的工作有助于在高维复杂参数空间中以节省资源的方式收集数据,以实现高精度的机器学习预测。
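A compact scikit-learn sketch of the acquisition loop described above: fit a GP on the points measured so far, then query the candidate with the largest predictive standard deviation. The pure-exploration acquisition rule and the toy objective are illustrative assumptions; the paper's exact acquisition function may differ.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def expensive_simulation(x):             # stand-in for the real data source
    return np.sin(3 * x[..., 0]) * np.cos(2 * x[..., 1])

rng = np.random.default_rng(0)
candidates = rng.uniform(-1, 1, size=(2000, 2))  # pool in parameter space
X = candidates[:5].copy()                        # small initial design
y = expensive_simulation(X)

gpr = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(1e-4), normalize_y=True)
for _ in range(40):                              # data-acquisition budget
    gpr.fit(X, y)
    _, std = gpr.predict(candidates, return_std=True)
    i = int(np.argmax(std))                      # most uncertain candidate
    X = np.vstack([X, candidates[i]])
    y = np.append(y, expensive_simulation(candidates[i]))

print("database size:", len(X), "points")
```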
paper_authors: Mohammad Ali Vahedifar, Azim Akhtarshenas, Mariam Sabbaghian, Mohammad Rafatpanah
for: 提高 K-Nearest Neighbors(KNN)算法的表现
methods: 利用互信息(Mutual Information,MI)增强权重的重要性,并借鉴合作博弈论中的 Shapley 值细化值的分配
results: 在 12 个广泛使用的数据集上,与 7 种当代 KNN 变体和传统 KNN 进行了实验比较,并通过准确率、精确率和召回率来评估方法的表现。研究发现 IMKNN 在不同的数据集和评价标准上均具有显著优势,能够在多种分类任务中提升 KNN 算法的表现。Abstract
In this research paper, we introduce a novel classification method aimed at improving the performance of the K-Nearest Neighbors (KNN) algorithm. Our approach leverages Mutual Information (MI) to enhance the significance of weights and draws inspiration from Shapley values, a concept originating from cooperative game theory, to refine value allocation. The fundamental concept underlying KNN is the classification of samples based on a majority vote among their k nearest neighbors. While both the distances and labels of these neighbors are crucial, traditional KNN assigns equal weight to all samples, without considering the varying importance of each neighbor based on their distances and labels. In the proposed method, known as Information-Modified KNN (IMKNN), we address this issue by introducing a straightforward algorithm. To evaluate the effectiveness of our approach, it is compared with 7 contemporary variants of KNN, as well as the traditional KNN. Each of these variants exhibits its unique advantages and limitations. We conduct experiments on 12 widely-used datasets, assessing the methods' performance in terms of accuracy, precision and recall. Our study demonstrates that IMKNN consistently outperforms other methods across different datasets and criteria, highlighting its superior performance in various classification tasks. These findings underscore the potential of IMKNN as a valuable tool for enhancing the capabilities of the KNN algorithm in diverse applications.
摘要
在这篇研究论文中,我们介绍了一种新的分类方法,旨在提升K最近邻(KNN)算法的性能。我们的方法利用互信息(MI)来增强权重的重要性,并借鉴合作博弈论中的 Shapley 值来细化值的分配。KNN算法的基本思想是根据样本的 k 个最近邻进行多数投票分类。尽管这些邻居的距离和标签都至关重要,传统的KNN方法却对所有样本赋予相同的权重,而不考虑各邻居因距离和标签不同而具有的不同重要性。在我们提出的方法——信息修正KNN(IMKNN)中,我们通过一个简洁的算法解决了这一问题。我们在12个广泛使用的数据集上进行了实验,将其与7种当代KNN变体以及传统KNN进行比较,并以准确率、精确率和召回率评估性能。我们的研究表明,IMKNN在不同的数据集和评价标准上都表现出色,在多种分类任务中展现了优越的性能。这些发现表明IMKNN在多种应用中具有潜在的价值。
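The abstract does not spell out IMKNN's exact weighting or its Shapley-based allocation, so the sketch below only illustrates the general ingredients it names on top of plain KNN: mutual information to weight feature dimensions, plus inverse-distance-weighted neighbor votes. All specific choices here are assumptions made for illustration, not the paper's algorithm.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.neighbors import NearestNeighbors

def mi_weighted_knn_predict(X_train, y_train, X_test, k=5):
    """X_train, X_test: 2-D numpy arrays; y_train: 1-D numpy array of labels."""
    # Weight each feature by its mutual information with the labels ...
    w = mutual_info_classif(X_train, y_train, random_state=0)
    w = w / (w.sum() + 1e-12)
    # ... then run KNN in the rescaled space, weighting votes by 1/distance.
    nn = NearestNeighbors(n_neighbors=k).fit(X_train * w)
    dist, idx = nn.kneighbors(X_test * w)
    votes = 1.0 / (dist + 1e-12)
    preds = []
    for neigh_labels, neigh_votes in zip(y_train[idx], votes):
        classes = np.unique(neigh_labels)
        scores = [neigh_votes[neigh_labels == c].sum() for c in classes]
        preds.append(classes[int(np.argmax(scores))])
    return np.array(preds)
```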
Towards early diagnosis of Alzheimer’s disease: Advances in immune-related blood biomarkers and computational modeling approaches
results: 研究发现,血液中与免疫系统相关的生物标志物有望成为阿尔茨海默病的早期诊断工具,而机器学习算法和机理建模方法可以帮助发现这些生物标志物。Abstract
Alzheimer's disease has an increasing prevalence in the population world-wide, yet current diagnostic methods based on recommended biomarkers are only available in specialized clinics. Due to these circumstances, Alzheimer's disease is usually diagnosed late, which contrasts with the currently available treatment options that are only effective for patients at an early stage. Blood-based biomarkers could fill in the gap of easily accessible and low-cost methods for early diagnosis of the disease. In particular, immune-based blood-biomarkers might be a promising option, given the recently discovered cross-talk of immune cells of the central nervous system with those in the peripheral immune system. With the help of machine learning algorithms and mechanistic modeling approaches, such as agent-based modeling, an in-depth analysis of the simulation of cell dynamics is possible as well as of high-dimensional omics resources indicative of pathway signaling changes. Here, we give a background on advances in research on brain-immune system cross-talk in Alzheimer's disease and review recent machine learning and mechanistic modeling approaches which leverage modern omics technologies for blood-based immune system-related biomarker discovery.
摘要
阿尔茨海默病在全球人口中的患病率不断上升,然而目前基于推荐生物标志物的诊断方法仅在专业医疗机构可用。在这种情况下,阿尔茨海默病通常确诊较晚,而现有的治疗手段仅对早期患者有效。血液生物标志物有望填补易获取、低成本的疾病早期诊断方法的空白。特别是,鉴于最近发现的中枢神经系统免疫细胞与外周免疫系统免疫细胞之间的相互作用,基于免疫的血液生物标志物可能是一个有前景的选择。借助机器学习算法和机理建模方法(如基于代理的建模),可以对细胞动力学的模拟进行深入分析,也可以分析指示信号通路变化的高维组学资源。本文介绍了阿尔茨海默病中脑-免疫系统相互作用研究的背景进展,并综述了近期利用现代组学技术发现血液免疫系统相关生物标志物的机器学习和机理建模方法。
Maximising Quantum-Computing Expressive Power through Randomised Circuits
results: 论文通过数值实验表明,随机电路方法生成的变分波函数可以以任意精度逼近量子态,并明确建立了变分量子本征求解器的表达能力、时间成本与门数之间的关系。这些结果表明随机电路方法在量子计算中具有可观的潜力。Abstract
In the noisy intermediate-scale quantum era, variational quantum algorithms (VQAs) have emerged as a promising avenue to obtain quantum advantage. However, the success of VQAs depends on the expressive power of parameterised quantum circuits, which is constrained by the limited gate number and the presence of barren plateaus. In this work, we propose and numerically demonstrate a novel approach for VQAs, utilizing randomised quantum circuits to generate the variational wavefunction. We parameterize the distribution function of these random circuits using artificial neural networks and optimize it to find the solution. This random-circuit approach presents a trade-off between the expressive power of the variational wavefunction and time cost, in terms of the sampling cost of quantum circuits. Given a fixed gate number, we can systematically increase the expressive power by extending the quantum-computing time. With a sufficiently large permissible time cost, the variational wavefunction can approximate any quantum state with arbitrary accuracy. Furthermore, we establish explicit relationships between expressive power, time cost, and gate number for variational quantum eigensolvers. These results highlight the promising potential of the random-circuit approach in achieving a high expressive power in quantum computing.
摘要
在含噪中等规模量子时代,变分量子算法(VQA)已成为获得量子优势的一条有前景的途径。然而,VQA的成功取决于参数化量子电路的表达能力,而后者受限于有限的门数以及贫瘠高原(barren plateaus)的存在。在这项工作中,我们提出并数值演示了一种新的VQA方法,利用随机量子电路来生成变分波函数。我们使用人工神经网络来参数化这些随机电路的分布函数,并通过优化该分布来寻找解。这种随机电路方法在变分波函数的表达能力与时间成本(即量子电路的采样成本)之间提供了一种权衡:在固定门数下,可以通过延长量子计算时间来系统地提高表达能力;当允许的时间成本足够大时,变分波函数可以以任意精度逼近任何量子态。此外,我们明确建立了变分量子本征求解器的表达能力、时间成本与门数之间的关系。这些结果凸显了随机电路方法在量子计算中实现高表达能力的可观潜力。
Intrusion Detection System with Machine Learning and Multiple Datasets
For: The paper aims to enhance the performance of an intrusion detection system (IDS) using machine learning (ML) and hyperparameter tuning to combat attacks by unethical hackers.
Methods: The paper explores the use of multiple datasets and machine learning models, including XGBoost and random forest classifiers, to improve the accuracy and efficacy of the IDS. The paper also employs the RandomizedSearchCV hyperparameter technique to optimize the performance of the models.
Results: The proposed multi-dataset integration method achieved an accuracy score of 99.9% when equipped with XGBoost and random forest classifiers and the RandomizedSearchCV hyperparameter technique.Abstract
As Artificial Intelligence (AI) technologies continue to gain traction in the modern-day world, they ultimately pose an immediate threat to current cybersecurity systems via exploitative methods. Prompt engineering is a relatively new field that explores various prompt designs that can hijack large language models (LLMs). If used by an unethical attacker, it can enable an AI system to offer malicious insights and code to them. In this paper, an enhanced intrusion detection system (IDS) that utilizes machine learning (ML) and hyperparameter tuning is explored, which can improve a model's performance in terms of accuracy and efficacy. Ultimately, this improved system can be used to combat the attacks made by unethical hackers. A standard IDS is solely configured with pre-configured rules and patterns; however, with the utilization of machine learning, implicit and different patterns can be generated through the models' hyperparameter settings and parameters. In addition, the IDS will be equipped with multiple datasets so that the accuracy of the models improves. We evaluate the performance of multiple ML models and their respective hyperparameter settings through various metrics to compare their results to other models and past research work. The results of the proposed multi-dataset integration method yielded an accuracy score of 99.9% when equipped with the XGBoost and random forest classifiers and RandomizedSearchCV hyperparameter technique.
摘要
随着人工智能(AI)技术在现代世界不断普及,它们也通过各种利用手段对当前的网络安全系统构成直接威胁。提示工程(prompt engineering)是一个相对较新的领域,研究各种可以劫持大型语言模型(LLM)的提示设计。如果被不道德的攻击者利用,它可能使AI系统向其提供恶意的洞察与代码。在这篇论文中,我们探讨了一种利用机器学习(ML)和超参数调优的增强型入侵检测系统(IDS),以提高模型的准确率和有效性。最终,这一改进的系统可用于对抗不道德黑客发起的攻击。标准IDS仅依赖预先配置的规则和模式;而借助机器学习,可以通过模型的超参数设置和参数生成隐式的、多样的模式。此外,该IDS还配备了多个数据集,以提升模型的准确率。我们通过多种指标评估了多种ML模型及其相应的超参数设置,并将结果与其他模型及既往研究进行比较。结果表明,所提出的多数据集集成方法在搭配 XGBoost 和随机森林分类器以及 RandomizedSearchCV 超参数技术时,可达到 99.9% 的准确率。
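A minimal scikit-learn sketch of the tuning setup described above, shown with a random forest (the paper tunes XGBoost the same way). The synthetic feature matrix stands in for the merged IDS datasets, and the parameter search space is an illustrative assumption.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# X: flow/packet features from the merged IDS datasets; y: attack vs benign.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 20))
y = rng.integers(0, 2, 5000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

param_dist = {                      # illustrative search space
    "n_estimators": randint(100, 500),
    "max_depth": randint(3, 30),
    "min_samples_split": randint(2, 10),
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=20, cv=3, scoring="accuracy", random_state=0, n_jobs=-1,
)
search.fit(X_tr, y_tr)
print("best params:", search.best_params_)
print("test accuracy:", search.score(X_te, y_te))
```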
Analysis and mining of low-carbon and energy-saving tourism data characteristics based on machine learning algorithm
For: This paper aims to study the formation mechanism of residents' low-carbon awareness and provide an important basis for traffic managers to guide urban residents to choose low-carbon travel modes.
Methods: The paper uses data mining technology to analyze the data of low-carbon travel questionnaires, and applies machine learning algorithms such as K-means clustering and random forest to explore the mechanism of residents' low-carbon travel willingness.
Results: The paper finds that residents' low-carbon travel willingness can be divided into three categories based on their social attribute characteristics, travel characteristics, and other factors. The four most significant factors affecting residents' low-carbon travel willingness are occupation, residence, family composition, and commuting time.Abstract
In order to study the formation mechanism of residents' low-carbon awareness and provide an important basis for traffic managers to guide urban residents to choose low-carbon travel modes, this paper proposes a machine-learning-based approach to feature analysis and mining of low-carbon, energy-saving travel data. This paper uses data mining technology to analyze the data of low-carbon travel questionnaires, and regards the 15-dimensional problem under the framework of planned behavior theory as the internal cause variable that characterizes residents' low-carbon travel willingness. The authors use the K-means clustering algorithm to classify the intensity of residents' low-carbon travel willingness, and apply the results as explanatory variables in a random forest model to explore the mechanism by which residents' social attributes, travel characteristics, etc. shape their low-carbon travel willingness. The experimental results show that, based on the Silhouette index test and t-SNE dimensionality reduction, residents' low-carbon travel willingness can be divided into three categories: strong, neutral, and not strong; based on the importance index, the four most significant factors are the occupation, residence, family composition, and commuting time of residents. Conclusion: this method provides policy recommendations for low-carbon urban traffic development and management from multiple perspectives.
摘要
为研究居民低碳意识的形成机制,并为交通管理者引导城市居民选择低碳出行方式提供重要依据,本文提出了一种基于机器学习算法的低碳节能出行数据特征分析与挖掘方法。本文利用数据挖掘技术分析低碳出行问卷数据,并将计划行为理论框架下的15维问题视为刻画居民低碳出行意愿的内因变量。作者使用K-means聚类算法对居民低碳出行意愿的强度进行分类,并将结果作为解释变量应用于随机森林模型,探索居民社会属性特征、出行特征等对其低碳出行意愿的影响机制。实验结果表明,基于轮廓系数(Silhouette)检验和t-SNE降维,居民低碳出行意愿可分为强、中性、不强三类;根据重要性指标,最显著的四个因素是居民的职业、居住地、家庭构成和通勤时间。结论:该方法从多个视角为城市交通低碳发展与管理提供了政策建议。
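A sketch of the two-stage pipeline described above, with synthetic stand-ins for the questionnaire data: K-means clusters the 15 planned-behavior items into willingness levels (with the silhouette score supporting the choice of K), and a random forest then ranks socio-demographic attributes by importance. Feature names and sizes are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
intent = rng.normal(size=(800, 15))   # 15 planned-behavior questionnaire items
social = rng.normal(size=(800, 6))    # occupation, residence, family, commute, ...
names = ["occupation", "residence", "family", "commute", "age", "income"]

# Stage 1: cluster willingness intensity; silhouette supports the choice of K.
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(intent)
    print(f"K={k}  silhouette={silhouette_score(intent, labels):.3f}")
willing = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(intent)

# Stage 2: explain cluster membership from social/travel attributes.
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(social, willing)
for name, imp in sorted(zip(names, rf.feature_importances_), key=lambda p: -p[1]):
    print(f"{name:<12} importance={imp:.3f}")
```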
Unlocking optimal batch size schedules using continuous-time control and perturbation theory
results: 这篇论文为一大类扩散系数推导出了连续时间最优批量大小调度,并将结果应用于线性回归设置。Abstract
Stochastic Gradient Descent (SGD) and its variants are almost universally used to train neural networks and to fit a variety of other parametric models. An important hyperparameter in this context is the batch size, which determines how many samples are processed before an update of the parameters occurs. Previous studies have demonstrated the benefits of using variable batch sizes. In this work, we will theoretically derive optimal batch size schedules for SGD and similar algorithms, up to an error that is quadratic in the learning rate. To achieve this, we approximate the discrete process of parameter updates using a family of stochastic differential equations indexed by the learning rate. To better handle the state-dependent diffusion coefficient, we further expand the solution of this family into a series with respect to the learning rate. Using this setup, we derive a continuous-time optimal batch size schedule for a large family of diffusion coefficients and then apply the results in the setting of linear regression.
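For reference, a standard diffusion approximation of SGD of the kind the abstract alludes to (written in a common form from the literature, not necessarily the paper's notation): with learning rate $\eta$, batch size schedule $B(t)$, and per-sample gradient covariance $\Sigma(\theta)$,

```latex
d\theta_t = -\nabla L(\theta_t)\,dt + \sqrt{\frac{\eta}{B(t)}}\,\Sigma(\theta_t)^{1/2}\,dW_t .
```

The batch size thus enters only through the noise amplitude, which is what makes a continuous-time optimal schedule $B(t)$ amenable to control-theoretic and perturbative analysis.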
Non-Intrusive Load Monitoring for Feeder-Level EV Charging Detection: Sliding Window-based Approaches to Offline and Online Detection
paper_authors: Cameron Martin, Fucai Ke, Hao Wang
for: 这篇论文的目的是实现配电网络上电动车(EV)充电的有效管理,助力能源和交通领域的碳减排。
methods: 这篇论文利用先进的计量基础设施从配电网络收集高分辨率负载数据,并使用非侵入式负载监测(NILM)技术来检测EV充电。
results: 这篇论文获得了高精度的EV充电检测结果,在馈线层面的离线检测中取得98.88%的F分数,在线检测中取得93.01%的F分数。Abstract
Understanding electric vehicle (EV) charging on the distribution network is key to effective EV charging management and aiding decarbonization across the energy and transport sectors. Advanced metering infrastructure has allowed distribution system operators and utility companies to collect high-resolution load data from their networks. These advancements enable the non-intrusive load monitoring (NILM) technique to detect EV charging using load measurement data. While existing studies primarily focused on NILM for EV charging detection in individual households, there is a research gap on EV charging detection at the feeder level, presenting unique challenges due to the combined load measurement from multiple households. In this paper, we develop a novel and effective approach for EV detection at the feeder level, involving sliding-window feature extraction and classical machine learning techniques, specifically models like XGBoost and Random Forest. Our developed method offers a lightweight and efficient solution, capable of quick training. Moreover, our developed method is versatile, supporting both offline and online EV charging detection. Our experimental results demonstrate high-accuracy EV charging detection at the feeder level, achieving an F-Score of 98.88% in offline detection and 93.01% in online detection.
摘要
理解配电网络上的电动车(EV)充电行为,是有效管理EV充电、助力能源与交通领域碳减排的关键。先进计量基础设施使配电系统运营商和电力公司能够从其网络中收集高分辨率的负载数据。这些进展使得非侵入式负载监测(NILM)技术可以利用负载测量数据检测EV充电。现有研究主要集中于单个家庭层面的NILM EV充电检测,而馈线层面的EV充电检测仍是研究空白——由于多个家庭的负载测量混叠在一起,这带来了独特的挑战。在本文中,我们提出了一种新颖且有效的馈线层面EV检测方法,包括滑动窗口特征提取和经典机器学习技术,具体使用XGBoost和随机森林等模型。我们的方法轻量高效,训练快速;同时具有通用性,支持离线和在线EV充电检测。实验结果表明,该方法在馈线层面实现了高精度的EV充电检测,离线检测的F分数达98.88%,在线检测达93.01%。
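A sketch of the feature-extraction stage described above: slide a fixed-length window over the feeder-level load signal, compute simple summary statistics per window, and train a classifier on them (random forest shown; the paper also uses XGBoost). The window length, the statistics, and the synthetic charging sessions are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def window_features(load, win=60, step=10):
    """Simple summary statistics over sliding windows of a load series."""
    feats, starts = [], list(range(0, len(load) - win + 1, step))
    for s in starts:
        w = load[s:s + win]
        d = np.diff(w)
        feats.append([w.mean(), w.std(), w.max() - w.min(),
                      d.max(), d.min(), np.abs(d).sum()])
    return np.asarray(feats), np.asarray(starts)

# Synthetic feeder load: base consumption plus a few ~7 kW charging sessions.
rng = np.random.default_rng(0)
n = 10_000
base = 50 + rng.normal(0, 2, n)
ev = np.zeros(n)
for s in rng.integers(0, n - 240, 40):
    ev[s:s + 240] += 7.0

X, starts = window_features(base + ev)
y = np.array([ev[s:s + 60].max() > 0 for s in starts])  # EV active in window?
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print("training accuracy:", clf.score(X, y))
```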
HGPROMPT: Bridging Homogeneous and Heterogeneous Graphs for Few-shot Prompt Learning
methods: 这篇论文提出了一个名为HGPROMPT的新型预训练与提示框架。该框架采用双模板(dual-template)设计:其一,通过双模板统一同质图与异质图;其二,提出双提示(dual-prompt),帮助下游任务定位最相关的先验,以弥合特征差异以及任务间异质性差异所造成的差距。
results: 作者在三个公开数据集上通过大量实验对HGPROMPT进行了全面评估与分析,取得了可观的结果。具体而言,HGPROMPT在全部三个数据集上均优于最先进的基线方法,并展示了其弥合预训练与下游任务之间、同质图与异质图之间差距的能力。Abstract
Graph neural networks (GNNs) and heterogeneous graph neural networks (HGNNs) are prominent techniques for homogeneous and heterogeneous graph representation learning, yet their performance in an end-to-end supervised framework greatly depends on the availability of task-specific supervision. To reduce the labeling cost, pre-training on self-supervised pretext tasks has become a popular paradigm, but there is often a gap between the pre-trained model and downstream tasks, stemming from the divergence in their objectives. To bridge the gap, prompt learning has risen as a promising direction especially in few-shot settings, without the need to fully fine-tune the pre-trained model. While there has been some early exploration of prompt-based learning on graphs, these efforts primarily deal with homogeneous graphs, ignoring the heterogeneous graphs that are prevalent in downstream applications. In this paper, we propose HGPROMPT, a novel pre-training and prompting framework to unify not only pre-training and downstream tasks but also homogeneous and heterogeneous graphs via a dual-template design. Moreover, we propose dual-prompt in HGPROMPT to assist a downstream task in locating the most relevant prior to bridge the gaps caused by not only feature variations but also heterogeneity differences across tasks. Finally, we thoroughly evaluate and analyze HGPROMPT through extensive experiments on three public datasets.
摘要
图神经网络(GNN)和异质图神经网络(HGNN)分别是同质图与异质图表示学习的代表性技术,但它们在端到端监督框架下的性能很大程度上依赖于任务特定的监督信号。为降低标注成本,在自监督前置任务上进行预训练已成为一种流行的范式,但预训练模型与下游任务的目标存在差异,往往导致二者之间的差距。为弥合这一差距,提示学习(prompt learning)成为一个有前景的方向,尤其是在少样本场景下,无需对预训练模型进行完整微调。然而,现有基于提示的图学习研究主要针对同质图,忽视了下游应用中广泛存在的异质图。在这篇论文中,我们提出了HGPROMPT,一种新颖的预训练与提示框架,通过双模板设计不仅统一了预训练与下游任务,也统一了同质图与异质图。此外,我们在HGPROMPT中提出双提示,帮助下游任务定位最相关的先验,以弥合特征差异以及任务间异质性差异所造成的差距。最后,我们在三个公开数据集上进行了大量实验,对HGPROMPT进行了全面的评估与分析。
FlowHON: Representing Flow Fields Using Higher-Order Networks
results: 本文通过一系列下游任务验证了FlowHON的有效性,包括粒子追踪过程中的密度估计、面向数据管理的流场划分,以及基于网络节点-连线图表示的流场理解。Abstract
Flow fields are often partitioned into data blocks for massively parallel computation and analysis based on blockwise relationships. However, most of the previous techniques only consider the first-order dependencies among blocks, which is insufficient in describing complex flow patterns. In this work, we present FlowHON, an approach to construct higher-order networks (HONs) from flow fields. FlowHON captures the inherent higher-order dependencies in flow fields as nodes and estimates the transitions among them as edges. We formulate the HON construction as an optimization problem with three linear transformations. The first two layers correspond to the node generation and the third one corresponds to edge estimation. Our formulation allows the node generation and edge estimation to be solved in a unified framework. With FlowHON, the rich set of traditional graph algorithms can be applied without any modification to analyze flow fields, while leveraging the higher-order information to understand the inherent structure and manage flow data for efficiency. We demonstrate the effectiveness of FlowHON using a series of downstream tasks, including estimating the density of particles during tracing, partitioning flow fields for data management, and understanding flow fields using the node-link diagram representation of networks.
摘要
流场常被划分为数据块,以便基于块间关系进行大规模并行计算与分析。然而,以往的大多数技术只考虑块之间的一阶依赖关系,不足以描述复杂的流动模式。在这项工作中,我们提出了FlowHON,一种从流场构建高阶网络(HON)的方法。FlowHON将流场中固有的高阶依赖关系表示为节点,并将它们之间的转移估计为边。我们将HON的构建表述为一个包含三个线性变换的优化问题:前两层对应节点生成,第三层对应边的估计。这一表述使节点生成与边估计可以在统一的框架中求解。借助FlowHON,丰富的传统图算法无需任何修改即可用于分析流场,同时利用高阶信息理解流场的内在结构并高效管理流数据。我们通过一系列下游任务验证了FlowHON的有效性,包括粒子追踪过程中的密度估计、面向数据管理的流场划分,以及基于网络节点-连线图表示的流场理解。
Class Symbolic Regression
results: 该研究将所提出的新方法应用于一组合成的玩具数据集,并成功地从一组近似恒星流的模拟轨道中提取出解析的星系引力势,展示了其在天体物理中的实用价值。Abstract
We introduce "Class Symbolic Regression", a first framework for automatically finding a single analytical functional form that accurately fits multiple datasets - each governed by its own (possibly) unique set of fitting parameters. This hierarchical framework leverages the common constraint that all the members of a single class of physical phenomena follow a common governing law. Our approach extends the capabilities of our earlier Physical Symbolic Optimization ($\Phi$-SO) framework for Symbolic Regression, which integrates dimensional analysis constraints and deep reinforcement learning for symbolic analytical function discovery from data. We demonstrate the efficacy of this novel approach by applying it to a panel of synthetic toy case datasets and showcase its practical utility for astrophysics by successfully extracting an analytic galaxy potential from a set of simulated orbits approximating stellar streams.
摘要
我们介绍了"类符号回归"(Class Symbolic Regression),这是首个能够自动找出单一解析函数形式、同时准确拟合多个数据集的框架——其中每个数据集都由自己(可能)独有的一组拟合参数支配。这一层次化框架利用了同一类物理现象的所有成员都遵循共同支配规律这一约束。我们的方法扩展了此前的物理符号优化($\Phi$-SO)框架的能力,后者结合量纲分析约束与深度强化学习,从数据中发现符号解析函数。我们将该方法应用于一组合成的玩具数据集以展示其有效性,并成功地从一组近似恒星流的模拟轨道中提取出解析的星系引力势,证明了其在天体物理中的实用价值。
Distributed Continual Learning with CoCoA in High-dimensional Linear Regression
results: 研究发现,适当调整网络规模可以显著降低泛化误差,且最优网络规模取决于任务相似度和任务数量。Abstract
We consider estimation under scenarios where the signals of interest exhibit change of characteristics over time. In particular, we consider the continual learning problem where different tasks, e.g., data with different distributions, arrive sequentially and the aim is to perform well on the newly arrived task without performance degradation on the previously seen tasks. In contrast to the continual learning literature focusing on the centralized setting, we investigate the problem from a distributed estimation perspective. We consider the well-established distributed learning algorithm COCOA, which distributes the model parameters and the corresponding features over the network. We provide exact analytical characterization for the generalization error of COCOA under continual learning for linear regression in a range of scenarios, where overparameterization is of particular interest. These analytical results characterize how the generalization error depends on the network structure, the task similarity and the number of tasks, and show how these dependencies are intertwined. In particular, our results show that the generalization error can be significantly reduced by adjusting the network size, where the most favorable network size depends on task similarity and the number of tasks. We present numerical results verifying the theoretical analysis and illustrate the continual learning performance of COCOA with a digit classification task.
摘要
我们考虑感兴趣信号的特征随时间变化的场景下的估计问题。特别地,我们考虑持续学习问题:不同的任务(例如具有不同分布的数据)相继到达,目标是在新到达的任务上表现良好,同时不降低在已见任务上的性能。与聚焦于集中式设置的持续学习文献不同,我们从分布式估计的角度研究该问题。我们考虑成熟的分布式学习算法COCOA,它将模型参数及相应特征分布在网络中。我们对COCOA在线性回归持续学习中的泛化误差给出了精确的解析刻画,涵盖多种场景,其中过参数化情形尤为重要。这些解析结果刻画了泛化误差如何依赖于网络结构、任务相似度和任务数量,并展示了这些依赖关系如何相互交织。特别地,我们的结果表明,通过调整网络规模可以显著降低泛化误差,而最优网络规模取决于任务相似度和任务数量。我们给出了验证理论分析的数值结果,并通过数字分类任务展示了COCOA的持续学习性能。
Wild-Tab: A Benchmark For Out-Of-Distribution Generalization In Tabular Regression
results: 研究发现,许多分布外(OOD)泛化方法在未见数据上难以保持高性能,其OOD性能相比分布内性能显著下降;而简单的经验风险最小化(ERM)方法却能在所有评估中表现稳健,与最先进方法不相上下。Abstract
Out-of-Distribution (OOD) generalization, a cornerstone for building robust machine learning models capable of handling data diverging from the training set's distribution, is an ongoing challenge in deep learning. While significant progress has been observed in computer vision and natural language processing, its exploration in tabular data, ubiquitous in many industrial applications, remains nascent. To bridge this gap, we present Wild-Tab, a large-scale benchmark tailored for OOD generalization in tabular regression tasks. The benchmark incorporates 3 industrial datasets sourced from fields like weather prediction and power consumption estimation, providing a challenging testbed for evaluating OOD performance under real-world conditions. Our extensive experiments, evaluating 10 distinct OOD generalization methods on Wild-Tab, reveal nuanced insights. We observe that many of these methods often struggle to maintain high-performance levels on unseen data, with OOD performance showing a marked drop compared to in-distribution performance. At the same time, Empirical Risk Minimization (ERM), despite its simplicity, delivers robust performance across all evaluations, rivaling the results of state-of-the-art methods. Looking forward, we hope that the release of Wild-Tab will facilitate further research on OOD generalization and aid in the deployment of machine learning models in various real-world contexts where handling distribution shifts is a crucial requirement.
摘要
分布外(OOD)泛化是构建能够处理偏离训练集分布的数据的稳健机器学习模型的基石,也是深度学习中一项持续的挑战。尽管在计算机视觉和自然语言处理领域已取得显著进展,但在许多工业应用中无处不在的表格数据上,其探索仍处于起步阶段。为弥合这一差距,我们提出了Wild-Tab,一个面向表格回归任务OOD泛化的大规模基准。该基准纳入了3个来自天气预测、电力消耗估计等领域的工业数据集,为在真实条件下评估OOD性能提供了具有挑战性的测试平台。我们在Wild-Tab上对10种不同的OOD泛化方法进行了大量实验,得到了细致的洞察。我们观察到,其中许多方法往往难以在未见数据上保持高性能,其OOD性能相比分布内性能显著下降。与此同时,经验风险最小化(ERM)尽管简单,却在所有评估中表现稳健,与最先进方法不相上下。展望未来,我们希望Wild-Tab的发布能够推动OOD泛化的进一步研究,并助力机器学习模型在各种需要应对分布偏移的真实场景中的部署。
EdgeConvFormer: Dynamic Graph CNN and Transformer based Anomaly Detection in Multivariate Time Series
results: 实验结果显示,EdgeConvFormer 能够从多元时间序列数据中学习时空相互关联,并在许多真实世界数据集上达到更好的异常探测性能。Abstract
Transformer-based models for anomaly detection in multivariate time series can benefit from the self-attention mechanism due to its advantage in modeling long-term dependencies. However, Transformer-based anomaly detection models have problems such as requiring a large amount of training data, standard positional encoding being unsuitable for multivariate time series data, and the interdependence between time series not being considered. To address these limitations, we propose a novel anomaly detection method, named EdgeConvFormer, which integrates Time2vec embedding, stacked dynamic graph CNN, and Transformer to extract global and local spatial-time information. This design of EdgeConvFormer empowers it with decomposition capacities for complex time series, progressive spatiotemporal correlation discovery between time series, and representation aggregation of multi-scale features. Experiments demonstrate that EdgeConvFormer can learn the spatial-temporal correlations from multivariate time series data and achieve better anomaly detection performance than the state-of-the-art approaches on many real-world datasets of different scales.
摘要
基于Transformer的多元时间序列异常检测模型可以得益于自注意力机制在建模长期依赖方面的优势。然而,这类模型存在一些问题,例如训练需要大量数据、标准的位置编码不适合多元时间序列数据、以及未考虑时间序列之间的相互依赖。为解决这些局限,我们提出了一种新的异常检测方法EdgeConvFormer,它将Time2vec嵌入、堆叠的动态图CNN和Transformer相结合,以提取全局与局部的时空信息。这一设计赋予EdgeConvFormer分解复杂时间序列、渐进地发现时间序列之间的时空相关性、以及聚合多尺度特征表示的能力。实验表明,EdgeConvFormer能够从多元时间序列数据中学习时空相关性,并在多个不同规模的真实数据集上取得优于最先进方法的异常检测性能。
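For reference, the Time2vec embedding named above maps a scalar timestamp to one linear and several periodic components (Kazemi et al., 2019); in the model the frequencies and phases are learned, whereas this sketch initializes them randomly.

```python
import numpy as np

def time2vec(tau, omega, phi):
    """t2v(tau)[0] = omega[0]*tau + phi[0];  t2v(tau)[i] = sin(omega[i]*tau + phi[i])."""
    v = omega * tau + phi
    v[1:] = np.sin(v[1:])
    return v

rng = np.random.default_rng(0)
d = 8                                   # embedding dimension
omega, phi = rng.normal(size=d), rng.normal(size=d)
print(time2vec(3.7, omega, phi))        # one linear trend + 7 periodic features
```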
ImputeFormer: Graph Transformers for Generalizable Spatiotemporal Imputation
paper_authors: Tong Nie, Guoyang Qin, Yuewen Mei, Jian Sun
for: This paper addresses the problem of multivariate time series imputation using deep neural architectures.
methods: The paper proposes a novel imputation model that leverages low-rank imputation methods and incorporates three key knowledge-driven enhancements: projected temporal attention, global adaptive graph convolution, and Fourier imputation loss.
results: The proposed model demonstrates superiority in terms of accuracy, efficiency, and flexibility on heterogeneous datasets, and provides strong empirical evidence that incorporating time series primitives can facilitate the development of a generalizable imputation model for a wide range of spatiotemporal imputation problems.Abstract
This paper focuses on the multivariate time series imputation problem using deep neural architectures. The ubiquitous issue of missing data in both scientific and engineering tasks necessitates the development of an effective and general imputation model. Leveraging the wisdom and expertise garnered from low-rank imputation methods, we power the canonical Transformers with three key knowledge-driven enhancements, including projected temporal attention, global adaptive graph convolution, and Fourier imputation loss. These task-agnostic inductive biases exploit the inherent structures of incomplete time series, and thus make our model versatile for a variety of imputation problems. We demonstrate its superiority in terms of accuracy, efficiency, and flexibility on heterogeneous datasets, including traffic speed, traffic volume, solar energy, smart metering, and air quality. Comprehensive case studies are performed to further strengthen the interpretability. Promising empirical results provide strong conviction that incorporating time series primitives, such as low-rank properties, can substantially facilitate the development of a generalizable model to approach a wide range of spatiotemporal imputation problems.
The Self-Loop Paradox: Investigating the Impact of Self-Loops on Graph Neural Networks
results: 研究发现,在某些GNN架构下,与没有自环的图相比,自环图中节点从自身获得的信息反而可能更小。这种现象被称为自环悖论(self-loop paradox),它既取决于GNN的层数 $k$,也取决于 $k$ 是偶数还是奇数。实验在一个合成节点分类任务中验证了这些理论发现,并在23个真实世界图上考察了其实际意义。Abstract
Many Graph Neural Networks (GNNs) add self-loops to a graph to include feature information about a node itself at each layer. However, if the GNN consists of more than one layer, this information can return to its origin via cycles in the graph topology. Intuition suggests that this "backflow" of information should be larger in graphs with self-loops compared to graphs without. In this work, we counter this intuition and show that for certain GNN architectures, the information a node gains from itself can be smaller in graphs with self-loops compared to the same graphs without. We adopt an analytical approach for the study of statistical graph ensembles with a given degree sequence and show that this phenomenon, which we call the self-loop paradox, can depend both on the number of GNN layers $k$ and whether $k$ is even or odd. We experimentally validate our theoretical findings in a synthetic node classification task and investigate its practical relevance in 23 real-world graphs.
摘要
许多图神经网络(GNN)会在图中加入自环,以便在每一层纳入节点自身的特征信息。然而,如果GNN包含多于一层,这些信息也可以经由图拓扑中的环路回到其出发点。直觉上,这种信息"回流"在带自环的图中应当比不带自环的图更大。在这项工作中,我们反驳了这一直觉,并证明对于某些GNN架构,节点从自身获得的信息在带自环的图中反而可能小于同一张不带自环的图。我们采用解析方法研究具有给定度序列的统计图系综,并证明这一被我们称为"自环悖论"的现象,既取决于GNN层数 $k$,也取决于 $k$ 是偶数还是奇数。我们在一个合成节点分类任务中实验验证了理论发现,并在23个真实世界图上考察了其实际意义。
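A small numpy experiment in the spirit of the paper: for a random graph, compare how much of a node's own signal returns to it after $k$ rounds of symmetric-normalized propagation with and without self-loops, read off from the diagonal of the $k$-th matrix power. The graph model and normalization are illustrative choices, not the paper's exact setting.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
A = (rng.random((n, n)) < 0.1).astype(float)
A = np.triu(A, 1)
A = A + A.T                                     # undirected, no self-loops

def sym_norm(adj):
    d = adj.sum(axis=1)
    d[d == 0] = 1.0
    dinv = 1.0 / np.sqrt(d)
    return dinv[:, None] * adj * dinv[None, :]  # D^{-1/2} A D^{-1/2}

P = sym_norm(A)                   # propagation without self-loops
Ps = sym_norm(A + np.eye(n))      # propagation with self-loops (GCN-style)
for k in (1, 2, 3, 4):
    back = np.diag(np.linalg.matrix_power(P, k)).mean()
    back_s = np.diag(np.linalg.matrix_power(Ps, k)).mean()
    print(f"k={k}: mean self-information  no loops {back:.3f} | with loops {back_s:.3f}")
```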
Estimating Coronal Mass Ejection Mass and Kinetic Energy by Fusion of Multiple Deep-learning Models
paper_authors: Khalid A. Alobaid, Yasser Abduallah, Jason T. L. Wang, Haimin Wang, Shen Fan, Jialiang Li, Huseyin Cavus, Vasyl Yurchyshyn
for: 这篇论文的目的是估算日冕物质抛射(Coronal Mass Ejections,CME)的质量和动能。
methods: 该论文使用的方法是一种名为DeepCME的深度学习模型,通过从LASCO C2图像中提取特征来估算CME的质量和动能。
results: 实验结果显示,在估算CME质量和动能时,DeepCME模型的平均相对误差(MRE)分别为0.013和0.009,优于最佳组件模型InceptionResNet和InceptionNet。Abstract
Coronal mass ejections (CMEs) are massive solar eruptions, which have a significant impact on Earth. In this paper, we propose a new method, called DeepCME, to estimate two properties of CMEs, namely, CME mass and kinetic energy. Being able to estimate these properties helps better understand CME dynamics. Our study is based on the CME catalog maintained at the Coordinated Data Analysis Workshops (CDAW) Data Center, which contains all CMEs manually identified since 1996 using the Large Angle and Spectrometric Coronagraph (LASCO) on board the Solar and Heliospheric Observatory (SOHO). We use LASCO C2 data in the period between January 1996 and December 2020 to train, validate and test DeepCME through 10-fold cross validation. The DeepCME method is a fusion of three deep learning models, including ResNet, InceptionNet, and InceptionResNet. Our fusion model extracts features from LASCO C2 images, effectively combining the learning capabilities of the three component models to jointly estimate the mass and kinetic energy of CMEs. Experimental results show that the fusion model yields a mean relative error (MRE) of 0.013 (0.009, respectively) compared to the MRE of 0.019 (0.017, respectively) of the best component model InceptionResNet (InceptionNet, respectively) in estimating the CME mass (kinetic energy, respectively). To our knowledge, this is the first time that deep learning has been used for CME mass and kinetic energy estimations.
摘要
日冕物质抛射(CME)是大规模的太阳爆发,对地球有重要影响。在这篇论文中,我们提出了一种名为DeepCME的新方法,用于估算CME的两项性质:质量和动能。估算这些性质有助于更好地理解CME的动力学。我们的研究基于协调数据分析研讨会(CDAW)数据中心维护的CME目录,该目录包含自1996年以来利用太阳和日球层观测台(SOHO)上的大角度光谱日冕仪(LASCO)人工识别的所有CME。我们使用1996年1月至2020年12月期间的LASCO C2数据,通过10折交叉验证对DeepCME进行训练、验证和测试。DeepCME方法融合了三个深度学习模型:ResNet、InceptionNet和InceptionResNet。我们的融合模型从LASCO C2图像中提取特征,有效结合三个组件模型的学习能力,共同估算CME的质量和动能。实验结果表明,融合模型在估算CME质量(动能)时的平均相对误差(MRE)为0.013(0.009),低于最佳组件模型InceptionResNet(InceptionNet)的0.019(0.017)。据我们所知,这是深度学习首次被用于CME质量和动能估算。
Optimizing Bus Travel: A Novel Approach to Feature Mining with P-KMEANS and P-LDA Algorithms
paper_authors: Hongjie Liu, Haotian Shi, Sicheng Fu, Tengfei Yuan, Xinhuan Zhang, Hongzhe Xu, Bin Ran
for: This paper aims to develop a method for extracting features from public transportation data, specifically for bus travel, to improve the attractiveness, usage, and sustainability of public transportation.
methods: The method uses Point of Interest (POI) data and combines enhanced P-KMEANS and P-LDA algorithms to overcome the limitations of disorganized and unstructured public transportation data. The method segments passenger travel paths into distinct clusters and identifies features such as age, occupation, gender, sports, cost, safety, and personality traits.
results: The method successfully mines the diverse aspects of bus travel and effectively calculates relationships between individual travel behaviors. It assigns explanatory and evaluative probabilities to POI labels, thereby enhancing bus travel optimization.Abstract
Customizing services for bus travel can bolster its attractiveness, optimize usage, alleviate traffic congestion, and diminish carbon emissions. This potential is realized by harnessing recent advancements in positioning communication facilities, the Internet of Things, and artificial intelligence for feature mining in public transportation. However, the inherent complexities of disorganized and unstructured public transportation data introduce substantial challenges to travel feature extraction. This study presents a bus travel feature extraction method rooted in Point of Interest (POI) data, employing enhanced P-KMEANS and P-LDA algorithms to overcome these limitations. While the KMEANS algorithm adeptly segments passenger travel paths into distinct clusters, its outcomes can be influenced by the initial K value. On the other hand, Latent Dirichlet Allocation (LDA) excels at feature identification and probabilistic interpretations yet encounters difficulties with feature intermingling and nuanced sub-feature interactions. Incorporating the POI dimension enhances our understanding of travel behavior, aligning it more closely with passenger attributes and facilitating easier data analysis. By incorporating POI data, our refined P-KMEANS and P-LDA algorithms grant a holistic insight into travel behaviors and attributes, effectively mitigating the limitations above. Consequently, this POI-centric algorithm effectively amalgamates diverse POI attributes, delineates varied travel contexts, and imparts probabilistic metrics to feature properties. Our method successfully mines the diverse aspects of bus travel, such as age, occupation, gender, sports, cost, safety, and personality traits. It effectively calculates relationships between individual travel behaviors and assigns explanatory and evaluative probabilities to POI labels, thereby enhancing bus travel optimization.
摘要
KMEANS算法能将乘客出行路径划分为不同的聚类,但其结果可能受初始K值的影响;另一方面,LDA擅长特征识别与概率解释,却难以处理特征混杂以及细微的子特征交互。通过引入POI数据,我们改进的P-KMEANS与P-LDA算法能够全面理解出行行为与属性,使数据分析更为便捷。引入POI维度增进了我们对出行行为的理解,使其与乘客属性更加契合。我们的方法有效融合了多样的POI属性,刻画了不同的出行情境,并为特征属性赋予概率度量。我们成功挖掘了公交出行的多个方面,如年龄、职业、性别、运动、成本、安全和个性特征;该方法计算个体出行行为之间的关系,并为POI标签赋予解释性与评价性概率,从而助力公交出行优化。
EDALearn: A Comprehensive RTL-to-Signoff EDA Benchmark for Democratized and Reproducible ML for EDA Research
results: 该论文提出了首个面向ML在EDA中研究的整体性开源基准套件。该基准涵盖了现代VLSI设计的复杂性,并可促进ML模型在不同工艺节点间可迁移性的研究。Abstract
The application of Machine Learning (ML) in Electronic Design Automation (EDA) for Very Large-Scale Integration (VLSI) design has garnered significant research attention. Despite the requirement for extensive datasets to build effective ML models, most studies are limited to smaller, internally generated datasets due to the lack of comprehensive public resources. In response, we introduce EDALearn, the first holistic, open-source benchmark suite specifically for ML tasks in EDA. This benchmark suite presents an end-to-end flow from synthesis to physical implementation, enriching data collection across various stages. It fosters reproducibility and promotes research into ML transferability across different technology nodes. Accommodating a wide range of VLSI design instances and sizes, our benchmark aptly represents the complexity of contemporary VLSI designs. Additionally, we provide an in-depth data analysis, enabling users to fully comprehend the attributes and distribution of our data, which is essential for creating efficient ML models. Our contributions aim to encourage further advances in the ML-EDA domain.
摘要
机器学习(ML)在电子设计自动化(EDA)中针对超大规模集成电路(VLSI)设计的应用已引起广泛的研究关注。尽管构建有效的ML模型需要大量数据,但由于缺乏全面的公共资源,大多数研究仅限于规模较小的内部数据集。为解决这一问题,我们推出了EDALearn,首个专为EDA中的ML任务打造的整体性开源基准套件。该基准提供了从综合到物理实现的端到端流程,丰富了各阶段的数据收集;它促进了可复现性,并推动ML模型在不同工艺节点间可迁移性的研究。我们的基准涵盖了广泛的VLSI设计实例与规模,恰当地体现了当代VLSI设计的复杂性。此外,我们还提供了深入的数据分析,使用户能够充分理解数据的特征与分布,这对构建高效的ML模型至关重要。我们的贡献旨在推动ML-EDA领域的进一步发展。
Universal Deoxidation of Semiconductor Substrates Assisted by Machine-Learning and Real-Time-Feedback-Control
methods: 该研究使用机器学习(ML)的卷积与视觉Transformer混合(CNN-ViT)模型,以反射高能电子衍射(RHEED)视频为输入来判断衬底的脱氧状态,从而在受控架构下实现衬底的自动化脱氧。
results: 该研究表明,该ML模型能够准确判断衬底的脱氧状态,且基于单台MBE设备数据训练的模型可以高精度地部署到其他设备上。此外,该方法可以在不同设备和衬底材料之间标准化脱氧温度,从而提升薄膜的品质和可靠性。Abstract
Thin film deposition is an essential step in the semiconductor process. During preparation or loading, the substrate is exposed to the air unavoidably, which has motivated studies of the process control to remove the surface oxide before thin film deposition. Optimizing the deoxidation process in molecular beam epitaxy (MBE) for a random substrate is a multidimensional challenge and sometimes controversial. Due to variations in semiconductor materials and growth processes, the determination of substrate deoxidation temperature is highly dependent on the grower's expertise; the same substrate may yield inconsistent results when evaluated by different growers. Here, we employ a machine learning (ML) hybrid convolution and vision transformer (CNN-ViT) model. This model utilizes reflection high-energy electron diffraction (RHEED) video as input to determine the deoxidation status of the substrate as output, enabling automated substrate deoxidation under a controlled architecture. This also extends to the successful application of deoxidation processes on other substrates. Furthermore, we showcase the potential of models trained on data from a single MBE equipment to achieve high-accuracy deployment on other equipment. In contrast to traditional methods, our approach holds exceptional practical value. It standardizes deoxidation temperatures across various equipment and substrate materials, advancing the standardization research process in semiconductor preparation, a significant milestone in thin film growth technology. The concepts and methods demonstrated in this work are anticipated to revolutionize semiconductor manufacturing in optoelectronics and microelectronics industries by applying them to diverse material growth processes.
摘要
薄膜沉积是半导体工艺中不可或缺的步骤。在制备或装载过程中,衬底不可避免地暴露在空气中,这促使研究者通过过程控制在薄膜沉积前去除表面氧化层。在分子束外延(MBE)中针对任意衬底优化脱氧过程是一个多维度挑战,有时还存在争议。由于半导体材料和生长工艺的差异,衬底脱氧温度的确定高度依赖生长者的经验;同一衬底由不同生长者评估时可能得到不一致的结果。在这种情况下,我们采用了一种卷积与视觉Transformer混合(CNN-ViT)的机器学习(ML)模型。该模型以反射高能电子衍射(RHEED)视频作为输入,输出衬底的脱氧状态,从而在受控架构下实现自动化衬底脱氧。该方法也成功地应用于其他衬底的脱氧过程。此外,我们还展示了在单台MBE设备数据上训练的模型在其他设备上实现高精度部署的潜力。与传统方法相比,我们的方法具有突出的实用价值:它标准化了不同设备和衬底材料的脱氧温度,推动了半导体制备的标准化研究进程,是薄膜生长技术中的一个重要里程碑。本工作所展示的概念和方法应用于多种材料生长过程后,有望变革光电子和微电子行业的半导体制造。
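A minimal PyTorch sketch of such a hybrid CNN-ViT classifier may help make the architecture concrete: a small convolutional stem extracts local features from a RHEED frame, a Transformer encoder models global context, and a linear head emits the deoxidized / not-yet-deoxidized decision. The frame size, channel widths, and depth below are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class CnnVitDeoxClassifier(nn.Module):
    def __init__(self, embed_dim=64, num_heads=4, depth=2):
        super().__init__()
        # CNN stem: 1-channel RHEED frame -> downsampled feature map
        self.stem = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, embed_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.cls = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.head = nn.Linear(embed_dim, 2)   # {not deoxidized, deoxidized}

    def forward(self, x):                     # x: (B, 1, H, W)
        f = self.stem(x)                      # (B, C, H/4, W/4)
        tokens = f.flatten(2).transpose(1, 2) # (B, N, C) patch tokens
        cls = self.cls.expand(x.size(0), -1, -1)
        z = self.encoder(torch.cat([cls, tokens], dim=1))
        return self.head(z[:, 0])             # logits from the class token

logits = CnnVitDeoxClassifier()(torch.randn(4, 1, 64, 64))
print(logits.shape)  # torch.Size([4, 2])
```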
AGD: an Auto-switchable Optimizer using Stepwise Gradient Difference for Preconditioning Matrix
paper_authors: Yun Yue, Zhiling Ye, Jiadi Jiang, Yongchao Liu, Ke Zhang
for: This paper proposes a new approach to designing the preconditioning matrix for adaptive optimizers, which enhances the generalization performance of deep learning models.
methods: The proposed method utilizes the gradient difference between two successive steps as the diagonal elements of the preconditioning matrix, and introduces an auto-switching function that enables the preconditioning matrix to switch dynamically between SGD and the adaptive optimizer.
results: The proposed optimizer, named AGD, outperforms state-of-the-art optimizers on public datasets of NLP, CV, and RecSys, achieving highly competitive or significantly better predictive performance. Additionally, the paper analyzes the effects of the auto-switching function on various scenarios.Abstract
Adaptive optimizers, such as Adam, have achieved remarkable success in deep learning. A key component of these optimizers is the so-called preconditioning matrix, providing enhanced gradient information and regulating the step size of each gradient direction. In this paper, we propose a novel approach to designing the preconditioning matrix by utilizing the gradient difference between two successive steps as the diagonal elements. These diagonal elements are closely related to the Hessian and can be perceived as an approximation of the inner product between the Hessian row vectors and the difference of the adjacent parameter vectors. Additionally, we introduce an auto-switching function that enables the preconditioning matrix to switch dynamically between Stochastic Gradient Descent (SGD) and the adaptive optimizer. Based on these two techniques, we develop a new optimizer named AGD that enhances the generalization performance. We evaluate AGD on public datasets of Natural Language Processing (NLP), Computer Vision (CV), and Recommendation Systems (RecSys). Our experimental results demonstrate that AGD outperforms the state-of-the-art (SOTA) optimizers, achieving highly competitive or significantly better predictive performance. Furthermore, we analyze how AGD is able to switch automatically between SGD and the adaptive optimizer and its actual effects on various scenarios. The code is available at https://github.com/intelligent-machine-learning/dlrover/tree/master/atorch/atorch/optimizers.
摘要
适应优化器(如Adam)在深度学习中取得了非常出色的成功。这类优化器的关键组件之一是所谓的预条件矩阵,它提供了增强的梯度信息,并调节每个梯度方向的步长。在这篇论文中,我们提出一种新的预条件矩阵设计方法,利用两个连续步骤之间的梯度差作为对角元素。这些对角元素与海森矩阵密切相关,可以被视为海森矩阵行向量与相邻参数向量之差的内积的近似。此外,我们引入了一种自动切换函数,使预条件矩阵可以在随机梯度下降(SGD)和适应优化器之间动态切换。基于这两种技术,我们开发了一种名为AGD的新优化器,它可以提高泛化性能。我们在自然语言处理(NLP)、计算机视觉(CV)和推荐系统(RecSys)的公共数据集上进行了实验,结果表明AGD优于当前最佳(SOTA)优化器,达到相当或显著更好的预测性能。此外,我们还分析了AGD在SGD和适应优化器之间自动切换的机制及其在不同场景下的实际效果。代码可以在https://github.com/intelligent-machine-learning/dlrover/tree/master/atorch/atorch/optimizers中找到。
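The following toy implementation illustrates one reading of the AGD idea described above; it is not the official optimizer (see the linked repository for that). The EMA constant, the switching threshold `delta`, and the exact switching rule are assumptions made for illustration.

```python
import numpy as np

class ToyAGD:
    def __init__(self, lr=1e-3, beta=0.9, delta=1e-8, eps=1e-8):
        self.lr, self.beta, self.delta, self.eps = lr, beta, delta, eps
        self.prev_grad = None
        self.b = None                      # EMA of squared gradient differences

    def step(self, params, grad):
        if self.prev_grad is None:
            self.prev_grad = np.zeros_like(grad)
            self.b = np.zeros_like(grad)
        diff = grad - self.prev_grad       # stepwise gradient difference
        self.b = self.beta * self.b + (1 - self.beta) * diff ** 2
        precond = np.sqrt(self.b)          # diagonal preconditioner
        # auto-switch: adaptive step where the preconditioner is informative,
        # plain SGD step where it is (numerically) zero
        denom = np.where(precond > self.delta, precond + self.eps, 1.0)
        self.prev_grad = grad.copy()
        return params - self.lr * grad / denom

opt = ToyAGD(lr=0.1)
w = np.ones(3)
for _ in range(5):
    g = 2 * w                              # gradient of ||w||^2
    w = opt.step(w, g)
print(w)                                   # shrinking toward the origin
```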
An End-to-End Network Pruning Pipeline with Sparsity Enforcement
results: 研究发现,使用这些方法可以实现剪枝神经网络模型,并且可以获得重要的性能提升。Abstract
Neural networks have emerged as a powerful tool for solving complex tasks across various domains, but their increasing size and computational requirements have posed significant challenges in deploying them on resource-constrained devices. Neural network sparsification, and in particular pruning, has emerged as an effective technique to alleviate these challenges by reducing model size, computational complexity, and memory footprint while maintaining competitive performance. However, many pruning pipelines modify the standard training pipeline at only a single stage, if at all. In this work, we look to develop an end-to-end training pipeline that befits neural network pruning and sparsification at all stages of training. To do so, we make use of nonstandard model parameter initialization, pre-pruning training methodologies, and post-pruning training optimizations. We conduct experiments utilizing combinations of these methods, in addition to different techniques used in the pruning step, and find that our combined pipeline can achieve significant gains over current state of the art approaches to neural network sparsification.
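A compact sketch of such an end-to-end pipeline, using PyTorch's built-in pruning utilities: a short pre-pruning training phase, global magnitude pruning, and post-pruning fine-tuning. The model, synthetic data, and 80% sparsity level are placeholders; the paper's specific initialization and sparsity-enforcing objectives are not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
x, y = torch.randn(256, 20), torch.randint(0, 2, (256,))

def train(steps):
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

train(50)                                    # pre-pruning training
layers = [(m, "weight") for m in model if isinstance(m, nn.Linear)]
prune.global_unstructured(layers,            # prune 80% of weights globally
                          pruning_method=prune.L1Unstructured, amount=0.8)
train(50)                                    # post-pruning fine-tuning
for m, name in layers:
    prune.remove(m, name)                    # bake the sparsity into weights
zeros = sum(int((m.weight == 0).sum()) for m, _ in layers)
total = sum(m.weight.numel() for m, _ in layers)
print(f"final sparsity: {zeros / total:.0%}")
```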
Robust Streaming, Sampling, and a Perspective on Online Learning
results: 本文得到了若干严谨的结果,证明了统计学习与流式数据处理之间的深刻联系。Abstract
In this work we present an overview of statistical learning, followed by a survey of robust streaming techniques and challenges, culminating in several rigorous results proving the relationship that we motivate and hint at throughout the journey. Furthermore, we unify often disjoint theorems in a shared framework and notation to clarify the deep connections that are discovered. We hope that by approaching these results from a shared perspective, already aware of the technical connections that exist, we can enlighten the study of both fields and perhaps motivate new and previously unconsidered directions of research.
摘要
在这项工作中,我们首先概述了统计学习,然后综述了鲁棒流处理技术及其挑战,最终给出了若干严谨的结果,证明了我们在全文中所阐述和暗示的联系。此外,我们将原本彼此独立的定理统一到共同的框架和符号体系中,以阐明其间发现的深刻联系。我们希望通过从共同视角出发、并预先了解已有的技术联系来审视这些结果,能够启发这两个领域的研究,并可能激发此前未曾考虑的新研究方向。
How Many Validation Labels Do You Need? Exploring the Design Space of Label-Efficient Model Ranking
results: 在多种选择指标下达到与完全标注数据集相当的结果,大幅降低标注成本,并在弱监督和半监督学习设置下有效指导提示选择。Abstract
The paper introduces LEMR, a framework that reduces annotation costs for model selection tasks. Our approach leverages ensemble methods to generate pseudo-labels, employs uncertainty sampling for target acquisition, and utilizes a Z-score mechanism for iterative committee reelection to refine model ranks. We present a systematic study across various selection metrics, demonstrating that LEMR achieves comparable results to fully labeled datasets with a fraction of the labeling budget. Our findings indicate that LEMR not only economizes the labeling effort in weak supervision and semi-supervised learning settings but also effectively guides prompt selection for large language models. With extensive experiments across 23 tasks, we reveal that our framework can dramatically decrease the labeling cost without compromising the accuracy of model selection, thereby offering a cost-effective alternative to traditional practices.
摘要
文章介绍了LEMR框架,该框架可以降低模型选择任务的标注成本。我们的方法利用集成方法生成伪标签,使用不确定性采样来获取目标样本,并使用Z-score机制进行迭代委员会重选以细化模型排名。我们在不同的选择指标下进行了系统性研究,展示了LEMR仅用一小部分标注预算即可取得与完全标注数据集相当的结果。我们的发现表明,LEMR不仅可以在弱监督和半监督学习设置下节省标注开销,还可以有效指导大语言模型的提示选择。我们在23个任务上进行了广泛的实验,发现LEMR可以大幅降低标注成本而不损害模型选择的准确性,从而提供一种经济的替代方案。
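The acquisition loop at the heart of such a framework can be sketched in a few lines: an ensemble of candidate models votes to produce pseudo-labels, the most disagreed-upon examples are sent to the oracle, and models are re-ranked on the resulting label mix. The Z-score committee reelection of the paper is reduced here to a plain accuracy ranking, and all data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_val = 5, 200
true_y = rng.integers(0, 2, n_val)
# predictions of each candidate model on the unlabeled validation pool:
# model m is right with probability 0.6 + 0.05 * m
preds = np.array([np.where(rng.random(n_val) < 0.6 + 0.05 * m,
                           true_y, 1 - true_y) for m in range(n_models)])

budget, labeled = 40, np.zeros(n_val, dtype=bool)
for _ in range(4):                       # four acquisition rounds
    vote = preds.mean(axis=0)            # ensemble soft vote in [0, 1]
    uncertainty = -np.abs(vote - 0.5)    # largest where the vote is split
    order = np.argsort(uncertainty)[::-1]
    pick = [i for i in order if not labeled[i]][: budget // 4]
    labeled[pick] = True                 # "query the oracle" for these

# rank models on oracle labels where available, pseudo-labels elsewhere
y_eff = np.where(labeled, true_y, (preds.mean(axis=0) > 0.5).astype(int))
scores = (preds == y_eff).mean(axis=1)
print("ranking (best first):", np.argsort(scores)[::-1],
      "| labels used:", int(labeled.sum()))
```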
Deep Learning-Driven Enhancement of Welding Quality Control: Predicting Welding Depth and Pore Volume in Hairpin Welding
results: 在小规模的焊接实验数据集上,深度学习网络取得了有前景的结果:预测焊接深度的MAE为0.1079,预测平均孔隙体积的MAE为0.0641。此外,验证过程显示了所提方法的可靠性。这些结果有望在控制焊接结果方面带来显著优势,超越当前仅依赖监测进行缺陷分类的趋势。Abstract
To advance quality assurance in the welding process, this study presents a robust deep learning model that enables the prediction of two critical welding Key Performance Characteristics (KPCs): welding depth and average pore volume. In the proposed approach, a comprehensive range of laser welding Key Input Characteristics (KICs) is utilized, including welding beam geometries, welding feed rates, path repetitions for weld beam geometries, and bright light weld ratios for all paths, all of which were obtained from hairpin welding experiments. Two deep learning networks are employed with multiple hidden dense layers and linear activation functions to showcase the capabilities of deep neural networks in capturing the intricate nonlinear connections inherent within welding KPCs and KICs. Applying deep learning networks to the small numerical experimental hairpin welding dataset has shown promising results, achieving Mean Absolute Error (MAE) values as low as 0.1079 for predicting welding depth and 0.0641 for average pore volume. Additionally, the validity verification demonstrates the reliability of the proposed method. This, in turn, promises significant advantages in controlling welding outcomes, moving beyond the current trend of relying merely on monitoring for defect classification.
摘要
为提升焊接过程的质量保证,本研究提出了一种鲁棒的深度学习模型,可以预测两个关键焊接性能特征(KPC):焊接深度和平均孔隙体积。所提方法使用了全面的激光焊接输入特征(KIC),包括焊接光束几何、焊接进给速率、焊接光束几何的路径重复次数以及所有路径的亮光焊接比例,这些数据均来自发卡(hairpin)焊接实验。研究采用了两个深度学习网络,各含多个隐藏全连接层和线性激活函数,以展示深度神经网络捕捉焊接KPC与KIC之间复杂非线性关系的能力。将深度学习网络应用于小规模的发卡焊接数值实验数据集,取得了有前景的结果:预测焊接深度的平均绝对误差(MAE)低至0.1079,预测平均孔隙体积的MAE低至0.0641。此外,有效性验证表明了所提方法的可靠性。这有助于直接控制焊接结果,超越了当前仅依赖监测进行缺陷分类的趋势。
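A minimal regression sketch in the spirit of the paper: a dense network maps the welding input characteristics (KICs) to the two target KPCs and is trained directly on the L1 loss, matching the reported MAE metric. The feature count, layer widths, ReLU activations, and synthetic data are assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_kics = 8                       # e.g. beam geometry, feed rate, repetitions...
net = nn.Sequential(
    nn.Linear(n_kics, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 2),            # outputs: [welding depth, avg pore volume]
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
loss_fn = nn.L1Loss()            # optimize MAE, the reported metric

x = torch.randn(120, n_kics)     # stand-in for the small experimental dataset
y = torch.rand(120, 2)
for epoch in range(200):
    opt.zero_grad()
    loss = loss_fn(net(x), y)
    loss.backward()
    opt.step()
print("final train MAE:", round(loss.item(), 4))
```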
ActiveClean: Generating Line-Level Vulnerability Data via Active Learning
results: 该论文的ActiveClean模型可以很好地生成行级漏洞数据,并且可以与现有的静态分析方法相媲美。在对4.3K个提交和119K行提交行的评估中,ActiveClean取得了70–74的F1分数。此外,该论文还表明,使用主动学习方法只需少量训练数据即可达到较高的准确率。Abstract
Deep learning vulnerability detection tools are increasing in popularity and have been shown to be effective. These tools rely on a large volume of high-quality training data, which is very hard to get. Most of the currently available datasets provide function-level labels, reporting whether a function is vulnerable or not vulnerable. However, for vulnerability detection to be useful, we also need to know the lines that are relevant to the vulnerability. This paper makes efforts towards developing systematic tools and proposes ActiveClean to generate a large volume of line-level vulnerability data from commits. That is, in addition to function-level labels, it also reports which lines in the function are likely responsible for vulnerability detection. In the past, static analysis has been applied to clean commits to generate line-level data. Our approach, based on active learning, is easy to use and scalable, and provides a complementary approach to static analysis. We designed semantic and syntactic properties from commit lines and use them to train the model. We evaluated our approach on both Java and C datasets, processing more than 4.3K commits and 119K commit lines. ActiveClean achieved an F1 score between 70 and 74. Further, we also show that active learning is effective by using just 400 training data points to reach an F1 score of 70.23. Using ActiveClean, we generate the line-level labels for the entire FFMpeg project in the Devign dataset, including 5K functions, and also detect incorrect function-level labels. We demonstrate that using our cleaned data, LineVul, a SOTA line-level vulnerability detection tool, detected 70 more vulnerable lines and 18 more vulnerable functions, and improved Top-10 accuracy from 66% to 73%.
摘要
深度学习漏洞检测工具日益流行,其有效性也已得到证明。这些工具需要大量高质量训练数据,但这类数据很难获得。现有的大多数数据集只提供函数级别的标签,报告一个函数是否存在漏洞。然而,要使漏洞检测真正有用,我们还需要知道与漏洞相关的代码行。本文致力于开发系统化工具,提出了ActiveClean,用于从提交(commit)中生成大量行级漏洞数据。即除了函数级标签外,它还报告函数中哪些行可能与漏洞相关。过去,静态分析曾被用于清理提交以生成行级数据;我们基于主动学习的方法易于使用且可扩展,为静态分析提供了互补方案。我们从提交行中提取语义和语法属性,并将其用于训练模型。我们在Java和C数据集上进行了评估,处理了超过4300个提交和11.9万行提交行,ActiveClean取得了70–74的F1分数。此外,我们还证明了主动学习的有效性:仅使用400条训练数据即可达到70.23的F1分数。使用ActiveClean,我们为Devign数据集中的整个FFMpeg项目(包括5000个函数)生成了行级标签,并检测出了错误的函数级标签。我们还展示了使用我们清理后的数据,当前最佳的行级漏洞检测工具LineVul多检测出70个漏洞行和18个漏洞函数,并将Top-10准确率从66%提升到73%。
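The core active-learning loop can be sketched as follows: a classifier over per-line semantic/syntactic features is trained on a small seed set and then iteratively queries labels for the lines it is least certain about. Features and labels below are synthetic stand-ins for real commit-line data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 16))          # per-line feature vectors
w_true = rng.normal(size=16)
y = (X @ w_true + 0.5 * rng.normal(size=2000) > 0).astype(int)

labeled = list(rng.choice(2000, 50, replace=False))   # small seed set
for _ in range(8):
    clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    proba = clf.predict_proba(X)[:, 1]
    uncertainty = -np.abs(proba - 0.5)                # margin sampling
    have = set(labeled)
    new = [i for i in np.argsort(uncertainty)[::-1] if i not in have][:50]
    labeled += new                                    # oracle labels them
print("accuracy on all lines:", round(clf.score(X, y), 3),
      "| labels used:", len(labeled))
```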
Scalable and Independent Learning of Nash Equilibrium Policies in $n$-Player Stochastic Games with Unknown Independent Chains
results: 该算法可以在多项式时间内收敛到 $\epsilon$-NE 策略集合(以平均 Nikaido-Isoda 差距这一较弱的距离衡量),并且在存在变分稳定纳什均衡策略的假设下,渐近收敛到稳定的 $\epsilon$-NE 策略。Abstract
We study a subclass of $n$-player stochastic games, namely, stochastic games with independent chains and unknown transition matrices. In this class of games, players control their own internal Markov chains whose transitions do not depend on the states/actions of other players. However, players' decisions are coupled through their payoff functions. We assume players can receive only realizations of their payoffs, and that the players can not observe the states and actions of other players, nor do they know the transition probability matrices of their own Markov chain. Relying on a compact dual formulation of the game based on occupancy measures and the technique of confidence set to maintain high-probability estimates of the unknown transition matrices, we propose a fully decentralized mirror descent algorithm to learn an $\epsilon$-NE for this class of games. The proposed algorithm has the desired properties of independence, scalability, and convergence. Specifically, under no assumptions on the reward functions, we show the proposed algorithm converges in polynomial time in a weaker distance (namely, the averaged Nikaido-Isoda gap) to the set of $\epsilon$-NE policies with arbitrarily high probability. Moreover, assuming the existence of a variationally stable Nash equilibrium policy, we show that the proposed algorithm converges asymptotically to the stable $\epsilon$-NE policy with arbitrarily high probability. In addition to Markov potential games and linear-quadratic stochastic games, this work provides another subclass of $n$-player stochastic games that, under some mild assumptions, admit polynomial-time learning algorithms for finding their stationary $\epsilon$-NE policies.
摘要
我们研究一类$n$人随机博弈的子类,即具有独立链和未知转移矩阵的随机博弈。在这类博弈中,每个玩家控制自己的内部马尔可夫链,其转移不依赖于其他玩家的状态/动作;但玩家的决策通过收益函数相互耦合。我们假设玩家只能观测到自身收益的实现值,既不能观测其他玩家的状态和动作,也不知道自身马尔可夫链的转移概率矩阵。基于占据测度的紧凑对偶形式,并利用置信集技术维持对未知转移矩阵的高概率估计,我们提出了一种完全去中心化的镜像下降算法,用于学习这类博弈的 $\epsilon$-NE 策略。该算法具有独立性、可扩展性和收敛性。具体来说,在不对收益函数作任何假设的情况下,我们证明该算法以任意高的概率、在多项式时间内按较弱的距离度量(即平均 Nikaido-Isoda 差距)收敛到 $\epsilon$-NE 策略集合。此外,在存在变分稳定纳什均衡策略的假设下,我们证明该算法以任意高的概率渐近收敛到稳定的 $\epsilon$-NE 策略。除了马尔可夫势博弈和线性二次随机博弈之外,本工作提供了又一类在若干温和假设下可在多项式时间内学习其平稳 $\epsilon$-NE 策略的$n$人随机博弈。
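The per-agent update underlying such a scheme is an entropic mirror-descent (multiplicative-weights) step on the probability simplex, sketched below; the occupancy-measure formulation and confidence-set machinery of the paper are omitted.

```python
import numpy as np

def mirror_step(policy, payoff_grad, eta=0.1):
    """Entropic mirror descent (ascent on payoff) on the simplex."""
    logits = np.log(policy) + eta * payoff_grad
    w = np.exp(logits - logits.max())      # stabilized exponentiation
    return w / w.sum()

policy = np.full(4, 0.25)                  # uniform over 4 actions
for t in range(100):
    grad = np.array([1.0, 0.5, 0.2, 0.1])  # stand-in payoff gradient
    policy = mirror_step(policy, grad)
print(policy)                              # mass concentrates on action 0
```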
RJHMC-Tree for Exploration of the Bayesian Decision Tree Posterior
methods: 使用马尔可夫链蒙特卡洛(MCMC)方法,其效果和效率取决于提议(proposal)的质量。本文探讨使用哈密顿蒙特卡洛(HMC)方法,通过在全局更新方案中利用似然的几何特性,更高效地探索贝叶斯决策树的后验分布。
results: HMC方法在预测测试精度、接受率和树复杂性方面表现优于现有方法。Abstract
Decision trees have found widespread application within the machine learning community due to their flexibility and interpretability. This paper is directed towards learning decision trees from data using a Bayesian approach, which is challenging due to the potentially enormous parameter space required to span all tree models. Several approaches have been proposed to combat this challenge, with one of the more successful being Markov chain Monte Carlo (MCMC) methods. The efficacy and efficiency of MCMC methods fundamentally rely on the quality of the so-called proposals, which is the focus of this paper. In particular, this paper investigates using a Hamiltonian Monte Carlo (HMC) approach to explore the posterior of Bayesian decision trees more efficiently by exploiting the geometry of the likelihood within a global update scheme. Two implementations of the novel algorithm are developed and compared to existing methods by testing against standard datasets in the machine learning and Bayesian decision tree literature. HMC-based methods are shown to perform favourably with respect to predictive test accuracy, acceptance rate, and tree complexity.
摘要
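For readers unfamiliar with HMC, the generic proposal mechanism the sampler builds on is a leapfrog integration of Hamiltonian dynamics followed by a Metropolis correction; the sketch below runs it on a toy Gaussian target, leaving out the tree-specific likelihood geometry.

```python
import numpy as np

def hmc_step(q, log_post, grad_log_post, eps=0.1, n_leapfrog=20, rng=None):
    rng = rng or np.random.default_rng()
    p = rng.normal(size=q.shape)                    # resample momentum
    q_new, p_new = q.copy(), p.copy()
    p_new += 0.5 * eps * grad_log_post(q_new)       # half step in momentum
    for _ in range(n_leapfrog - 1):
        q_new += eps * p_new
        p_new += eps * grad_log_post(q_new)
    q_new += eps * p_new
    p_new += 0.5 * eps * grad_log_post(q_new)       # final half step
    # Metropolis correction based on the Hamiltonian change
    h_old = -log_post(q) + 0.5 * p @ p
    h_new = -log_post(q_new) + 0.5 * p_new @ p_new
    return q_new if np.log(rng.random()) < h_old - h_new else q

# smoke test: sample a 2-D standard Gaussian
log_post = lambda q: -0.5 * q @ q
grad = lambda q: -q
q, samples = np.zeros(2), []
for _ in range(1000):
    q = hmc_step(q, log_post, grad)
    samples.append(q)
print(np.mean(samples, axis=0), np.std(samples, axis=0))  # ~0 and ~1
```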
methods: multi-locality parallelizable search algorithm called MUSE
results: + improves detection accuracy of quantum variational classifiers by 2.3 times on average + improves quality of predictions from negative to positive coefficients of determination on real-world regression datasets + classification and regression scores of quantum variational models trained with MUSE are on par with classical counterpartsAbstract
In this work, we address the problem of automating quantum variational machine learning. We develop a multi-locality parallelizable search algorithm, called MUSE, to find the initial points and the sets of parameters that achieve the best performance for quantum variational circuit learning. Simulations with five real-world classification datasets indicate that on average, MUSE improves the detection accuracy of quantum variational classifiers 2.3 times with respect to the observed lowest scores. Moreover, when applied to two real-world regression datasets, MUSE improves the quality of the predictions from negative coefficients of determination to positive ones. Furthermore, the classification and regression scores of the quantum variational models trained with MUSE are on par with the classical counterparts.
摘要
在这项工作中,我们解决了量子变分机器学习的自动化问题。我们开发了一种可并行的多局部性搜索算法MUSE,用于寻找使量子变分电路学习达到最佳性能的初始点和参数集。对五个真实世界分类数据集的仿真表明,相对于观测到的最低分数,MUSE平均将量子变分分类器的检测准确率提高了2.3倍;当应用于两个真实世界回归数据集时,MUSE将预测质量从负的决定系数提升到正值。此外,使用MUSE训练的量子变分模型的分类和回归得分与经典模型相当。
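A schematic stand-in for this kind of multi-locality search: evaluate a cost from many random initial points, screen the most promising localities, and refine each with a local optimizer. The toy cost below replaces the actual variational quantum circuit, and the screening rule is an assumption.

```python
import numpy as np
from scipy.optimize import minimize

def cost(theta):                    # stand-in for the variational loss
    return np.sum(np.sin(theta) ** 2) + 0.1 * np.sum(theta ** 2)

rng = np.random.default_rng(7)
starts = rng.uniform(-np.pi, np.pi, size=(32, 6))  # 32 localities, 6 params
coarse = sorted(starts, key=cost)[:4]              # screen by raw cost
results = [minimize(cost, s, method="BFGS") for s in coarse]
best = min(results, key=lambda r: r.fun)
print("best loss:", best.fun)
```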
Near-Optimal Algorithms for Gaussians with Huber Contamination: Mean Estimation and Linear Regression
methods: 该论文设计了样本近最优且几乎线性时间的算法,并使用多向滤波技术来解决这些问题。
results: 这篇论文提出了样本近最优且几乎线性时间的算法,用于 $\mathbb{R}^d$ 上的高斯稳健均值估计和稳健线性回归,可以在 $O(\epsilon)$ 的 $\ell_2$ 误差内逼近目标均值和目标回归器。这一结果优于先前的工作,并解决了文献中的一个公开问题。Abstract
We study the fundamental problems of Gaussian mean estimation and linear regression with Gaussian covariates in the presence of Huber contamination. Our main contribution is the design of the first sample near-optimal and almost linear-time algorithms with optimal error guarantees for both of these problems. Specifically, for Gaussian robust mean estimation on $\mathbb{R}^d$ with contamination parameter $\epsilon \in (0, \epsilon_0)$ for a small absolute constant $\epsilon_0$, we give an algorithm with sample complexity $n = \tilde{O}(d/\epsilon^2)$ and almost linear runtime that approximates the target mean within $\ell_2$-error $O(\epsilon)$. This improves on prior work that achieved this error guarantee with polynomially suboptimal sample and time complexity. For robust linear regression, we give the first algorithm with sample complexity $n = \tilde{O}(d/\epsilon^2)$ and almost linear runtime that approximates the target regressor within $\ell_2$-error $O(\epsilon)$. This is the first polynomial sample and time algorithm achieving the optimal error guarantee, answering an open question in the literature. At the technical level, we develop a methodology that yields almost-linear time algorithms for multi-directional filtering that may be of broader interest.
摘要
我们研究在Huber污染下、具有高斯协变量的高斯均值估计和线性回归这两个基本问题。我们的主要贡献是为这两个问题设计了首个样本近最优、几乎线性时间且具有最优误差保证的算法。具体来说,对于 $\mathbb{R}^d$ 上污染参数为 $\epsilon \in (0, \epsilon_0)$($\epsilon_0$ 为一个小的绝对常数)的高斯稳健均值估计,我们给出了一个样本复杂度为 $n = \tilde{O}(d/\epsilon^2)$、运行时间几乎线性的算法,可以在 $O(\epsilon)$ 的 $\ell_2$ 误差内逼近目标均值。先前的工作虽能达到同样的误差保证,但其样本和时间复杂度在多项式意义上是次优的。对于稳健线性回归,我们给出了首个样本复杂度为 $n = \tilde{O}(d/\epsilon^2)$、运行时间几乎线性、且能在 $O(\epsilon)$ 的 $\ell_2$ 误差内逼近目标回归器的算法。这是首个达到最优误差保证的多项式样本和时间算法,回答了文献中的一个公开问题。在技术层面,我们发展了一种可产生几乎线性时间多向滤波算法的方法,该方法可能具有更广泛的应用价值。
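A didactic, much-simplified cousin of the filtering approach: repeatedly estimate the weighted mean and covariance, project onto the top variance direction, and down-weight points with outlying scores. This is not the paper's near-optimal multi-directional filter, only an illustration of the filtering principle.

```python
import numpy as np

def filtered_mean(X, iters=20):
    w = np.ones(len(X))
    for _ in range(iters):
        mu = (w[:, None] * X).sum(0) / w.sum()
        C = ((w[:, None] * (X - mu)).T @ (X - mu)) / w.sum()
        evals, evecs = np.linalg.eigh(C)
        v = evecs[:, -1]                       # top variance direction
        scores = ((X - mu) @ v) ** 2
        w = w * np.clip(1 - scores / scores.max(), 0, 1)  # soft filtering
    return mu

rng = np.random.default_rng(0)
inliers = rng.normal(0, 1, size=(900, 5))
outliers = rng.normal(6, 1, size=(100, 5))     # 10% Huber contamination
X = np.vstack([inliers, outliers])
print("naive mean error:   ", np.linalg.norm(X.mean(0)))
print("filtered mean error:", np.linalg.norm(filtered_mean(X)))
```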
paper_authors: Yash Sanghvi, Yiheng Chi, Stanley H. Chan
for: 盲去卷积(blind deconvolution)
methods: 使用非盲去卷积求解器与扩散(diffusion)方法
results: 在合成与真实模糊数据集上达到了最先进的结果。Abstract
Blind deconvolution problems are severely ill-posed because neither the underlying signal nor the forward operator is known exactly. Conventionally, these problems are solved by alternating between estimation of the image and kernel while keeping the other fixed. In this paper, we show that this framework is flawed because of its tendency to get trapped in local minima and, instead, suggest the use of a kernel estimation strategy with a non-blind solver. This framework is employed by a diffusion method which is trained to sample the blur kernel from the conditional distribution with guidance from a pre-trained non-blind solver. The proposed diffusion method leads to state-of-the-art results on both synthetic and real blur datasets.
摘要
盲去卷积问题具有严重的不适定性,因为原始信号和前向算子都无法准确获知。传统上,这类问题通过交替估计图像和模糊核(固定其一、估计另一方)来求解。在这篇论文中,我们表明这种框架存在缺陷,因为它容易陷入局部极小值;我们转而建议采用结合非盲求解器的模糊核估计策略。该框架由一种扩散方法实现:在预训练的非盲求解器的引导下,训练扩散模型从条件分布中采样模糊核。所提出的扩散方法在合成和真实模糊数据集上均达到了最先进的结果。
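The non-blind solver such a pipeline relies on can be as simple as a regularized Wiener filter: given a (hypothesized) blur kernel, invert it in the Fourier domain. The sketch below shows only this component, not the diffusion-based kernel sampler.

```python
import numpy as np

def wiener_deconv(blurred, kernel, reg=1e-2):
    H = np.fft.fft2(kernel, s=blurred.shape)     # kernel transfer function
    G = np.fft.fft2(blurred)
    X = np.conj(H) * G / (np.abs(H) ** 2 + reg)  # regularized inverse
    return np.real(np.fft.ifft2(X))

rng = np.random.default_rng(0)
img = rng.random((64, 64))
kernel = np.outer(np.hanning(7), np.hanning(7))
kernel /= kernel.sum()
blurred = np.real(np.fft.ifft2(np.fft.fft2(img) *
                               np.fft.fft2(kernel, s=img.shape)))
restored = wiener_deconv(blurred, kernel)
print("MSE blurred :", np.mean((blurred - img) ** 2))
print("MSE restored:", np.mean((restored - img) ** 2))
```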
Deep learning acceleration of iterative model-based light fluence correction for photoacoustic tomography
results: 与传统迭代方法相比,使用FNO神经网络可以显著加速光通量(LF)校正过程,同时保持相当的校正质量。Abstract
Photoacoustic tomography (PAT) is a promising imaging technique that can visualize the distribution of chromophores within biological tissue. However, the accuracy of PAT imaging is compromised by light fluence (LF), which hinders the quantification of light absorbers. Currently, model-based iterative methods are used for LF correction, but they require significant computational resources due to repeated LF estimation based on differential light transport models. To improve LF correction efficiency, we propose to use Fourier neural operator (FNO), a neural network specially designed for solving differential equations, to learn the forward projection of light transport in PAT. Trained using paired finite-element-based LF simulation data, our FNO model replaces the traditional computational heavy LF estimator during iterative correction, such that the correction procedure is significantly accelerated. Simulation and experimental results demonstrate that our method achieves comparable LF correction quality to traditional iterative methods while reducing the correction time by over 30 times.
摘要
光声断层成像(PAT)是一种有前景的成像技术,可以显示生物组织内发色团的分布。然而,PAT成像的准确性受到光通量(LF)的影响,这妨碍了对光吸收体的定量分析。目前,LF校正采用基于模型的迭代方法,但由于需要基于差分光传输模型反复估计LF,这类方法需要大量计算资源。为了提高LF校正效率,我们提议使用傅里叶神经算子(FNO)——一种专为求解微分方程设计的神经网络——来学习PAT中光传输的前向投影。我们的FNO模型使用成对的基于有限元的LF仿真数据进行训练,在迭代校正中替换传统的计算繁重的LF估计器,从而显著加速校正过程。仿真和实验结果表明,我们的方法可以达到与传统迭代方法相当的LF校正质量,同时将校正时间缩短30倍以上。
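The building block of an FNO is the spectral convolution: transform to the Fourier domain, keep a truncated set of low modes, apply learned complex weights, and transform back. A minimal PyTorch version is sketched below; the mode count and channel widths are illustrative.

```python
import torch
import torch.nn as nn

class SpectralConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, modes=12):
        super().__init__()
        self.modes = modes
        scale = 1 / (in_ch * out_ch)
        self.w = nn.Parameter(scale * torch.randn(
            in_ch, out_ch, modes, modes, dtype=torch.cfloat))

    def forward(self, x):                        # x: (B, C, H, W)
        x_ft = torch.fft.rfft2(x)                # (B, C, H, W//2 + 1)
        out = torch.zeros(x.size(0), self.w.size(1), x.size(2),
                          x.size(3) // 2 + 1, dtype=torch.cfloat)
        m = self.modes                           # keep only low modes
        out[:, :, :m, :m] = torch.einsum(
            "bixy,ioxy->boxy", x_ft[:, :, :m, :m], self.w)
        return torch.fft.irfft2(out, s=x.shape[-2:])

fluence_net = nn.Sequential(SpectralConv2d(1, 16), nn.GELU(),
                            SpectralConv2d(16, 1))
print(fluence_net(torch.randn(2, 1, 64, 64)).shape)  # (2, 1, 64, 64)
```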
Accelerated Parallel Magnetic Resonance Imaging with Compressed Sensing using Structured Sparsity
results: 本文的结果表明,通过修改Sparse SENSE算法以利用结构化稀疏性,可以用更少的采样重建高质量图像,从而进一步缩短扫描时间。Abstract
Compressed sensing is an imaging paradigm that allows one to invert an underdetermined linear system by imposing the a priori knowledge that the sought after solution is sparse (i.e., mostly zeros). Previous works have shown that if one also knows something about the sparsity pattern (the locations where non-zero entries exist), one can take advantage of this structure to improve the quality of the result. A significant application of compressed sensing is magnetic resonance imaging (MRI), where samples are acquired in the Fourier domain. Compressed sensing allows one to reconstruct a high-quality image with fewer samples which can be collected with a faster scan. This increases the robustness of MRI to patient motion since less motion is possible during the shorter scan. Parallel imaging, where multiple coils are used to gather data, is another an more ubiquitously used method for accelerating MRI. Existing combinations of these acceleration methods, such as Sparse SENSE, yield high quality images with an even shorter scan time than either technique alone. In this work, we show how to modify Sparse SENSE with structured sparsity to reconstruct a high quality image with even fewer samples.
摘要
压缩感知是一种成像范式,它基于所求解是稀疏的(即大部分为零)这一先验知识,对欠定线性系统进行求逆。先前的工作表明,如果还了解稀疏模式(非零元素所在的位置),就可以利用这种结构来改善结果的质量。压缩感知的一个重要应用是磁共振成像(MRI),其采样在傅里叶域中进行。压缩感知可以用更少的采样重建高质量图像,从而实现更快的扫描;扫描时间越短,患者可能发生的运动越少,MRI对患者运动的鲁棒性也就越强。并行成像使用多个线圈采集数据,是另一种应用更为广泛的MRI加速方法。现有的两类加速方法的组合(如Sparse SENSE)能以比单独使用任一技术更短的扫描时间获得高质量图像。在这项工作中,我们展示了如何利用结构化稀疏性修改Sparse SENSE,以更少的采样重建高质量图像。
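The classical reconstruction engine behind compressed sensing can be stated in a few lines of ISTA (iterative soft-thresholding); exploiting a known sparsity pattern would amount to thresholding less aggressively on the expected support. Problem sizes below are toy values.

```python
import numpy as np

def ista(A, y, lam=0.05, iters=500):
    L = np.linalg.norm(A, 2) ** 2       # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        g = x - A.T @ (A @ x - y) / L   # gradient step on 0.5||Ax - y||^2
        x = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # shrinkage
    return x

rng = np.random.default_rng(0)
n, m, k = 256, 64, 8                    # signal dim, measurements, sparsity
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.normal(size=k)
A = rng.normal(size=(m, n)) / np.sqrt(m)
y = A @ x_true
x_hat = ista(A, y)
print("relative error:",
      np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```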
Coronary Atherosclerotic Plaque Characterization with Photon-counting CT: a Simulation-based Feasibility Study
paper_authors: Mengzhou Li, Mingye Wu, Jed Pack, Pengwei Wu, Bruno De Man, Adam Wang, Koen Nieman, Ge Wang
for: This paper is written to investigate the imaging capabilities of a deep-silicon photon-counting detector (PCCT) for coronary plaque characterization, with a focus on spatial resolution, noise, motion artifacts, radiation dose, and spectral characterization.
methods: The paper uses a systematic simulation study with a clinically-relevant digital plaque phantom to evaluate the performance of the deep-silicon PCCT scanner. The simulation study considers realistic geometrical parameters and chemical compositions of plaques.
results: The simulation results suggest that the deep-silicon PCCT design provides adequate spatial resolution for visualizing a necrotic core and quantitation of key plaque features. Advanced denoising techniques and aggressive bowtie filter designs can keep image noise to acceptable levels at this resolution while keeping radiation dose comparable to that of a conventional CT scan. However, the ultrahigh resolution of PCCT also means an elevated sensitivity to motion artifacts, and accurate motion correction methods are required for best plaque imaging quality.
results: 仿真结果表明,深硅PCCT设计可以提供足够的空间分辨率,用于观察坏死核心并定量关键斑块特征。先进的去噪技术和激进的领结(bowtie)滤波器设计可以在该分辨率下将图像噪声保持在可接受水平,同时使辐射剂量与传统CT扫描相当。但是,PCCT的超高分辨率也意味着对运动伪影的敏感性增加。Abstract
Recent development of photon-counting CT (PCCT) brings great opportunities for plaque characterization with much-improved spatial resolution and spectral imaging capability. While existing coronary plaque PCCT imaging results are based on detectors made of CZT or CdTe materials, deep-silicon photon-counting detectors have unique performance characteristics and promise distinct imaging capabilities. In this work, we report a systematic simulation study of a deep-silicon PCCT scanner with a new clinically-relevant digital plaque phantom with realistic geometrical parameters and chemical compositions. This work investigates the effects of spatial resolution, noise, motion artifacts, radiation dose, and spectral characterization. Our simulation results suggest that the deep-silicon PCCT design provides adequate spatial resolution for visualizing a necrotic core and quantitation of key plaque features. Advanced denoising techniques and aggressive bowtie filter designs can keep image noise to acceptable levels at this resolution while keeping radiation dose comparable to that of a conventional CT scan. The ultrahigh resolution of PCCT also means an elevated sensitivity to motion artifacts. It is found that a tolerance of less than 0.4 mm residual movement range requires the application of accurate motion correction methods for best plaque imaging quality with PCCT.
摘要
近年来,光子计数CT(PCCT)的发展为斑块表征带来了巨大机遇,其空间分辨率和能谱成像能力均大幅提升。现有的冠状动脉斑块PCCT成像结果基于CZT或CdTe材料的探测器,而深硅光子计数探测器具有独特的性能特点,有望提供不同的成像能力。在这项工作中,我们报告了一项针对深硅PCCT扫描仪的系统性仿真研究,采用了一个新的、具有临床相关性的数字斑块体模,其几何参数和化学成分均符合实际。这项工作研究了空间分辨率、噪声、运动伪影、辐射剂量和能谱表征等方面的影响。仿真结果表明,深硅PCCT设计可以提供足够的空间分辨率,用于观察坏死核心并定量关键斑块特征。先进的去噪技术和激进的领结滤波器设计可以在该分辨率下将图像噪声保持在可接受水平,同时使辐射剂量与传统CT扫描相当。PCCT的超高分辨率也意味着对运动伪影的敏感性增加:研究发现,残余运动范围需小于0.4毫米,因此需要精确的运动校正方法才能获得最佳的斑块成像质量。
results: 研究结果显示,使用机器学习技术可以提高基于时间的定位精度;而在基于时间与基于指纹的方法之间,哪种方法更优取决于室内环境以及用于指纹定位的参考用户位置的密度和分布。Abstract
High-accuracy positioning has gained significant interest for many use-cases across various domains such as industrial internet of things (IIoT), healthcare and entertainment. Radio frequency (RF) measurements are widely utilized for user localization. However, challenging radio conditions such as non-line-of-sight (NLOS) and multipath propagation can deteriorate the positioning accuracy. Machine learning (ML)-based estimators have been proposed to overcome these challenges. RF measurements can be utilized for positioning in multiple ways resulting in time-based, angle-based and fingerprinting-based methods. Different methods, however, impose different implementation requirements to the system, and may perform differently in terms of accuracy for a given setting. In this paper, we use artificial neural networks (ANNs) to realize time-of-arrival (ToA)-based and channel impulse response (CIR) fingerprinting-based positioning. We compare their performance for different indoor environments based on real-world ultra-wideband (UWB) measurements. We first show that using ML techniques helps to improve the estimation accuracy compared to conventional techniques for time-based positioning. When comparing time-based and fingerprinting schemes using ANNs, we show that the favorable method in terms of positioning accuracy is different for different environments, where the accuracy is affected not only by the radio propagation conditions but also the density and distribution of reference user locations used for fingerprinting.
摘要
高精度定位在工业物联网(IIoT)、医疗和娱乐等多个领域受到广泛关注。射频(RF)测量被广泛用于用户定位。然而,非视距(NLOS)和多径传播等恶劣无线条件会降低定位精度。为克服这些挑战,研究者提出了基于机器学习(ML)的估计器。RF测量可以通过多种方式用于定位,形成基于时间、基于角度和基于指纹的方法。不同方法对系统提出不同的实现要求,在给定场景下的精度表现也可能不同。在本文中,我们使用人工神经网络(ANN)实现基于到达时间(ToA)的定位和基于信道冲激响应(CIR)指纹的定位,并基于真实的超宽带(UWB)测量数据比较它们在不同室内环境中的性能。我们首先表明,对于基于时间的定位,使用ML技术有助于提高估计精度。在比较基于ANN的时间方案与指纹方案时,我们表明在不同环境中精度更优的方法并不相同:精度不仅受无线传播条件影响,还受用于指纹定位的参考用户位置的密度和分布影响。
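As a point of comparison for the ANN estimators, classical ToA multilateration linearizes the range equations around a reference anchor and solves a least-squares system. Anchor layout and noise level in the sketch are made-up demo values.

```python
import numpy as np

def toa_position(anchors, ranges):
    """Least-squares 2-D position from ranges to known anchors."""
    x0, r0 = anchors[0], ranges[0]
    # subtracting anchor 0's range equation cancels ||p||^2:
    # 2 (a_i - a_0) . p = r_0^2 - r_i^2 + ||a_i||^2 - ||a_0||^2
    A = 2 * (anchors[1:] - x0)
    b = (r0 ** 2 - ranges[1:] ** 2
         + np.sum(anchors[1:] ** 2, axis=1) - np.sum(x0 ** 2))
    return np.linalg.lstsq(A, b, rcond=None)[0]

rng = np.random.default_rng(3)
anchors = np.array([[0., 0.], [10., 0.], [0., 10.], [10., 10.]])
target = np.array([3.2, 7.5])
ranges = np.linalg.norm(anchors - target, axis=1) + 0.05 * rng.normal(size=4)
print("estimate:", toa_position(anchors, ranges))   # close to [3.2, 7.5]
```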
Distributed Optimization with Feasible Set Privacy
results: 我们证明,与先使用基于SPIR的私有集合求交(PSI)协议私下获得 P_1∩P_2、再寻找最优解的方案相比,我们的方案具有更低的信息泄漏和下载成本;并且对于 f 到固定取值范围的所有可能的均匀映射,我们的方案以高概率表现更好。Abstract
We consider the setup of a constrained optimization problem with two agents $E_1$ and $E_2$ who jointly wish to learn the optimal solution set while keeping their feasible sets $\mathcal{P}_1$ and $\mathcal{P}_2$ private from each other. The objective function $f$ is globally known and each feasible set is a collection of points from a global alphabet. We adopt a sequential symmetric private information retrieval (SPIR) framework where one of the agents (say $E_1$) privately checks in $\mathcal{P}_2$, the presence of candidate solutions of the problem constrained to $\mathcal{P}_1$ only, while learning no further information on $\mathcal{P}_2$ than the solution alone. Further, we extract an information theoretically private threshold PSI (ThPSI) protocol from our scheme and characterize its download cost. We show that, compared to privately acquiring the feasible set $\mathcal{P}_1\cap \mathcal{P}_2$ using an SPIR-based private set intersection (PSI) protocol, and finding the optimum, our scheme is better as it incurs less information leakage and less download cost than the former. Over all possible uniform mappings of $f$ to a fixed range of values, our scheme outperforms the former with a high probability.
摘要
我们考虑一个带约束的优化问题,其中两个代理($E_1$和$E_2$)希望共同学习最优解集,同时使各自的可行集$\mathcal{P}_1$和$\mathcal{P}_2$对对方保密。目标函数$f$是全局已知的,每个可行集是全局字母表中若干点的集合。我们采用顺序对称私有信息检索(SPIR)框架:其中一个代理(设为$E_1$)在$\mathcal{P}_2$中私下检查仅受$\mathcal{P}_1$约束的候选解是否存在,且除解本身之外不获得关于$\mathcal{P}_2$的任何额外信息。此外,我们从该方案中提取出一个信息论意义上私密的阈值PSI(ThPSI)协议,并刻画其下载成本。我们表明,与先使用基于SPIR的私有集合求交(PSI)协议私下获得$\mathcal{P}_1\cap \mathcal{P}_2$、再寻找最优解的方案相比,我们的方案信息泄漏更少、下载成本更低;并且对于$f$到固定取值范围的所有可能的均匀映射,我们的方案以高概率优于前者。
Joint State and Input Estimation for Linear Dynamical Systems with Sparse Control
results: 作者通过在输入上使用不同的先验分布来促进输入估计的稀疏性,并将方法扩展到具有公共支撑的控制输入。他们的算法在精度和时间/内存复杂度方面明显优于现有方法,特别是在低维测量情形下。Abstract
Sparsity constraints on the control inputs of a linear dynamical system naturally arise in several practical applications such as networked control, computer vision, seismic signal processing, and cyber-physical systems. In this work, we consider the problem of jointly estimating the states and sparse inputs of such systems from low-dimensional (compressive) measurements. Due to the low-dimensional measurements, conventional Kalman filtering and smoothing algorithms fail to accurately estimate the states and inputs. We present a Bayesian approach that exploits the input sparsity to significantly improve estimation accuracy. Sparsity in the input estimates is promoted by using different prior distributions on the input. We investigate two main approaches: regularizer-based MAP, and {Bayesian learning-based estimation}. We also extend the approaches to handle control inputs with common support and analyze the time and memory complexities of the presented algorithms. Finally, using numerical simulations, we show that our algorithms outperform the state-of-the-art methods in terms of accuracy and time/memory complexities, especially in the low-dimensional measurement regime.
Fixed-point methods for long-term power control and beamforming design in large-scale MIMO
paper_authors: Lorenzo Miretti, Renato L. G. Cavalcante, Sławomir Stańczak
for: 解决大规模MIMO系统中此前悬而未决的联合功率控制与波束成形设计问题
methods: 使用固定点方法解决问题
results: 提出了一种基于信道统计的长期功率控制和波束成形设计方法,可以缓解竞争性短期优化算法所面临的严重可扩展性问题,并通过数值仿真将所得最优算法与现有的短期和长期方法进行了比较。Abstract
This study presents novel applications of fixed-point methods to solve previously open joint power control and beamforming design problems in modern large-scale MIMO systems, e.g., based on the cell-free massive MIMO and XL-MIMO concepts. In particular, motivated by the need for scalable system architectures, we revisit the classical sum power minimization and max-min fair design criteria by considering long-term power control and beamforming design based on channel statistics and possibly limited channel state information (CSI) sharing across distributed processing units. This approach is believed to mitigate the severe scalability issues of competing short-term optimal algorithms in the literature, which must be executed for every channel realization by a central controller endowed with global CSI, hence imposing very demanding requirements in terms of computation and interconnection capabilities. The obtained optimal algorithms are then illustrated and compared against existing short-term and long-term approaches via numerical simulations in a cell-free massive MIMO setup.
摘要
本研究提出了定点方法的新应用,用于解决现代大规模MIMO系统(例如基于无蜂窝大规模MIMO和XL-MIMO概念的系统)中此前悬而未决的联合功率控制与波束成形设计问题。特别地,出于对可扩展系统架构的需求,我们重新审视经典的总功率最小化与最大最小公平设计准则,考虑基于信道统计、且分布式处理单元之间信道状态信息(CSI)共享可能受限的长期功率控制与波束成形设计。这种方法有望缓解文献中竞争性短期最优算法的严重可扩展性问题:那些算法必须由掌握全局CSI的中央控制器针对每个信道实现执行,因而对计算和互连能力提出了非常苛刻的要求。最后,我们在无蜂窝大规模MIMO场景中通过数值仿真展示所得的最优算法,并与现有的短期和长期方法进行比较。
Optimal Dual-Polarized Planar Arrays for Massive Capacity Over Point-to-Point MIMO Channels
results: 论文的数值结果表明,通过采用合适的天线间距和天线配置,可以实现远超1 Tbps的极高数据传输速率,并且一个大型基站可以服务尺寸可实现的移动设备。Abstract
Future wireless networks must provide ever higher data rates. The available bandwidth increases roughly linearly as we increase the carrier frequency, but the range shrinks drastically. This paper explores if we can instead reach massive capacities using spatial multiplexing over multiple-input multiple-output (MIMO) channels. In line-of-sight (LOS) scenarios, therank of the MIMO channel matrix depends on the polarization and antenna arrangement. We optimize the rank and condition number by identifying the optimal antenna spacing in dual-polarized planar antenna arrays with imperfect isolation. The result is sparely spaced antenna arrays that exploit radiative near-field properties. We further optimize the array geometry for minimum aperture length and aperture area, which leads to different configurations. Moreover, we prove analytically that for fixed-sized arrays, the MIMO rank grows quadratically with the carrier frequency in LOS scenarios, if the antennas are appropriately designed. Hence, MIMO technology contributes more to the capacity growth than the bandwidth. The numerical results show that massive data rates, far beyond 1 Tbps, can be reached both over fixed point-to-point links. It is also possible for a large base station to serve a practically-sized mobile device.
摘要
未来无线网络必须提供越来越高的数据速率。随着载波频率的提高,可用带宽大致线性增加,但覆盖范围急剧缩小。本文探讨能否转而利用多输入多输出(MIMO)信道上的空间复用来实现海量容量。在视距(LOS)场景下,MIMO信道矩阵的秩取决于极化方式和天线布置。我们通过确定双极化平面天线阵列(考虑非理想隔离)中的最优天线间距来优化秩和条件数,得到利用辐射近场特性的稀疏天线阵列。我们进一步针对最小孔径长度和孔径面积优化阵列几何,得到不同的配置。此外,我们从理论上证明,对于固定尺寸的阵列,只要天线设计得当,LOS场景下MIMO秩随载波频率二次增长;因此MIMO技术对容量增长的贡献超过带宽。数值结果表明,在固定点对点链路上可以达到远超1 Tbps的海量数据速率,大型基站也可以服务尺寸可实现的移动设备。
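The spacing argument can be checked numerically in a few lines: build the LOS channel between two uniform linear arrays and inspect the singular values as the antenna spacing varies. The carrier frequency, link distance, and array size below are arbitrary demo values; the second spacing is the classical Rayleigh spacing sqrt(lambda*D/n) for a full-rank LOS ULA link.

```python
import numpy as np

def los_channel(n, spacing, dist, wavelength):
    tx = spacing * np.arange(n)            # 1-D array coordinates
    rx = spacing * np.arange(n)
    d = np.sqrt(dist ** 2 + (rx[:, None] - tx[None, :]) ** 2)  # path lengths
    return np.exp(-2j * np.pi * d / wavelength) / d

wavelength = 3e8 / 100e9                   # 100 GHz carrier
dist, n = 50.0, 8                          # 50 m link, 8 antennas per side
for spacing in (0.5 * wavelength, np.sqrt(wavelength * dist / n)):
    s = np.linalg.svd(los_channel(n, spacing, dist, wavelength),
                      compute_uv=False)
    print(f"spacing {spacing:.3f} m -> condition number {s[0] / s[-1]:.1e}")
```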
Kirchhoff Meets Johnson: In Pursuit of Unconditionally Secure Communication
results: 论文提出了一种基于噪声的安全键交换方案,并证明了其可以实现无条件安全的通信系统。Abstract
Noise: an enemy to be dealt with and a major factor limiting communication system performance. However, what if there is gold in that garbage? In conventional engineering, our focus is primarily on eliminating, suppressing, combating, or even ignoring noise and its detrimental impacts. Conversely, could we exploit it similarly to biology, which utilizes noise-alike carrier signals to convey information? In this context, the utilization of noise, or noise-alike signals in general, has been put forward as a means to realize unconditionally secure communication systems in the future. In this tutorial article, we begin by tracing the origins of thermal noise-based communication and highlighting one of its significant applications for ensuring unconditionally secure networks: the Kirchhoff-law-Johnson-noise (KLJN) secure key exchange scheme. We then delve into the inherent challenges tied to secure communication and discuss the imperative need for physics-based key distribution schemes in pursuit of unconditional security. Concurrently, we provide a concise overview of quantum key distribution (QKD) schemes and draw comparisons with their KLJN-based counterparts. Finally, extending beyond wired communication loops, we explore the transmission of noise signals over-the-air and evaluate their potential for stealth and secure wireless communication systems.
摘要
噪声:一个需要对付的敌人,也是限制通信系统性能的主要因素。然而,如果那些"垃圾"中藏有黄金呢?在传统工程中,我们主要专注于消除、抑制、对抗甚至忽略噪声及其不利影响。反过来,我们能否像生物系统利用类噪声载波信号传递信息那样加以利用?在这一背景下,利用噪声(或广义的类噪声信号)已被提出作为未来实现无条件安全通信系统的一种途径。在这篇教程文章中,我们首先追溯基于热噪声的通信的起源,并重点介绍其在确保无条件安全网络方面的一个重要应用:基尔霍夫定律-约翰逊噪声(KLJN)安全密钥交换方案。随后,我们深入探讨安全通信固有的挑战,并说明在追求无条件安全时基于物理的密钥分发方案的必要性。同时,我们简要概述量子密钥分发(QKD)方案,并将其与基于KLJN的方案进行比较。最后,在有线通信回路之外,我们探讨噪声信号的空中传输,并评估其在隐蔽与安全无线通信系统中的潜力。
Augmenting Channel Charting with Classical Wireless Source Localization Techniques
paper_authors: Florian Euchner, Phillip Stephan, Stephan ten Brink
for: Channel Charting aims to construct a map of the radio environment by leveraging similarity relationships found in high-dimensional channel state information, with the goal of improving localization performance.
methods: The paper compares classical source localization techniques to Channel Charting with respect to localization performance, and suggests and evaluates methods to enhance Channel Charting with model-based localization approaches.
results: The paper demonstrates that Channel Charting can outperform classical localization methods on the considered dataset, and suggests incorporating information from model-based approaches during the training of the forward charting function for improved performance.Abstract
Channel Charting aims to construct a map of the radio environment by leveraging similarity relationships found in high-dimensional channel state information. Although resulting channel charts usually accurately represent local neighborhood relationships, even under conditions with strong multipath propagation, they often fall short in capturing global geometric features. On the other hand, classical model-based localization methods, such as triangulation and multilateration, can easily localize signal sources in the global coordinate frame. However, these methods rely heavily on the assumption of line-of-sight channels and distributed antenna deployments. Based on measured data, we compare classical source localization techniques to channel charts with respect to localization performance. We suggest and evaluate methods to enhance Channel Charting with model-based localization approaches: One approach involves using information derived from classical localization methods to map channel chart locations to physical positions after conventional training of the forward charting function. Foremost, though, we suggest to incorporate information from model-based approaches during the training of the forward charting function in what we call "augmented Channel Charting". We demonstrate that Channel Charting can outperform classical localization methods on the considered dataset.
摘要
信道图(Channel Charting)旨在利用高维信道状态信息中的相似性关系构建无线环境的地图。尽管即使在强多径传播条件下,所得信道图通常也能准确表示局部邻域关系,但它们往往难以捕捉全局几何特征。相反,经典的基于模型的定位方法(如三角测量和多边测量)可以轻松地在全局坐标系中定位信号源,但这些方法严重依赖视距信道和分布式天线部署的假设。基于实测数据,我们从定位性能的角度比较了经典信源定位技术与信道图。我们提出并评估了用基于模型的定位方法增强信道图的途径:一种做法是在常规训练前向制图函数之后,利用经典定位方法得到的信息将信道图位置映射到物理位置;但我们更主要的建议是在前向制图函数的训练过程中就融入基于模型的方法所提供的信息,我们称之为"增强信道图"(augmented Channel Charting)。我们证明了信道图在所考虑的数据集上可以超越经典定位方法。
Intelligent Reflecting Surface-Aided Electromagnetic Stealth Against Radar Detection
paper_authors: Beixiong Zheng, Xue Xiong, Jie Tang, Rui Zhang
for: 提出一种安装于目标上的、基于智能反射面(IRS)的电磁隐身系统,以规避雷达检测,实现灵活、自适应且具有成本效益的电磁隐身。
methods: 利用IRS的可调无源反射元件实现电磁隐身,并通过最小化所有敌方雷达接收信号的总功率来优化IRS在目标处的反射。
results: 仿真验证了所提出的IRS辅助电磁隐身系统及其IRS反射设计的性能优势,包括更好的隐身性能和较低的计算复杂度。Abstract
While traditional electromagnetic stealth materials/metasurfaces can render a target virtually invisible to some extent, they lack flexibility and adaptability, and can only operate within a limited frequency and angle/direction range, making it challenging to ensure the expected stealth performance. In view of this, we propose in this paper a new intelligent reflecting surface (IRS)-aided electromagnetic stealth system mounted on targets to evade radar detection, by utilizing the tunable passive reflecting elements of IRS to achieve flexible and adaptive electromagnetic stealth in a cost-effective manner. Specifically, we optimize the IRS's reflection at the target to minimize the sum received signal power of all adversary radars. We first address the IRS's reflection optimization problem using the Lagrange multiplier method and derive a semi-closed-form optimal solution for the single-radar setup, which is then generalized to the multi-radar case. To meet real-time processing requirements, we further propose low-complexity closed-form solutions based on the reverse alignment/cancellation and minimum mean-square error (MMSE) criteria for the single-radar and multi-radar cases, respectively. Additionally, we propose practical low-complexity estimation schemes at the target to acquire angle-of-arrival (AoA) and/or path gain information via a small number of receive sensing devices. Simulation results validate the performance advantages of our proposed IRS-aided electromagnetic stealth system with the proposed IRS reflection designs.
摘要
传统电磁隐身材料/meta表面可以让目标在一定程度上 render 为无形visible,但它们缺乏 flexibility 和适应性,仅能在有限的频率和方向范围内运作,从而增加隐身性的挑战。为了解决这问题,我们在这篇文章中提出了一个新的智能反射表面(IRS)-支持的电磁隐身系统,通过利用 IRS 的可调的静电反射元件来实现可靠且适应的电磁隐身,在成本效益之间实现。具体来说,我们将 IRS 的反射问题优化为最小化所有敌人激光器的总接收信号功率。我们首先使用拉格朗日矩法来解决这问题,并 derive 一个半关注解的最佳解决方案 для单激光设置。然后,我们将这个解决方案扩展到多激光场景中。为了遵循实时处理需求,我们还提出了一些低复杂性的关联/抵销和最小平均方差(MMSE)的解决方案,分别适用于单激光和多激光场景。此外,我们还提出了实际的低复杂性估计方案,以确定目标上的射线来源信息。 simulation results 验证了我们提出的 IRS-aided 电磁隐身系统的性能优势。
Consensus-Based Distributed Nonlinear Filtering with Kernel Mean Embedding
for: 该文章提出了一种基于一致性协议的分布式非线性滤波器,填补了分布式非线性动态系统中利用KME逼近后验密度的空白。
methods: 该滤波器使用核均值嵌入(KME)将系统状态嵌入到更高维的再生核希尔伯特空间(RKHS)中,然后将非线性量测函数线性化转换,由此在RKHS中建立了后验分布KME的更新规则。
results: 所提出的分布式滤波器在保持分布式模式的同时,能够达到集中式滤波器的估计精度。两个目标跟踪示例(分别为近匀速运动目标和转弯目标)验证了所开发滤波器的有效性。Abstract
This paper proposes a consensus-based distributed nonlinear filter with kernel mean embedding (KME), filling the gap of posterior density approximation with KME for distributed nonlinear dynamic systems. To approximate the posterior density, the system state is embedded into a higher-dimensional reproducing kernel Hilbert space (RKHS), and then the nonlinear measurement function is linearly converted. As a result, an update rule for the KME of the posterior distribution is established in the RKHS. To show that the proposed distributed filter is capable of achieving the centralized estimation accuracy, a centralized filter, serving as an extension of the standard Kalman filter from the state space to the RKHS, is developed first. Benefiting from the KME, the proposed distributed filter converges to the centralized one while maintaining the distributed pattern. Two examples are introduced to demonstrate the effectiveness of the developed filters in target tracking scenarios, including a nearly constant-velocity target and a turning target, with bearing-only and range-and-bearing measurements, respectively.
摘要
这篇论文提出了一种基于一致性协议的分布式非线性滤波器,使用核均值嵌入(KME),填补了分布式非线性动态系统中利用KME逼近后验密度的空白。为逼近后验密度,系统状态被嵌入到一个更高维的再生核希尔伯特空间(RKHS)中,然后将非线性量测函数线性化转换,从而在RKHS中建立了后验分布KME的更新规则。为了证明所提出的分布式滤波器能够达到集中式估计精度,我们首先开发了一个集中式滤波器,它是标准卡尔曼滤波器从状态空间到RKHS的推广。得益于KME,所提出的分布式滤波器在保持分布式模式的同时收敛于集中式滤波器。文中通过两个目标跟踪算例(分别为近匀速运动目标和转弯目标,采用仅方位角量测以及距离加方位角量测)验证了所开发滤波器的有效性。
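The KME idea itself fits in a few lines: an empirical distribution is embedded as the average feature map, after which distances between embedded posteriors reduce to kernel evaluations (the MMD). The Gaussian kernel width and toy samples below are arbitrary, and the estimator shown is the simple biased one.

```python
import numpy as np

def rbf(a, b, gamma=0.5):
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 2))     # samples of one posterior
Y = rng.normal(0.2, 1.0, size=(500, 2))     # a slightly shifted posterior

# squared MMD = ||mu_X - mu_Y||^2 in the RKHS, from kernel values only
mmd2 = rbf(X, X).mean() - 2 * rbf(X, Y).mean() + rbf(Y, Y).mean()
print("squared MMD between the two embedded posteriors:", mmd2)
```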
Highly Accelerated Weighted MMSE Algorithms for Designing Precoders in FDD Systems with Incomplete CSI
paper_authors: Donia Ben Amor, Michael Joham, Wolfgang Utschick
for: 这篇论文的目的是推导多用户多输入单输出(MISO)系统在频分双工(FDD)模式下基于训练的可达下行(DL)和速率(SR)的下界。
results: 该论文通过数值研究表明,提议的方法在具有有限通道知识的场景下(即具有很少的射频)是有效的。此外,提出了一种更高效的SIWMMSE方法,其中 precoder 更新是固定的。Abstract
In this work, we derive a lower bound on the training-based achievable downlink (DL) sum rate (SR) of a multi-user multiple-input-single-output (MISO) system operating in frequency-division-duplex (FDD) mode. Assuming linear minimum mean square error (LMMSE) channel estimation is used, we establish a connection of the derived lower bound on the signal-to-interference-noise-ratio (SINR) to an average MSE that allows to reformulate the SR maximization problem as the minimization of the augmented weighted average MSE (AWAMSE). We propose an iterative precoder design with three alternating steps, all given in closed form, drastically reducing the computation time. We show numerically the effectiveness of the proposed approach in challenging scenarios with limited channel knowledge, i.e., we consider scenarios with a very limited number of pilots. We additionally propose a more efficient version of the well-known stochastic iterative WMMSE (SIWMMSE) approach, where the precoder update is given in closed form.
A Mapping of Triangular Block Interleavers to DRAM for Optical Satellite Communication
paper_authors: Lukas Steiner, Timo Lehnigk-Emden, Markus Fehrenz, Norbert Wehn
for: 通过交织提高低地球轨道卫星通信系统光学下行链路数据传输的可靠性。
methods: 使用三角形块交织器,目标器件为符合JEDEC标准的DRAM设备。
results: 提出了一种新的映射方法,可以在所有测试配置中实现超过90%的带宽利用率,并且可以应用于任何符合JEDEC标准的DRAM设备。Abstract
Communication in optical downlinks of low earth orbit (LEO) satellites requires interleaving to enable reliable data transmission. These interleavers are orders of magnitude larger than conventional interleavers utilized for example in wireless communication. Hence, the capacity of on-chip memories (SRAMs) is insufficient to store all symbols and external memories (DRAMs) must be used. Due to the overall requirement for very high data rates beyond 100 Gbit/s, DRAM bandwidth then quickly becomes a critical bottleneck of the communication system. In this paper, we investigate triangular block interleavers for the aforementioned application and show that the standard mapping of symbols used for SRAMs results in low bandwidth utilization for DRAMs, in some cases below 50 %. As a solution, we present a novel mapping approach that combines different optimizations and achieves over 90 % bandwidth utilization in all tested configurations. Further, the mapping can be applied to any JEDEC-compliant DRAM device.
摘要
低地球轨道(LEO)卫星光学下行链路中的通信需要交织(interleaving)才能实现可靠的数据传输。这类交织器比无线通信等场景中使用的传统交织器大几个数量级。因此,片上存储器(SRAM)的容量不足以存储所有符号,必须使用外部存储器(DRAM)。由于总体上需要超过100 Gbit/s的极高数据速率,DRAM带宽很快成为通信系统的关键瓶颈。在这篇论文中,我们针对上述应用研究了三角形块交织器,并表明针对SRAM的标准符号映射会导致DRAM带宽利用率较低,某些情况下低于50%。作为解决方案,我们提出了一种结合多种优化的新映射方法,在所有测试配置中带宽利用率均超过90%。此外,该映射方法可以应用于任何符合JEDEC标准的DRAM设备。
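One common formulation of a triangular block interleaver, sketched below: symbols are written row by row into a triangle and read out column by column, which is what makes naive row-major DRAM layouts pay for the column-wise reads. The paper's exact interleaver variant and DRAM mapping are more involved.

```python
def triangular_interleave(symbols, rows):
    # row r holds (r + 1) symbols: 1 + 2 + ... + rows symbols in total
    tri, it = [], iter(symbols)
    for r in range(rows):
        tri.append([next(it) for _ in range(r + 1)])
    out = []
    for c in range(rows):                  # column-wise read-out
        for r in range(c, rows):
            out.append(tri[r][c])
    return out

n = 5
seq = list(range(n * (n + 1) // 2))        # 15 symbols for 5 rows
print(triangular_interleave(seq, n))
# [0, 1, 3, 6, 10, 2, 4, 7, 11, 5, 8, 12, 9, 13, 14]
```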
results: 论文的数据显示, despite its simplicity, a doubly 1-bit quantized massive MIMO system with very large antenna arrays can deliver an impressive performance in terms of MSE and symbol error rate.Abstract
Enabling communications in the (sub-)THz band will call for massive multiple-input multiple-output (MIMO) arrays at either the transmit- or receive-side, or at both. To scale down the complexity and power consumption when operating across massive frequency and antenna dimensions, a sacrifice in the resolution of the digital-to-analog/analog-to-digital converters (DACs/ADCs) will be inevitable. In this paper, we analyze the extreme scenario where both the transmit- and receive-side are equipped with fully digital massive MIMO arrays and 1-bit DACs/ADCs, which leads to a system with minimum radio-frequency complexity, cost, and power consumption. Building upon the Bussgang decomposition, we derive a tractable approximation of the mean squared error (MSE) between the transmitted data symbols and their soft estimates. Numerical results show that, despite its simplicity, a doubly 1-bit quantized massive MIMO system with very large antenna arrays can deliver an impressive performance in terms of MSE and symbol error rate.
摘要
要在(亚)太赫兹频段实现通信,发射侧或接收侧(或两侧)都需要大规模多输入多输出(MIMO)阵列。为了在巨大的频率和天线维度下降低复杂度和功耗,不可避免地要牺牲数模/模数转换器(DAC/ADC)的分辨率。在这篇论文中,我们分析了一种极端情形:发射侧和接收侧均配备全数字大规模MIMO阵列和1比特DAC/ADC,从而得到射频复杂度、成本和功耗都最低的系统。基于Bussgang分解,我们推导了发送数据符号与其软估计之间均方误差(MSE)的一个易处理的近似。数值结果表明,尽管结构简单,配备超大天线阵列的双端1比特量化大规模MIMO系统在MSE和符号错误率方面仍可提供令人印象深刻的性能。
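A Monte-Carlo sanity check of the doubly 1-bit setting is easy to set up: quantize both the transmit and the receive side to their sign and measure the residual distortion of a simple linear soft estimate. Dimensions, SNR, and the matched-filter estimator below are illustrative choices, not the paper's Bussgang-based analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tx, n_rx, snr = 64, 64, 10.0
H = (rng.normal(size=(n_rx, n_tx)) +
     1j * rng.normal(size=(n_rx, n_tx))) / np.sqrt(2)
s = ((rng.integers(0, 2, n_tx) * 2 - 1) +
     1j * (rng.integers(0, 2, n_tx) * 2 - 1)) / np.sqrt(2)   # QPSK symbols

q = lambda z: (np.sign(z.real) + 1j * np.sign(z.imag)) / np.sqrt(2)
x = q(s)                           # 1-bit DAC (QPSK passes through unchanged)
noise = (rng.normal(size=n_rx) + 1j * rng.normal(size=n_rx)) / np.sqrt(2 * snr)
y = q(H @ x + noise)               # 1-bit ADC at the receiver

s_hat = H.conj().T @ y             # crude matched-filter soft estimate
s_hat *= (s.conj() @ s_hat).real / np.linalg.norm(s_hat) ** 2  # scale fix
print("per-symbol MSE:", np.mean(np.abs(s_hat - s) ** 2))
```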
Learning Channel Capacity with Neural Mutual Information Estimator Based on Message Importance Measure
methods: 我们提出了一种协作框架,同时估计信道容量并设计最优码本。首先,我们将MIM-based GAN(一种以消息重要性度量(MIM)作为信息距离的新型生成对抗网络)应用于互信息估计,并提出了一种新方法,称为基于MIM的互信息估计器(MMIE)。然后,我们设计了一个通用协作框架:其中生成器被视为产生信道输入的编码器,而判别器则是评估生成器性能的互信息估计器。通过对抗训练,生成器自动学习最优码本,判别器则估计信道容量。
results: 数值实验表明,Compared with several conventional estimators, the MMIE achieves state-of-the-art performance in terms of accuracy and stability.Abstract
Channel capacity estimation plays a crucial role in beyond 5G intelligent communications. Despite its significance, this task is challenging for a majority of channels, especially for the complex channels not modeled as the well-known typical ones. Recently, neural networks have been used in mutual information estimation and optimization. They are particularly considered as efficient tools for learning channel capacity. In this paper, we propose a cooperative framework to simultaneously estimate channel capacity and design the optimal codebook. First, we will leverage MIM-based GAN, a novel form of generative adversarial network (GAN) using message importance measure (MIM) as the information distance, into mutual information estimation, and develop a novel method, named MIM-based mutual information estimator (MMIE). Then, we design a generalized cooperative framework for channel capacity learning, in which a generator is regarded as an encoder producing the channel input, while a discriminator is the mutual information estimator that assesses the performance of the generator. Through the adversarial training, the generator automatically learns the optimal codebook and the discriminator estimates the channel capacity. Numerical experiments will demonstrate that compared with several conventional estimators, the MMIE achieves state-of-the-art performance in terms of accuracy and stability.
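As a generic stand-in for a neural mutual-information estimator of this kind, the sketch below trains a small statistics network on the Donsker-Varadhan (MINE-style) lower bound for a correlated Gaussian pair; the MIM-based information distance of the MMIE is not reproduced, and the network size and toy channel are assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
stat_net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(stat_net.parameters(), lr=1e-3)

rho = 0.8                                  # correlated Gaussian "channel"
true_mi = -0.5 * float(torch.log(torch.tensor(1 - rho ** 2)))
for step in range(2000):
    x = torch.randn(512, 1)
    y = rho * x + (1 - rho ** 2) ** 0.5 * torch.randn(512, 1)
    t_joint = stat_net(torch.cat([x, y], dim=1)).squeeze(-1)
    t_marg = stat_net(torch.cat([x, y[torch.randperm(512)]], dim=1)).squeeze(-1)
    # Donsker-Varadhan lower bound on I(X; Y)
    dv = t_joint.mean() - (torch.logsumexp(t_marg, dim=0)
                           - torch.log(torch.tensor(512.0)))
    opt.zero_grad()
    (-dv).backward()                       # ascend the bound
    opt.step()
print(f"estimated MI ~ {dv.item():.3f} nats, ground truth = {true_mi:.3f} nats")
```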