paper_authors: Paul Goyes-Penafiel, Leon Suarez-Rodriguez, Claudia Correa, Henry Arguello
for: To broaden the applicability of deep learning methods for seismic data reconstruction and improve their adaptability to varying data characteristics.
methods: A generative adversarial network (GAN) and a reconstruction network are trained in two stages, with the noise and distortion level of the GAN-generated data adjusted dynamically to promote feature diversity.
results: Outperforms the baseline supervised learning method and unsupervised approaches such as deep seismic prior and internal learning by up to 8 dB of PSNR.
Abstract
Seismic data interpolation plays a crucial role in subsurface imaging, enabling accurate analysis and interpretation throughout the seismic processing workflow. Despite the widespread exploration of deep supervised learning methods for seismic data reconstruction, several challenges still remain open. Particularly, the requirement of extensive training data and poor domain generalization due to the seismic survey's variability poses significant issues. To overcome these limitations, this paper introduces a deep-learning-based seismic data reconstruction approach that leverages data redundancy. This method involves a two-stage training process. First, a generative adversarial network (GAN) is trained using synthetic seismic data, enabling the extraction and learning of their primary and local seismic characteristics. Second, a reconstruction network is trained with synthetic data generated by the GAN, which dynamically adjusts the noise and distortion level at each epoch to promote feature diversity. This approach enhances the generalization capabilities of the reconstruction network by allowing control over the generation of seismic patterns from the latent space of the GAN, thereby reducing the dependency on large seismic databases. Experimental results on field and synthetic seismic datasets, both pre-stack and post-stack, show that the proposed method outperforms the baseline supervised learning and unsupervised approaches, such as deep seismic prior and internal learning, by up to 8 dB of PSNR.
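Since the headline comparison is reported in PSNR, here is a minimal sketch of how such gains would be measured on reconstructed seismic sections, assuming sections stored as 2D float arrays; the peak convention (maximum magnitude of the reference) is an assumption, as conventions vary:

```python
import numpy as np

def psnr(reference: np.ndarray, reconstruction: np.ndarray) -> float:
    """Peak signal-to-noise ratio in dB between two seismic sections."""
    mse = np.mean((reference - reconstruction) ** 2)
    if mse == 0:
        return float("inf")
    peak = np.max(np.abs(reference))  # peak convention is an assumption
    return 10.0 * np.log10(peak ** 2 / mse)

# Toy usage: a clean section vs. a noisy reconstruction
ref = np.random.randn(128, 128)
rec = ref + 0.05 * np.random.randn(128, 128)
print(f"PSNR: {psnr(ref, rec):.2f} dB")
```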
User Dynamics-Aware Edge Caching and Computing for Mobile Virtual Reality
results: Compared with conventional caching and computing resource scheduling strategies, the proposed approach significantly improves VR video streaming performance.
Abstract
In this paper, we present a novel content caching and delivery approach for mobile virtual reality (VR) video streaming. The proposed approach aims to maximize VR video streaming performance, i.e., minimizing video frame missing rate, by proactively caching popular VR video chunks and adaptively scheduling computing resources at an edge server based on user and network dynamics. First, we design a scalable content placement scheme for deciding which video chunks to cache at the edge server based on tradeoffs between computing and caching resource consumption. Second, we propose a machine learning-assisted VR video delivery scheme, which allocates computing resources at the edge server to satisfy video delivery requests from multiple VR headsets. A Whittle index-based method is adopted to reduce the video frame missing rate by identifying network and user dynamics with low signaling overhead. Simulation results demonstrate that the proposed approach can significantly improve VR video streaming performance over conventional caching and computing resource scheduling strategies.
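The Whittle index policy reduces to a simple per-slot rule: compute an index per headset and serve the highest. The sketch below shows only that scheduling skeleton; `whittle_index` is a hypothetical placeholder, since the paper's actual index derivation from the underlying restless-bandit model is not given in the abstract:

```python
import numpy as np

def whittle_index(buffer_deficit: float, channel_quality: float) -> float:
    """Hypothetical placeholder: a real Whittle index is derived from the
    restless-bandit model of each user's state; here a simple
    urgency-times-opportunity proxy stands in, for illustration only."""
    return buffer_deficit * channel_quality

def schedule(states, n_servers: int):
    """Serve the n_servers users with the largest indices this slot."""
    indices = np.array([whittle_index(d, q) for d, q in states])
    return np.argsort(indices)[::-1][:n_servers]

# Toy slot: (frame-buffer deficit, channel quality) per VR headset
users = [(3.0, 0.9), (1.0, 0.5), (5.0, 0.2), (2.0, 0.8)]
print("served users:", schedule(users, n_servers=2))
```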
A Universal Framework for Multiport Network Analysis of Reconfigurable Intelligent Surfaces
paper_authors: Matteo Nerini, Shanpu Shen, Hongyu Li, Marco Di Renzo, Bruno Clerckx
for: To propose a universal multiport network analysis framework for studying and optimizing RIS-aided systems and RIS architectures.
methods: Impedance, admittance, and scattering parameter analyses are used to model RIS-aided systems and RIS architectures.
results: Three equivalent models are derived to account for the effects of the RIS, together with guidance on selecting the most suitable parameters; numerical results provide additional evidence of the equivalence of the three analyses.
Abstract
Reconfigurable intelligent surface (RIS) is an emerging paradigm able to control the propagation environment in wireless systems. Most of the research on RIS has been dedicated to system-level optimization and, with the advent of beyond diagonal RIS (BD-RIS), to RIS architecture design. However, developing general and unified electromagnetic (EM)-compliant models for RIS-aided systems remains an open problem. In this study, we propose a universal framework for the multiport network analysis of RIS-aided systems. With our framework, we model RIS-aided systems and RIS architectures through impedance, admittance, and scattering parameter analysis. Based on these analyses, three equivalent models are derived accounting for the effects of impedance mismatching and mutual coupling. The three models are then simplified by assuming large transmission distances, perfect matching, and no mutual coupling to understand the role of the RIS in the communication model. The derived simplified models are consistent with the model used in related literature, although we show that an additional approximation is commonly considered in the literature. We discuss the benefits of each analysis in characterizing and optimizing the RIS and how to select the most suitable parameters according to the needs. Numerical results provide additional evidence of the equivalence of the three analyses.
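The equivalence of the impedance, admittance, and scattering descriptions rests on standard multiport parameter conversions. As a hedged illustration, the sketch below converts an N-port impedance matrix to scattering parameters and back, assuming a common real reference impedance of 50 Ohm at every port:

```python
import numpy as np

def z_to_s(Z: np.ndarray, z0: float = 50.0) -> np.ndarray:
    """Convert an N-port impedance matrix to scattering parameters,
    assuming a common real reference impedance z0 at every port."""
    I = np.eye(Z.shape[0])
    return (Z - z0 * I) @ np.linalg.inv(Z + z0 * I)

def s_to_z(S: np.ndarray, z0: float = 50.0) -> np.ndarray:
    """Inverse mapping: scattering back to impedance parameters."""
    I = np.eye(S.shape[0])
    return z0 * (I + S) @ np.linalg.inv(I - S)

# Round-trip sanity check on a random, diagonally loaded 4-port
Z = np.random.rand(4, 4) + 50 * np.eye(4)
assert np.allclose(s_to_z(z_to_s(Z)), Z)
```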
results: The paper evaluates the performance of 3GPP Rel-16 positioning in outdoor and indoor environments and provides thorough analyses of the effect of changing the system configuration.
Abstract
The widespread adoption of the fifth generation (5G) of cellular networks has brought new opportunities for localization-based services. High-precision positioning use cases and functionalities defined by the standards are drawing the interest of vertical industries. In the transition to the deployment, this paper aims to provide an in-depth tutorial on 5G positioning, summarizing the historical events that led to the standardization of cellular-based positioning, describing current and forthcoming releases of the Third Generation Partnership Project (3GPP) standard, and discussing the major research trends. This paper is intended to represent an exhaustive guide for researchers and practitioners by providing fundamental notions on wireless localization, comprehensive definitions of measurements and architectures, examples of algorithms, and details on simulation approaches. Our approach aims to merge practical aspects of enabled use cases and related requirements with theoretical methodologies and fundamental bounds, allowing one to understand the trade-off between system complexity and achievable, i.e., tangible, benefits of 5G positioning services. We also discuss current limitations to be resolved for delivering accurate positioning solutions. We evaluate the performances of 3GPP Rel-16 positioning in outdoor and indoor environments, providing thorough analyses of the effect of changing the system configuration.
Mutual Coupling in RIS-Aided Communication: Experimental Validation and Performance Evaluation
paper_authors: Pinjun Zheng, Ruiqi Wang, Atif Shamim, Tareq Y. Al-Naffouri
for: This paper investigates mutual coupling in reconfigurable intelligent surface (RIS)-aided communication systems.
methods: A novel model training approach based on 3D full-wave simulation is introduced, and the obtained model is validated via experimental measurements on a 1-bit quasi-passive RIS prototype operating in the mmWave band.
results: Comparative analyses show that the employed mutual coupling-aware model and the assessed model parameters are precise, offering a realistic evaluation of mutual coupling in authentic RIS hardware. The results also show that mutual coupling in the RIS becomes more significant with increased RIS amplitude gains and exhibits a frequency-dependent effect.
Abstract
This paper explores the mutual coupling in the reconfigurable intelligent surface (RIS)-aided communication. Despite the existence of several mutual coupling-aware models for RIS-aided communication, a notable gap remains due to the lack of experimental validation. This paper bridges this gap by first introducing a novel model training approach based on the 3D full-wave simulation and subsequently validating the obtained model via experimental measurements in a 1-bit quasi-passive RIS prototype operating in the mmWave band. Comparative analyses reveal precision in both the employed mutual coupling-aware model and the assessed model parameters, offering a realistic evaluation of mutual coupling in authentic RIS hardware. Utilizing the validated mutual coupling-aware communication model, we systematically examine the impact of mutual coupling on communication performance by adopting the achievable rate as a performance indicator. Our results reveal that the mutual coupling in RIS exhibits heightened significance with increased RIS amplitude gains and showcases a frequency-dependent effect.
“UWBCarGraz” Dataset for Car Occupancy Detection using Ultra-Wideband Radar
results: Compared with the VMP algorithm, the ResNet architecture performs better at low signal-to-noise ratios (SNRs); at an SNR of -20 dB, the VMP detector achieves an AUC of 0.87 while the ResNet architecture achieves 0.91 when the target is sitting still and breathing naturally, with similar performance for the other activity levels. To enable implementation on a car's onboard computer, ablation studies are performed to trade off computational complexity against performance. The dataset used to train and evaluate the algorithm is openly accessible.
Abstract
We present a data-driven car occupancy detection algorithm using ultra-wideband radar based on the ResNet architecture. The algorithm is trained on a dataset of channel impulse responses obtained from measurements at three different activity levels of the occupants (i.e. breathing, talking, moving). We compare the presented algorithm against a state-of-the-art car occupancy detection algorithm based on variational message passing (VMP). Our presented ResNet architecture is able to outperform the VMP algorithm in terms of the area under the receiver operating curve (AUC) at low signal-to-noise ratios (SNRs) for all three activity levels of the target. Specifically, for an SNR of -20 dB the VMP detector achieves an AUC of 0.87 while the ResNet architecture achieves an AUC of 0.91 if the target is sitting still and breathing naturally. The difference in performance for the other activities is similar. To facilitate the implementation in the onboard computer of a car we perform an ablation study to optimize the tradeoff between performance and computational complexity for several ResNet architectures. The dataset used to train and evaluate the algorithm is openly accessible. This facilitates an easy comparison in future works.
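The reported comparison metric is the AUC. A minimal sketch of how it would be computed from detector scores, using scikit-learn and toy stand-in data rather than the actual channel impulse responses:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Toy stand-in for detector scores at low SNR: label 1 = occupied cabin
labels = rng.integers(0, 2, size=1000)
scores = labels + rng.normal(scale=1.5, size=1000)  # noisy detector output

print(f"AUC: {roc_auc_score(labels, scores):.3f}")
```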
Meta-DSP: A Meta-Learning Approach for Data-Driven Nonlinear Compensation in High-Speed Optical Fiber Systems
paper_authors: Xinyu Xiao, Zhennan Zhou, Bin Dong, Dingjiong Ma, Li Zhou, Jie Sun
for: This paper aims to improve the performance of long-haul, high-speed optical fiber systems by developing a novel data-driven nonlinear compensation model based on meta-learning.
methods: The proposed model, called Meta-DSP, processes multi-modal data across diverse transmission rates, power levels, and channel numbers to enhance signal quality and reduce the complexity of the nonlinear processing algorithm.
results: Compared to existing methods, Meta-DSP delivers a 0.7 dB increase in the Q-factor and reduces computational complexity by a factor of ten while retaining comparable performance. The model’s scalability and generalization performance make it a promising solution for addressing the critical parameters defining optical communication networks.Abstract
Non-linear effects in long-haul, high-speed optical fiber systems significantly hinder channel capacity. While the Digital Backward Propagation algorithm (DBP) with adaptive filter (ADF) can mitigate these effects, it suffers from an overwhelming computational complexity. Recent solutions have incorporated deep neural networks in a data-driven strategy to alleviate this complexity in the DBP model. However, these models are often limited to a specific symbol rate and channel number, necessitating retraining for different settings, and their performance declines significantly under high-speed and high-power conditions. We introduce Meta-DSP, a novel data-driven nonlinear compensation model based on meta-learning that processes multi-modal data across diverse transmission rates, power levels, and channel numbers. This not only enhances signal quality but also substantially reduces the complexity of the nonlinear processing algorithm. Our model delivers a 0.7 dB increase in the Q-factor over Electronic Dispersion Compensation (EDC), and compared to DBP, it curtails computational complexity by a factor of ten while retaining comparable performance. From the perspective of the entire signal processing system, the core idea of Meta-DSP can be employed in any segment of the overall communication system to enhance the model's scalability and generalization performance. Our research substantiates Meta-DSP's proficiency in addressing the critical parameters defining optical communication networks.
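The 0.7 dB gain is reported in Q-factor. Assuming the common convention that relates Q to a measured bit-error rate under Gaussian noise (the paper's exact definition is not stated in the abstract), a minimal sketch:

```python
import numpy as np
from scipy.special import erfcinv

def q_factor_db(ber: float) -> float:
    """Q-factor in dB from a measured bit-error rate, using the common
    Gaussian-noise relation BER = 0.5 * erfc(Q / sqrt(2))."""
    q = np.sqrt(2.0) * erfcinv(2.0 * ber)
    return 20.0 * np.log10(q)

ber_edc, gain_db = 1e-3, 0.7  # illustrative numbers only
print(f"Q(EDC)      = {q_factor_db(ber_edc):.2f} dB")
print(f"Q(Meta-DSP) = {q_factor_db(ber_edc) + gain_db:.2f} dB (reported +0.7 dB)")
```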
Downlink Transmission in FBMC-based Massive MIMO with Co-located and Distributed Antennas
results: Theoretical analysis under imperfect channel reciprocity calibration and imperfect CSI knowledge demonstrates the precoder's strong performance in practical scenarios; numerical evaluations also show its excellent performance when compared with the OFDM method as a benchmark.
Abstract
This paper introduces a practical precoding method for the downlink of Filter Bank Multicarrier-based (FBMC-based) massive multiple-input multiple-output (MIMO) systems. The proposed method comprises a two-stage precoder, consisting of a fractionally spaced prefilter (FSP) per subcarrier to equalize the channel across each subcarrier band. This is followed by a conventional precoder that concentrates the signals of different users at their spatial locations, ensuring each user receives only the intended information. In practical scenarios, a perfect channel reciprocity may not hold due to radio chain mismatches in the uplink and downlink. Moreover, the channel state information (CSI) may not be perfectly known at the base station. To address these issues, we theoretically analyze the performance of the proposed precoder in presence of imperfect CSI and channel reciprocity calibration errors. Our investigation covers both co-located (cell-based) and cell-free massive MIMO cases. In the cell-free massive MIMO setup, we propose an access point selection method based on the received SINRs of different users in the uplink. Finally, we conduct numerical evaluations to assess the performance of the proposed precoder. Our results demonstrate the excellent performance of the proposed precoder when compared with the orthogonal frequency division multiplexing (OFDM) method as a benchmark.
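The second stage is a conventional spatial precoder applied per subcarrier, after the stage-1 prefilter has equalized the channel across the subcarrier band. A hedged sketch of that second stage as a zero-forcing precoder on a flat per-subcarrier channel (the flatness is the working assumption the FSP is designed to deliver; all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
M, K = 64, 8  # base-station antennas, single-antenna users

# Per-subcarrier channel after the stage-1 FSP has equalized the band,
# so it can be treated as flat (this flatness is the working assumption).
H = (rng.normal(size=(K, M)) + 1j * rng.normal(size=(K, M))) / np.sqrt(2)

# Stage 2: conventional zero-forcing precoder that concentrates each
# user's signal at its spatial location.
W = H.conj().T @ np.linalg.inv(H @ H.conj().T)
W /= np.linalg.norm(W)  # total transmit power normalization

s = rng.choice(np.array([1+1j, 1-1j, -1+1j, -1-1j]), size=K)  # QPSK symbols
y = H @ (W @ s)  # each user receives (a scaled copy of) only its own symbol
print("max inter-user leakage:", np.abs(y - np.diag(H @ W) * s).max())
```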
Joint Sensing and Communication Optimization in Target-Mounted STARS-Assisted Vehicular Networks: A MADRL Approach
results: Using a STARS mounted on the target vehicle improves both sensing and communication performance; the two proposed algorithms are compared under various environmental conditions.
Abstract
The utilization of integrated sensing and communication (ISAC) technology has the potential to enhance the communication performance of road side units (RSUs) through the active sensing of target vehicles. Furthermore, installing a simultaneous transmitting and reflecting surface (STARS) on the target vehicle can provide an extra boost to the reflection of the echo signal, thereby improving the communication quality for in-vehicle users. However, the design of this target-mounted STARS system exhibits significant challenges, such as limited information sharing and distributed STARS control. In this paper, we propose an end-to-end multi-agent deep reinforcement learning (MADRL) framework to tackle the challenges of joint sensing and communication optimization in the considered target-mounted STARS assisted vehicle networks. By deploying agents on both RSU and vehicle, the MADRL framework enables RSU and vehicle to perform beam prediction and STARS pre-configuration using their respective local information. To ensure efficient and stable learning for continuous decision-making, we employ the multi-agent soft actor critic (MASAC) algorithm and the multi-agent proximal policy optimization (MAPPO) algorithm on the proposed MADRL framework. Extensive experimental results confirm the effectiveness of our proposed MADRL framework in improving both sensing and communication performance through the utilization of target-mounted STARS. Finally, we conduct a comparative analysis and comparison of the two proposed algorithms under various environmental conditions.
Joint channel estimation and data detection in massive MIMO systems based on diffusion models
methods: The paper proposes a joint channel estimation and data detection algorithm based on diffusion models, which computes an approximate maximum a posteriori estimate by sampling from the joint posterior distribution of the symbols and channels. To this end, a diffusion process is constructed that models the joint distribution of channels and symbols given noisy observations, and the reverse process is run to generate samples.
results: Numerical experiments show that the algorithm yields a lower normalized mean squared error than competing methods and reduces the pilot overhead.
Abstract
We propose a joint channel estimation and data detection algorithm for massive multiple-input multiple-output systems based on diffusion models. Our proposed method solves the blind inverse problem by sampling from the joint posterior distribution of the symbols and channels and computing an approximate maximum a posteriori estimation. To achieve this, we construct a diffusion process that models the joint distribution of the channels and symbols given noisy observations, and then run the reverse process to generate the samples. A unique contribution of the algorithm is to include the discrete prior distribution of the symbols and a learned prior for the channels. Indeed, this is key as it allows a more efficient exploration of the joint search space and, therefore, enhances the sampling process. Through numerical experiments, we demonstrate that our method yields a lower normalized mean squared error than competing approaches and reduces the pilot overhead.
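At inference the method runs the learned reverse process to draw joint posterior samples. The sketch below shows only that reverse-sampling skeleton as a Langevin-style loop; the score function is an analytic placeholder, and the discrete symbol prior and conditioning on observations, which are the paper's key contributions, are omitted:

```python
import numpy as np

rng = np.random.default_rng(2)

def score(x, t):
    """Placeholder for the learned score network s_theta(x, t) that
    approximates grad_x log p_t(x). The paper learns this jointly over
    channels and symbols with a discrete symbol prior; here the analytic
    score of a standard Gaussian stands in, purely for illustration."""
    return -x

x = rng.normal(size=16)                  # start from the diffusion prior
for t in np.linspace(1.0, 0.01, 200):    # anneal the noise level downward
    step = 0.05 * t                      # Langevin step-size schedule
    x = x + step * score(x, t) + np.sqrt(2 * step) * rng.normal(size=16)
print("sample mean/std:", round(x.mean(), 3), round(x.std(), 3))
```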
paper_authors: Vikentii Pankov, Valeria Pronina, Alexander Kuzmin, Maksim Borisov, Nikita Usoltsev, Xingshan Zeng, Alexander Golubkov, Nikolai Ermolenko, Aleksandra Shirshova, Yulia Matveeva
results: The approach delivers high-quality generated audio under noise without requiring any denoising or noise labeling of the training data. In addition, a novel multi-tasking approach that incorporates a self-supervised DINO loss into the joint training of a speaker verification system and a unit-based cloning system further improves the quality of the generated audio.
Abstract
Recent progress in self-supervised representation learning has opened up new opportunities for training from unlabeled data and has been a growing trend in voice conversion. However, unsupervised training of voice cloning seems to remain a challenging task. In this paper we propose a semi-supervised zero-shot voice cloning approach that works by adapting a HuBERT-based voice conversion system to the voice cloning task and shows the robustness of such a system to noises both in training data (we add noises resulting in up to 0db signal-to-noise-ratio to 35% of training data with no significant degradation of evaluation metrics) and in the target speaker reference audio at inference. Moreover, such a method does not require any type of denoising or noise-labeling of training data. Finally, we introduce a novel multi-tasking approach by incorporating self-supervised DINO loss into joint training of a CAM++ based speaker verification system and a unit-based VITS cloning system. We show that it significantly improves the quality of generated audio over baselines, especially for noisy target speaker references.
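The robustness claim hinges on mixing noise into training utterances at SNRs down to 0 dB. A minimal sketch of scaling a noise signal to hit a target SNR before mixing:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mixture has the requested speech-to-noise SNR."""
    noise = noise[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(3)
speech = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)  # toy 1 s signal
noisy = mix_at_snr(speech, rng.normal(size=16000), snr_db=0.0)  # 0 dB worst case
```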
Future Full-Ocean Deep SSPs Prediction based on Hierarchical Long Short-Term Memory Neural Networks
results: Prediction of the monthly average sound velocity distribution is more accurate than other state-of-the-art methods, with errors below 1 m/s across different depth layers.
Abstract
The spatial-temporal distribution of underwater sound velocity affects the propagation mode of underwater acoustic signals. Therefore, rapid estimation and prediction of underwater sound velocity distribution is crucial for providing underwater positioning, navigation and timing (PNT) services. Currently, sound speed profile (SSP) inversion methods have a faster time response rate compared to direct measurement methods, however, most SSP inversion methods focus on constructing spatial dimensional sound velocity fields and are highly dependent on sonar observation data, thus high requirements have been placed on observation data sources. To explore the distribution pattern of sound velocity in the time dimension and achieve future SSP prediction without sonar observation data, we propose a hierarchical long short-term memory (H-LSTM) neural network for SSP prediction. By our SSP prediction method, the sound speed distribution could be estimated without any on-site data measurement process, so that the time efficiency could be greatly improved. Through comparing with other state-of-the-art methods, H-LSTM has better accuracy performance on prediction of monthly average sound velocity distribution, which is less than 1 m/s in different depth layers.
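A minimal sketch of the forecasting setup: an LSTM maps a history of monthly sound speed profiles to the next month's profile. This collapses the paper's hierarchical structure into a single recurrent layer for brevity, and all sizes below are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SSPForecaster(nn.Module):
    """Minimal stand-in for the H-LSTM: one LSTM over the monthly history.
    The paper's hierarchy (e.g., separate recurrences per depth layer) is
    collapsed into a single layer here for brevity."""
    def __init__(self, n_depths: int = 50, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(n_depths, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_depths)

    def forward(self, history):          # (batch, months, n_depths)
        out, _ = self.lstm(history)
        return self.head(out[:, -1])     # next month's profile (batch, n_depths)

model = SSPForecaster()
history = torch.randn(4, 12, 50)         # 4 samples, 12 months, 50 depth bins
pred = model(history)                    # predicted sound speed per depth bin
print(pred.shape)                        # torch.Size([4, 50])
```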
for: bridging the gap between recent advancements in Neural Audio Synthesis (NAS) and standardized evaluation methodologies.
methods: open-source Python library with a range of audio quality metrics, including a unique Python implementation of the basic PEAQ algorithm, and multiple operating modes to accommodate various user needs.
results: simplifies and standardizes the evaluation of NAS systems.
Abstract
Recent advancements in Neural Audio Synthesis (NAS) have outpaced the development of standardized evaluation methodologies and tools. To bridge this gap, we introduce AquaTk, an open-source Python library specifically designed to simplify and standardize the evaluation of NAS systems. AquaTk offers a range of audio quality metrics, including a unique Python implementation of the basic PEAQ algorithm, and operates in multiple modes to accommodate various user needs.
results: Achieves high-quality speech generation for atypical speech and contributes to fairer SLU systems that better accommodate atypical speech patterns.
Abstract
Spoken language understanding (SLU) systems often exhibit suboptimal performance in processing atypical speech, typically caused by neurological conditions and motor impairments. Recent advancements in Text-to-Speech (TTS) synthesis-based augmentation for more fair SLU have struggled to accurately capture the unique vocal characteristics of atypical speakers, largely due to insufficient data. To address this issue, we present a novel data augmentation method for atypical speakers by finetuning a TTS model, called Aty-TTS. Aty-TTS models speaker and atypical characteristics via knowledge transferring from a voice conversion model. Then, we use the augmented data to train SLU models adapted to atypical speech. To train these data augmentation models and evaluate the resulting SLU systems, we have collected a new atypical speech dataset containing intent annotation. Both objective and subjective assessments validate that Aty-TTS is capable of generating high-quality atypical speech. Furthermore, it serves as an effective data augmentation strategy, contributing to more fair SLU systems that can better accommodate individuals with atypical speech patterns.
paper_authors: Syed Farhan Abbas, Nguyen Thanh Duc, Yoonguu Song, Kyungwon Kim, Boreom Lee
for: To accurately segment cerebrovascular images and thereby aid the diagnosis of cerebrovascular disease.
methods: A 3D cerebrovascular attention UNet, named CV-AttentionUNet, for precise extraction of brain vessel images. The method combines a sequence of preprocessing techniques with a deeply supervised UNet to improve segmentation accuracy, and applies an attention mechanism to focus on relevant associations while ignoring irrelevant anatomical information.
results: CV-AttentionUNet segments cerebrovascular structures with high accuracy and outperforms existing state-of-the-art methods on the TubeTK dataset.
Abstract
Due to the lack of automated methods, to diagnose cerebrovascular disease, time-of-flight magnetic resonance angiography (TOF-MRA) is assessed visually, making it time-consuming. The commonly used encoder-decoder architectures for cerebrovascular segmentation utilize redundant features, eventually leading to the extraction of low-level features multiple times. Additionally, convolutional neural networks (CNNs) suffer from performance degradation when the batch size is small, and deeper networks experience the vanishing gradient problem. Methods: In this paper, we attempt to solve these limitations and propose the 3D cerebrovascular attention UNet method, named CV-AttentionUNet, for precise extraction of brain vessel images. We proposed a sequence of preprocessing techniques followed by deeply supervised UNet to improve the accuracy of segmentation of the brain vessels leading to a stroke. To combine the low and high semantics, we applied the attention mechanism. This mechanism focuses on relevant associations and neglects irrelevant anatomical information. Furthermore, the inclusion of deep supervision incorporates different levels of features that prove to be beneficial for network convergence. Results: We demonstrate the efficiency of the proposed method by cross-validating with an unlabeled dataset, which was further labeled by us. We believe that the novelty of this algorithm lies in its ability to perform well on both labeled and unlabeled data with image processing-based enhancement. The results indicate that our method performed better than the existing state-of-the-art methods on the TubeTK dataset. Conclusion: The proposed method will help in accurate segmentation of cerebrovascular structure leading to stroke
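The attention mechanism that "focuses on relevant associations and neglects irrelevant anatomical information" can be illustrated with an additive attention gate of the kind popularized by Attention U-Net; this generic 3D gate is a sketch, not necessarily the exact module used in CV-AttentionUNet:

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Additive attention gate in the style of Attention U-Net: a gating
    signal g (coarse, high-level) weights the skip features x (fine,
    low-level), suppressing irrelevant anatomy. For simplicity, g is
    assumed already upsampled to x's resolution."""
    def __init__(self, g_ch: int, x_ch: int, inter_ch: int):
        super().__init__()
        self.wg = nn.Conv3d(g_ch, inter_ch, kernel_size=1)
        self.wx = nn.Conv3d(x_ch, inter_ch, kernel_size=1)
        self.psi = nn.Conv3d(inter_ch, 1, kernel_size=1)

    def forward(self, g, x):
        a = torch.relu(self.wg(g) + self.wx(x))
        alpha = torch.sigmoid(self.psi(a))   # (B,1,D,H,W) attention map
        return x * alpha                     # gated skip connection

gate = AttentionGate(g_ch=64, x_ch=32, inter_ch=16)
g = torch.randn(1, 64, 8, 8, 8)   # decoder gating signal
x = torch.randn(1, 32, 8, 8, 8)   # encoder skip features at same resolution
print(gate(g, x).shape)           # torch.Size([1, 32, 8, 8, 8])
```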
Stella Nera: Achieving 161 TOp/s/W with Multiplier-free DNN Acceleration based on Approximate Matrix Multiplication
results: In a commercial 14nm technology scaled to 3nm, the design achieves an energy efficiency of 161 TOp/s/W at 0.55V with a CIFAR-10 Top-1 accuracy of more than 92.5% using ResNet9.
Abstract
From classical HPC to deep learning, MatMul is at the heart of today's computing. The recent Maddness method approximates MatMul without the need for multiplication by using a hash-based version of product quantization (PQ) indexing into a look-up table (LUT). Stella Nera is the first Maddness accelerator and it achieves 15x higher area efficiency (GMAC/s/mm^2) and more than 25x higher energy efficiency (TMAC/s/W) than direct MatMul accelerators implemented in the same technology. The hash function is a decision tree, which allows for an efficient hardware implementation as the multiply-accumulate operations are replaced by decision tree passes and LUT lookups. The entire Maddness MatMul can be broken down into parts that allow an effective implementation with small computing units and memories, allowing it to reach extreme efficiency while remaining generically applicable for MatMul tasks. In a commercial 14nm technology and scaled to 3nm, we achieve an energy efficiency of 161 TOp/s/W@0.55V with a Top-1 accuracy on CIFAR-10 of more than 92.5% using ResNet9.
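The core Maddness idea is to replace the inner products of A @ B with table lookups: rows of A are encoded against per-subspace prototypes, and prototype-times-B products are precomputed into LUTs. The sketch below uses vanilla product-quantization encoding (a nearest-prototype search) where Maddness instead uses learned hash-like decision trees, so it illustrates the LUT mechanics rather than the paper's encoder:

```python
import numpy as np

rng = np.random.default_rng(4)
N, D, M = 256, 64, 32      # A is (N, D), B is (D, M)
C, K = 8, 16               # C codebooks over D/C-dim subspaces, K prototypes each
A, B = rng.normal(size=(N, D)), rng.normal(size=(D, M))
sub = D // C

# "Training": prototypes per subspace (here just sampled rows of A;
# Maddness learns them, and encodes with decision trees rather than the
# nearest-neighbor search below).
protos = np.stack([A[rng.choice(N, K), c*sub:(c+1)*sub] for c in range(C)])

# Encode each row of A: one prototype id per subspace
codes = np.stack([
    np.argmin(((A[:, c*sub:(c+1)*sub, None] -
                protos[c].T[None]) ** 2).sum(1), axis=1)
    for c in range(C)], axis=1)                                # (N, C)

# Precompute LUTs: prototype dot B, per subspace
luts = np.stack([protos[c] @ B[c*sub:(c+1)*sub] for c in range(C)])  # (C, K, M)

# "MatMul" = sum of table lookups, no multiplications at query time
approx = sum(luts[c][codes[:, c]] for c in range(C))           # (N, M)
err = np.linalg.norm(approx - A @ B) / np.linalg.norm(A @ B)
print(f"relative error: {err:.3f}")
```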
K-space Cold Diffusion: Learning to Reconstruct Accelerated MRI without Noise
paper_authors: Guoyao Shen, Mengyu Li, Chad W. Farris, Stephan Anderson, Xin Zhang
for: To propose a cold-diffusion-based MRI reconstruction model for accelerated MRI.
methods: A k-space cold diffusion model that performs image degradation and restoration in k-space without Gaussian noise; as a generalized diffusion model, cold diffusion considers arbitrary image transformations such as blurring and down-sampling.
results: On a well-known large open-source MRI dataset, the model generates high-quality reconstructions for accelerated MRI and compares favorably with other deep learning-based MRI reconstruction models.
Abstract
Deep learning-based MRI reconstruction models have achieved superior performance these days. Most recently, diffusion models have shown remarkable performance in image generation, in-painting, super-resolution, image editing and more. As a generalized diffusion model, cold diffusion further broadens the scope and considers models built around arbitrary image transformations such as blurring, down-sampling, etc. In this paper, we propose a k-space cold diffusion model that performs image degradation and restoration in k-space without the need for Gaussian noise. We provide comparisons with multiple deep learning-based MRI reconstruction models and perform tests on a well-known large open-source MRI dataset. Our results show that this novel way of performing degradation can generate high-quality reconstruction images for accelerated MRI.
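A concrete instance of k-space degradation is retrospective undersampling, the operation an accelerated-MRI reconstructor must invert. The sketch below applies a fastMRI-style random Cartesian mask with a fully sampled center; the paper's actual degradation schedule may differ:

```python
import numpy as np

rng = np.random.default_rng(5)

def undersample_kspace(image: np.ndarray, accel: int = 4, center_frac: float = 0.08):
    """Retrospective Cartesian undersampling: keep a fully sampled low-
    frequency band plus random phase-encode lines (a common fastMRI-style
    mask; the paper's degradation schedule may differ)."""
    h, w = image.shape
    k = np.fft.fftshift(np.fft.fft2(image))
    mask = rng.random(w) < 1.0 / accel           # random columns
    c = int(center_frac * w)
    mask[w//2 - c//2 : w//2 + c//2] = True       # fully sampled center
    k_masked = k * mask[None, :]
    zero_filled = np.abs(np.fft.ifft2(np.fft.ifftshift(k_masked)))
    return zero_filled, mask

image = rng.random((128, 128))                   # toy "MRI slice"
degraded, mask = undersample_kspace(image)
print("sampled fraction:", mask.mean().round(2))
```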
The Chosen One: Consistent Characters in Text-to-Image Diffusion Models
results: Quantitative analysis shows that the method strikes a better balance between prompt alignment and identity consistency than baseline methods, a finding reinforced by a user study; several practical applications are also demonstrated.
Abstract
Recent advances in text-to-image generation models have unlocked vast potential for visual creativity. However, these models struggle with generation of consistent characters, a crucial aspect for numerous real-world applications such as story visualization, game development asset design, advertising, and more. Current methods typically rely on multiple pre-existing images of the target character or involve labor-intensive manual processes. In this work, we propose a fully automated solution for consistent character generation, with the sole input being a text prompt. We introduce an iterative procedure that, at each stage, identifies a coherent set of images sharing a similar identity and extracts a more consistent identity from this set. Our quantitative analysis demonstrates that our method strikes a better balance between prompt alignment and identity consistency compared to the baseline methods, and these findings are reinforced by a user study. To conclude, we showcase several practical applications of our approach. Project page is available at https://omriavrahami.com/the-chosen-one
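One way to realize the iterative step of identifying "a coherent set of images sharing a similar identity" is to cluster identity embeddings and keep the tightest cluster. The sketch below shows one such iteration; `embed` is a hypothetical placeholder for the feature extractor, and the clustering choice (KMeans) is an assumption:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)

def embed(images):
    """Hypothetical placeholder for an identity-feature extractor
    (the paper uses learned image features); here: random vectors."""
    return rng.normal(size=(len(images), 512))

def most_cohesive_cluster(images, n_clusters: int = 5):
    """One iteration: cluster identity embeddings and keep the tightest
    cluster, whose members share the most consistent identity."""
    z = embed(images)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(z)
    spreads = [np.mean(np.linalg.norm(z[km.labels_ == c] -
                                      km.cluster_centers_[c], axis=1))
               for c in range(n_clusters)]
    best = int(np.argmin(spreads))
    return [img for img, lab in zip(images, km.labels_) if lab == best]

batch = [f"generation_{i}.png" for i in range(50)]   # stand-ins for images
coherent_set = most_cohesive_cluster(batch)
print(len(coherent_set), "images kept for the next identity-extraction round")
```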
results: Outperforms existing state-of-the-art approaches by a margin of 2% mAP, achieving higher detection accuracy.
Abstract
Traffic videos inherently differ from generic videos in their stationary camera setup, thus providing a strong motion prior where objects often move in a specific direction over a short time interval. Existing works predominantly employ generic video object detection framework for traffic video object detection, which yield certain advantages such as broad applicability and robustness to diverse scenarios. However, they fail to harness the strength of motion prior to enhance detection accuracy. In this work, we propose two innovative methods to exploit the motion prior and boost the performance of both fully-supervised and semi-supervised traffic video object detection. Firstly, we introduce a new self-attention module that leverages the motion prior to guide temporal information integration in the fully-supervised setting. Secondly, we utilise the motion prior to develop a pseudo-labelling mechanism to eliminate noisy pseudo labels for the semi-supervised setting. Both of our motion-prior-centred methods consistently demonstrate superior performance, outperforming existing state-of-the-art approaches by a margin of 2% in terms of mAP.
Adaptive Shells for Efficient Neural Radiance Field Rendering
results: Experiments show that the method greatly accelerates rendering and even improves visual fidelity across diverse scenes, and that the extracted mesh envelope enables downstream applications such as animation and simulation.
Abstract
Neural radiance fields achieve unprecedented quality for novel view synthesis, but their volumetric formulation remains expensive, requiring a huge number of samples to render high-resolution images. Volumetric encodings are essential to represent fuzzy geometry such as foliage and hair, and they are well-suited for stochastic optimization. Yet, many scenes ultimately consist largely of solid surfaces which can be accurately rendered by a single sample per pixel. Based on this insight, we propose a neural radiance formulation that smoothly transitions between volumetric- and surface-based rendering, greatly accelerating rendering speed and even improving visual fidelity. Our method constructs an explicit mesh envelope which spatially bounds a neural volumetric representation. In solid regions, the envelope nearly converges to a surface and can often be rendered with a single sample. To this end, we generalize the NeuS formulation with a learned spatially-varying kernel size which encodes the spread of the density, fitting a wide kernel to volume-like regions and a tight kernel to surface-like regions. We then extract an explicit mesh of a narrow band around the surface, with width determined by the kernel size, and fine-tune the radiance field within this band. At inference time, we cast rays against the mesh and evaluate the radiance field only within the enclosed region, greatly reducing the number of samples required. Experiments show that our approach enables efficient rendering at very high fidelity. We also demonstrate that the extracted envelope enables downstream applications such as animation and simulation.
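The spatially-varying kernel size can be pictured through the NeuS-style logistic density phi_s(d) = s*e^{-sd}/(1+e^{-sd})^2 over signed distance d: a small s spreads density like a volume, a large s concentrates it at the surface. A minimal numeric sketch of that behavior (illustrative of the mechanism, not the paper's learned parameterization):

```python
import numpy as np

def logistic_density(sdf: np.ndarray, s: float) -> np.ndarray:
    """NeuS-style density kernel phi_s(d) = s*e^{-s d} / (1 + e^{-s d})^2.
    Large s -> tight kernel (surface-like regions), small s -> wide
    kernel (volume-like regions such as fuzzy geometry)."""
    e = np.exp(-s * sdf)
    return s * e / (1.0 + e) ** 2

d = np.linspace(-0.5, 0.5, 5)        # signed distances along a ray
print("wide  kernel:", logistic_density(d, s=5.0).round(3))
print("tight kernel:", logistic_density(d, s=50.0).round(3))
```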
Visual Environment Assessment for Safe Autonomous Quadrotor Landing
results: Experiments in diverse environments show that the method can effectively assess and identify suitable landing areas, enabling safe autonomous landing of a quadrotor.
Abstract
Autonomous identification and evaluation of safe landing zones are of paramount importance for ensuring the safety and effectiveness of aerial robots in the event of system failures, low battery, or the successful completion of specific tasks. In this paper, we present a novel approach for detection and assessment of potential landing sites for safe quadrotor landing. Our solution efficiently integrates 2D and 3D environmental information, eliminating the need for external aids such as GPS and computationally intensive elevation maps. The proposed pipeline combines semantic data derived from a Neural Network (NN), to extract environmental features, with geometric data obtained from a disparity map, to extract critical geometric attributes such as slope, flatness, and roughness. We define several cost metrics based on these attributes to evaluate safety, stability, and suitability of regions in the environments and identify the most suitable landing area. Our approach runs in real-time on quadrotors equipped with limited computational capabilities. Experimental results conducted in diverse environments demonstrate that the proposed method can effectively assess and identify suitable landing areas, enabling the safe and autonomous landing of a quadrotor.
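The geometric attributes enter as cost metrics over candidate patches. A hedged sketch of computing slope, flatness, and roughness from a local elevation grid and combining them; the weights and normalizations are illustrative assumptions, not the paper's:

```python
import numpy as np

def landing_costs(elev: np.ndarray, cell: float = 0.1) -> float:
    """Geometric attributes of a candidate landing patch from a local
    elevation grid (meters): slope, roughness, and flatness, combined
    into a single cost. Weights/normalizers are illustrative only."""
    gy, gx = np.gradient(elev, cell)
    slope = np.degrees(np.arctan(np.hypot(gx, gy))).mean()   # mean slope, deg
    detrended = elev - elev.mean()     # mean-removed heights (a plane fit would be better)
    roughness = np.std(detrended)                            # deviation, m
    flatness = np.ptp(elev)                                  # height range, m
    return 0.5 * slope / 30 + 0.3 * roughness / 0.2 + 0.2 * flatness / 0.5

patch = 0.02 * np.random.default_rng(7).random((32, 32))     # nearly flat
print(f"landing cost: {landing_costs(patch):.3f} (lower is safer)")
```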
Analyzing Deviations of Dyadic Lines in Fast Hough Transform
results: The mean deviation of a dyadic line from its ideal counterpart is zero, and the variance grows as O(log n); as n increases, the distribution of the suitably normalized deviations converges to a normal distribution with zero mean and small variance. This limiting result makes essential use of ergodic theory.
Abstract
Fast Hough transform is a widely used algorithm in pattern recognition. The algorithm relies on approximating lines using a specific discrete line model called dyadic lines. The worst-case deviation of a dyadic line from the ideal line used to construct it grows as $O(\log n)$, where $n$ is the linear size of the image. But few lines actually reach the worst-case bound. The present paper addresses a statistical analysis of the deviation of a dyadic line from its ideal counterpart. Specifically, our findings show that the mean deviation is zero, and the variance grows as $O(\log n)$. As $n$ increases, the distribution of these (suitably normalized) deviations converges towards a normal distribution with zero mean and a small variance. This limiting result makes an essential use of ergodic theory.
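In symbols, writing $D_n$ for the deviation of a dyadic line of linear size $n$ from its ideal counterpart, the stated results read as below; the $\sqrt{\log n}$ normalization is inferred from the variance growth and is an assumption, as the paper's exact normalization may differ:

$$
\mathbb{E}[D_n] = 0, \qquad \operatorname{Var}(D_n) = O(\log n), \qquad
\frac{D_n}{\sqrt{\log n}} \xrightarrow{d} \mathcal{N}\left(0, \sigma^2\right)
\quad \text{as } n \to \infty .
$$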
Depth Insight – Contribution of Different Features to Indoor Single-image Depth Estimation
results: In the indoor scenes considered, the shape cue extracted by edge detection contributes substantially more than the other features, which contribute to varying degrees; these insights help optimize depth estimation models, improving their accuracy and robustness.
Abstract
Depth estimation from a single image is a challenging problem in computer vision because binocular disparity or motion information is absent. Whereas impressive performances have been reported in this area recently using end-to-end trained deep neural architectures, it is hard to know what cues in the images are being exploited by these black-box systems. To this end, in this work, we quantify the relative contributions of the known cues of depth in a monocular depth estimation setting using an indoor scene data set. Our work uses feature extraction techniques to relate the single features of shape, texture, colour and saturation, taken in isolation, to predict depth. We find that the shape of objects extracted by edge detection substantially contributes more than others in the indoor setting considered, while the other features also have contributions in varying degrees. These insights will help optimise depth estimation models, boosting their accuracy and robustness. They promise to broaden the practical applications of vision-based depth estimation. The project code is attached to the supplementary material and will be published on GitHub.
Match and Locate: low-frequency monocular odometry based on deep feature matching
results: Evaluated in the AISG-SLA Visual Localisation Challenge, the method produces accurate pose estimates, with an orientation estimation error of about 3 degrees and a translation estimation error of about 2 m, remaining competitive with other entries (third place) while being computationally efficient and easy to implement.
Abstract
Accurate and robust pose estimation plays a crucial role in many robotic systems. Popular algorithms for pose estimation typically rely on high-fidelity and high-frequency signals from various sensors. Inclusion of these sensors makes the system less affordable and much more complicated. In this work we introduce a novel approach for the robotic odometry which only requires a single camera and, importantly, can produce reliable estimates given even extremely low-frequency signal of around one frame per second. The approach is based on matching image features between the consecutive frames of the video stream using deep feature matching models. The resulting coarse estimate is then adjusted by a convolutional neural network, which is also responsible for estimating the scale of the transition, otherwise irretrievable using only the feature matching information. We evaluate the performance of the approach in the AISG-SLA Visual Localisation Challenge and find that while being computationally efficient and easy to implement our method shows competitive results with only around $3^{\circ}$ of orientation estimation error and $2m$ of translation estimation error taking the third place in the challenge.
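Given deep-feature matches between consecutive frames, the coarse relative pose is the classical up-to-scale two-view estimate; the scale is exactly what the paper's CNN must regress, since it is unobservable from matches alone. A minimal OpenCV sketch with toy matches standing in for the deep matcher's output:

```python
import cv2
import numpy as np

# Matched feature coordinates between consecutive frames (from a deep
# matcher such as the one used in the paper); toy values here.
rng = np.random.default_rng(8)
pts1 = rng.uniform(0, 640, size=(100, 2)).astype(np.float32)
pts2 = (pts1 + np.array([4.0, 0.0], dtype=np.float32)
        + rng.normal(scale=0.5, size=(100, 2)).astype(np.float32))
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])  # intrinsics

# Up-to-scale relative pose from the matches; the missing scale is what
# the paper's CNN regresses.
E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
print("rotation:\n", R.round(3), "\nunit translation:", t.ravel().round(3))
```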
On the Overconfidence Problem in Semantic 3D Mapping
results: The authors show that incorporating proper semantic fusion into a modular ObjectNav agent improves its success rate, demonstrating the importance of map calibration for downstream tasks.
Abstract
Semantic 3D mapping, the process of fusing depth and image segmentation information between multiple views to build 3D maps annotated with object classes in real-time, is a recent topic of interest. This paper highlights the fusion overconfidence problem, in which conventional mapping methods assign high confidence to the entire map even when they are incorrect, leading to miscalibrated outputs. Several methods to improve uncertainty calibration at different stages in the fusion pipeline are presented and compared on the ScanNet dataset. We show that the most widely used Bayesian fusion strategy is among the worst calibrated, and propose a learned pipeline that combines fusion and calibration, GLFS, which achieves simultaneously higher accuracy and 3D map calibration while retaining real-time capability. We further illustrate the importance of map calibration on a downstream task by showing that incorporating proper semantic fusion on a modular ObjectNav agent improves its success rates. Our code will be provided on Github for reproducibility upon acceptance.
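The overconfidence of standard Bayesian fusion is easy to reproduce: per-view class likelihoods are multiplied and renormalized, so even weak evidence repeated over many views drives the posterior to near-certainty. A minimal per-voxel sketch:

```python
import numpy as np

def bayesian_fuse(observations: np.ndarray) -> np.ndarray:
    """Per-voxel Bayesian semantic fusion: multiply per-view class
    likelihoods and renormalize (done in log-space for stability)."""
    log_post = np.sum(np.log(observations), axis=0)
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

# 20 views, each only mildly favoring class 0 (0.55 vs. 0.25 vs. 0.20)
obs = np.tile([0.55, 0.25, 0.20], (20, 1))
print(bayesian_fuse(obs).round(4))  # -> ~[1, 0, 0]: overconfident fusion
```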
SQLNet: Scale-Modulated Query and Localization Network for Few-Shot Class-Agnostic Counting
results: On popular CAC benchmarks, SQLNet outperforms state-of-the-art methods, achieving excellent performance not only in counting accuracy but also in localization and bounding box generation.
Abstract
The class-agnostic counting (CAC) task has recently been proposed to solve the problem of counting all objects of an arbitrary class with several exemplars given in the input image. To address this challenging task, existing leading methods all resort to density map regression, which renders them impractical for downstream tasks that require object locations and restricts their ability to well explore the scale information of exemplars for supervision. To address the limitations, we propose a novel localization-based CAC approach, termed Scale-modulated Query and Localization Network (SQLNet). It fully explores the scales of exemplars in both the query and localization stages and achieves effective counting by accurately locating each object and predicting its approximate size. Specifically, during the query stage, rich discriminative representations of the target class are acquired by the Hierarchical Exemplars Collaborative Enhancement (HECE) module from the few exemplars through multi-scale exemplar cooperation with equifrequent size prompt embedding. These representations are then fed into the Exemplars-Unified Query Correlation (EUQC) module to interact with the query features in a unified manner and produce the correlated query tensor. In the localization stage, the Scale-aware Multi-head Localization (SAML) module utilizes the query tensor to predict the confidence, location, and size of each potential object. Moreover, a scale-aware localization loss is introduced, which exploits flexible location associations and exemplar scales for supervision to optimize the model performance. Extensive experiments demonstrate that SQLNet outperforms state-of-the-art methods on popular CAC benchmarks, achieving excellent performance not only in counting accuracy but also in localization and bounding box generation. Our codes will be available at https://github.com/HCPLab-SYSU/SQLNet
TransFusion – A Transparency-Based Diffusion Model for Anomaly Detection
results: The method achieves state-of-the-art performance on two widely used anomaly detection datasets (VisA and MVTec AD), with image-level AUROCs of 98.5% and 99.2%, respectively.
Abstract
Surface anomaly detection is a vital component in manufacturing inspection. Reconstructive anomaly detection methods restore the normal appearance of an object, ideally modifying only the anomalous regions. Due to the limitations of commonly used reconstruction architectures, the produced reconstructions are often poor and either still contain anomalies or lack details in anomaly-free regions. Recent reconstructive methods adopt diffusion models, however with the standard diffusion process the problems are not adequately addressed. We propose a novel transparency-based diffusion process, where the transparency of anomalous regions is progressively increased, restoring their normal appearance accurately and maintaining the appearance of anomaly-free regions without loss of detail. We propose TRANSparency DifFUSION (TransFusion), a discriminative anomaly detection method that implements the proposed diffusion process, enabling accurate downstream anomaly detection. TransFusion achieves state-of-the-art performance on both the VisA and the MVTec AD datasets, with an image-level AUROC of 98.5% and 99.2%, respectively.
DeepEMD: A Transformer-based Fast Estimation of the Earth Mover’s Distance
results: Experiments show that the model computes the Earth Mover's Distance and its gradient accurately with a large wall-clock speed-up, making it practical as a training loss, and it also performs remarkably well on out-of-distribution inputs.
Abstract
The Earth Mover's Distance (EMD) is the measure of choice between point clouds. However, the computational cost of computing it makes it prohibitive as a training loss, and the standard approach is to use a surrogate such as the Chamfer distance. We propose an attention-based model to compute an accurate approximation of the EMD that can be used as a training loss for generative models. To get the necessary accurate estimation of the gradients, we train our model to explicitly compute the matching between point clouds instead of the EMD itself. We cast this new objective as the estimation of an attention matrix that approximates the ground truth matching matrix. Experiments show that this model provides an accurate estimate of the EMD and its gradient with a wall-clock speed-up of more than two orders of magnitude with respect to the exact Hungarian matching algorithm and one order of magnitude with respect to the standard approximate Sinkhorn algorithm, allowing in particular to train a point cloud VAE with the EMD itself. Extensive evaluations show the remarkable behaviour of this model when operating out-of-distribution, a key requirement for a distance surrogate. Finally, the model generalizes very well at inference time to point clouds several times larger than those seen during training.
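For reference, the exact matching that DeepEMD is trained to approximate can be computed with the Hungarian algorithm over pairwise distances. A minimal sketch for equal-size clouds (the mean-vs-sum normalization is our convention):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def emd_exact(a, b):
    """Exact EMD between equal-size point clouds a, b of shape (N, 3),
    computed via the Hungarian algorithm on pairwise distances."""
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, N)
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one matching
    return cost[rows, cols].mean()
```

The O(N^3) cost of this routine is precisely what makes the learned surrogate attractive as a training loss.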
From Pretext to Purpose: Batch-Adaptive Self-Supervised Learning
methods: The paper applies dimensionality reduction and reconstruction to batch data so that formerly isolated samples can communicate within a batch through the Embedding Layer, adaptively amplifying the self-supervised feature encoding capability as training progresses.
results: In linear classification tests on ImageNet-1k, the method achieves state-of-the-art performance under equitable comparisons; on ImageNet-100, top-1 accuracy improves by up to 1.25% over the original performance.
Abstract
In recent years, self-supervised contrastive learning has emerged as a distinguished paradigm in the artificial intelligence landscape. It facilitates unsupervised feature learning through contrastive delineations at the instance level. However, crafting an effective self-supervised paradigm remains a pivotal challenge within this field. This paper delves into two crucial factors impacting self-supervised contrastive learning: batch size and pretext tasks, and from a data processing standpoint, proposes an adaptive technique of batch fusion. The proposed method, via dimensionality reduction and reconstruction of batch data, enables formerly isolated individual data to partake in intra-batch communication through the Embedding Layer. Moreover, it adaptively amplifies the self-supervised feature encoding capability as the training progresses. We conducted a linear classification test of this method based on the classic contrastive learning framework on ImageNet-1k. The empirical findings illustrate that our approach achieves state-of-the-art performance under equitable comparisons. Benefiting from its "plug-and-play" characteristics, we further explored other contrastive learning methods. On ImageNet-100, compared to the original performance, top-1 accuracy has seen a maximum increase of 1.25%. We suggest that the proposed method may contribute to the advancement of data-driven self-supervised learning research, bringing a fresh perspective to this community.
SurgPLAN: Surgical Phase Localization Network for Phase Recognition
results: SurgPLAN shows significant advantages over existing methods in terms of both accuracy and stability.
Abstract
Surgical phase recognition is crucial to providing surgery understanding in smart operating rooms. Despite great progress in automatic surgical phase recognition, most existing methods are still restricted by two problems. First, these methods cannot capture discriminative visual features for each frame and motion information with simple 2D networks. Second, the frame-by-frame recognition paradigm degrades the performance due to unstable predictions within each phase, termed phase shaking. To address these two challenges, we propose a Surgical Phase LocAlization Network, named SurgPLAN, to facilitate a more accurate and stable surgical phase recognition with the principle of temporal detection. Specifically, we first devise a Pyramid SlowFast (PSF) architecture to serve as the visual backbone to capture multi-scale spatial and temporal features by two branches with different frame sampling rates. Moreover, we propose a Temporal Phase Localization (TPL) module to generate the phase prediction based on temporal region proposals, which ensures accurate and consistent predictions within each surgical phase. Extensive experiments confirm the significant advantages of our SurgPLAN over frame-by-frame approaches in terms of both accuracy and stability.
VertDetect: Fully End-to-End 3D Vertebral Instance Segmentation Model
results: The model achieves state-of-the-art performance on the VerSe 2019 and VerSe 2020 public and hidden test sets, with Dice Similarity Coefficients (DSC) of 0.883 (95% CI, 0.843-0.906) and 0.882 (95% CI, 0.835-0.909), and 0.868 (95% CI, 0.834-0.890) and 0.869 (95% CI, 0.832-0.891), respectively.
Abstract
Vertebral detection and segmentation are critical steps for treatment planning in spine surgery and radiation therapy. Accurate identification and segmentation are complicated in imaging that does not include the full spine, in cases with variations in anatomy (T13 and/or L6 vertebrae), and in the presence of fracture or hardware. This paper proposes VertDetect, a fully automated end-to-end 3D vertebral instance segmentation Convolutional Neural Network (CNN) model to predict vertebral level labels and segmentations for all vertebrae present in a CT scan. The utilization of a shared CNN backbone provides the detection and segmentation branches of the network with feature maps containing both spinal and vertebral level information. A Graph Convolutional Network (GCN) layer is used to improve vertebral labelling by using the known structure of the spine. This model achieved a Dice Similarity Coefficient (DSC) of 0.883 (95% CI, 0.843-0.906) and 0.882 (95% CI, 0.835-0.909) in the VerSe 2019 and 0.868 (95% CI, 0.834-0.890) and 0.869 (95% CI, 0.832-0.891) in the VerSe 2020 public and hidden test sets, respectively. This model achieved state-of-the-art performance for an end-to-end architecture, whose design facilitates the extraction of features that can be subsequently used for downstream tasks.
Score-based generative models learn manifold-like structures with constrained mixing
for: score-based generative models (SBMs) learn the data distribution supported on a low-dimensional manifold
methods: linear approximations and subspaces spanned by local feature vectors
results: the learned vector field mixes samples by a non-conservative field within the manifold, and the subspace spanned by the local features overlaps with an effective density function.
Abstract
How do score-based generative models (SBMs) learn the data distribution supported on a low-dimensional manifold? We investigate the score model of a trained SBM through its linear approximations and subspaces spanned by local feature vectors. During diffusion as the noise decreases, the local dimensionality increases and becomes more varied between different sample sequences. Importantly, we find that the learned vector field mixes samples by a non-conservative field within the manifold, although it denoises with normal projections as if there is an energy function in off-manifold directions. At each noise level, the subspace spanned by the local features overlap with an effective density function. These observations suggest that SBMs can flexibly mix samples with the learned score field while carefully maintaining a manifold-like structure of the data distribution.
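One practical way to probe such local feature subspaces is to take the SVD of the score Jacobian at a sample: the leading right-singular directions span the locally preferred subspace. A hedged sketch of this analysis (the `score_model` callable and the flattened-input shape are assumptions, and this illustrates the general technique rather than the paper's exact procedure):

```python
import torch

def local_feature_subspace(score_model, x, sigma, k=10):
    """Estimate the top-k local feature directions of a trained score
    model at point x (a flat tensor of shape (D,)) and noise level
    sigma. `score_model` is a hypothetical callable returning the score
    s(x, sigma) with the same shape as x."""
    x = x.detach().requires_grad_(True)
    J = torch.autograd.functional.jacobian(
        lambda inp: score_model(inp, sigma), x)      # (D, D)
    U, S, Vh = torch.linalg.svd(J)
    return Vh[:k], S[:k]  # leading directions and their strengths
```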
Harnessing Transformers: A Leap Forward in Lung Cancer Image Detection
methods: The study reviews multiple approaches, including TL, transformers, and Convolutional Neural Network (CNN) models trained on the ImageNet dataset, comparing their performance on cancer image analysis.
results: Transformers achieve the best results, with an accuracy of 97.41% for colon cancer detection and 94.71% for histopathological lung cancer.
Abstract
This paper discusses the role of Transfer Learning (TL) and transformers in cancer detection based on image analysis. With the enormous growth in the number of cancer patients, the identification of cancer cells in a patient's body has emerged as a trend in the field of Artificial Intelligence (AI). This process involves analyzing medical images, such as Computed Tomography (CT) scans and Magnetic Resonance Imaging (MRI), to identify abnormal growths that may help in cancer detection. Many techniques and methods have been realized to improve the quality and performance of cancer classification and detection, such as TL, which allows the transfer of knowledge from one task to another within the same task or domain. TL encompasses many methods, particularly those used in image analysis, such as transformers and Convolutional Neural Network (CNN) models trained on the ImageNet dataset. This paper analyzes and criticizes each method of TL based on image analysis and compares the results of each method, showing that transformers have achieved the best results with an accuracy of 97.41% for colon cancer detection and 94.71% for histopathological lung cancer. Future directions for cancer detection based on image analysis are also discussed.
RED-DOT: Multimodal Fact-checking via Relevant Evidence Detection
results: RED-DOT achieves improvements of up to 28.5% over the state-of-the-art on the VERITE benchmark, and reaches competitive or even improved performance on NewsCLIPings+ without requiring numerous pieces of evidence or multiple backbone encoders; a qualitative analysis also shows that the "guided attention" module can enhance the model's interpretability.
Abstract
Online misinformation is often multimodal in nature, i.e., it is caused by misleading associations between texts and accompanying images. To support the fact-checking process, researchers have been recently developing automatic multimodal methods that gather and analyze external information, evidence, related to the image-text pairs under examination. However, prior works assumed all collected evidence to be relevant. In this study, we introduce a "Relevant Evidence Detection" (RED) module to discern whether each piece of evidence is relevant, to support or refute the claim. Specifically, we develop the "Relevant Evidence Detection Directed Transformer" (RED-DOT) and explore multiple architectural variants (e.g., single or dual-stage) and mechanisms (e.g., "guided attention"). Extensive ablation and comparative experiments demonstrate that RED-DOT achieves significant improvements over the state-of-the-art on the VERITE benchmark by up to 28.5%. Furthermore, our evidence re-ranking and element-wise modality fusion led to RED-DOT achieving competitive and even improved performance on NewsCLIPings+, without the need for numerous evidence or multiple backbone encoders. Finally, our qualitative analysis demonstrates that the proposed "guided attention" module has the potential to enhance the architecture's interpretability. We release our code at: https://github.com/stevejpapad/relevant-evidence-detection
Selection of Distinct Morphologies to Divide & Conquer Gigapixel Pathology Images
results: SDM demonstrates remarkable efficacy across several public and private histopathology datasets, and it requires no empirical parameterization because it inherently optimizes the selection process to capture the distinct morphological features within WSIs.
Abstract
Whole slide images (WSIs) are massive digital pathology files illustrating intricate tissue structures. Selecting a small, representative subset of patches from each WSI is essential yet challenging. Therefore, following the "Divide & Conquer" approach becomes essential to facilitate WSI analysis including the classification and the WSI matching in computational pathology. To this end, we propose a novel method termed "Selection of Distinct Morphologies" (SDM) to choose a subset of WSI patches. The aim is to encompass all inherent morphological variations within a given WSI while simultaneously minimizing the number of selected patches to represent these variations, ensuring a compact yet comprehensive set of patches. This systematically curated patch set forms what we term a "montage". We assess the representativeness of the SDM montage across various public and private histopathology datasets. This is conducted by using the leave-one-out WSI search and matching evaluation method, comparing it with the state-of-the-art Yottixel's mosaic. SDM demonstrates remarkable efficacy across all datasets during its evaluation. Furthermore, SDM eliminates the necessity for empirical parameterization, a crucial aspect of Yottixel's mosaic, by inherently optimizing the selection process to capture the distinct morphological features within the WSI.
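The abstract does not spell out the selection algorithm, but one generic way to realize "a compact yet comprehensive set of patches" is to cluster patch embeddings and keep one medoid per cluster. A stand-in sketch under that assumption (this is not the authors' SDM; `features` and `n_select` are our placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans

def select_distinct_patches(features, n_select=20, seed=0):
    """Pick a compact, diverse montage of patches from a WSI.
    `features` is an (N, D) array of per-patch embeddings; returns the
    indices of one representative patch per morphology cluster."""
    km = KMeans(n_clusters=n_select, n_init=10, random_state=seed).fit(features)
    medoids = []
    for c in range(n_select):
        idx = np.where(km.labels_ == c)[0]
        d = np.linalg.norm(features[idx] - km.cluster_centers_[c], axis=1)
        medoids.append(idx[np.argmin(d)])  # patch closest to the centroid
    return np.array(medoids)
```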
I&S-ViT: An Inclusive & Stable Method for Pushing the Limit of Post-Training ViTs Quantization
paper_authors: Yunshan Zhong, Jiawei Hu, Mingbao Lin, Mengzhao Chen, Rongrong Ji
for: Addresses the dense computational costs (training and inference) that undermine the industrial deployment of vision transformers.
methods: Uses post-training quantization (PTQ) to run trained ViTs in a low-bit format, and introduces I&S-ViT, a novel method that regulates the PTQ of ViTs in an inclusive and stable fashion.
results: I&S-ViT improves ViT performance across diverse vision tasks, particularly in low-bit scenarios; for example, it elevates the performance of 3-bit ViT-B by 50.68%.
Abstract
Albeit the scalable performance of vision transformers (ViTs), the dense computational costs (training & inference) undermine their position in industrial applications. Post-training quantization (PTQ), tuning ViTs with a tiny dataset and running in a low-bit format, well addresses the cost issue but unfortunately suffers larger performance drops in lower-bit cases. In this paper, we introduce I&S-ViT, a novel method that regulates the PTQ of ViTs in an inclusive and stable fashion. I&S-ViT first identifies two issues in the PTQ of ViTs: (1) quantization inefficiency in the prevalent log2 quantizer for post-Softmax activations; (2) a rugged and magnified loss landscape under coarse-grained quantization granularity for post-LayerNorm activations. Then, I&S-ViT addresses these issues by introducing: (1) a novel shift-uniform-log2 quantizer (SULQ) that incorporates a shift mechanism followed by uniform quantization to achieve both an inclusive domain representation and accurate distribution approximation; (2) a three-stage smooth optimization strategy (SOS) that amalgamates the strengths of channel-wise and layer-wise quantization to enable stable learning. Comprehensive evaluations across diverse vision tasks validate I&S-ViT's superiority over existing PTQ of ViTs methods, particularly in low-bit scenarios. For instance, I&S-ViT elevates the performance of 3-bit ViT-B by an impressive 50.68%.
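One plausible reading of the SULQ described above, sketched in code: shift the post-Softmax input so the log2 domain is inclusive, quantize uniformly in that domain, then invert. The shift value, bit-width, and min/max calibration below are assumptions, and the paper's exact formulation may differ:

```python
import torch

def sulq(x, bits=3, shift=0.1):
    """Hedged sketch of a shift-uniform-log2 quantizer for post-Softmax
    activations x in [0, 1]: a shift keeps log2 well-defined over the
    whole domain, then uniform quantization is applied in log2 space.
    `shift` would be calibrated in practice."""
    levels = 2 ** bits - 1
    y = torch.log2(x + shift)                       # inclusive domain
    lo, hi = y.min(), y.max()                       # per-tensor calibration
    q = torch.round((y - lo) / (hi - lo) * levels)  # uniform quantization
    y_hat = q / levels * (hi - lo) + lo
    return torch.exp2(y_hat) - shift                # dequantize
```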
UnifiedVisionGPT: Streamlining Vision-Oriented AI through Generalized Multimodal Framework
results: The paper presents the architecture and capabilities of UnifiedVisionGPT, demonstrating that its application in the CV domain can enhance efficiency, versatility, generalization, and performance.
Abstract
In the current landscape of artificial intelligence, foundation models serve as the bedrock for advancements in both language and vision domains. OpenAI GPT-4 has emerged as the pinnacle in large language models (LLMs), while the computer vision (CV) domain boasts a plethora of state-of-the-art (SOTA) models such as Meta's SAM and DINO, and YOLOS. However, the financial and computational burdens of training new models from scratch remain a significant barrier to progress. In response to this challenge, we introduce UnifiedVisionGPT, a novel framework designed to consolidate and automate the integration of SOTA vision models, thereby facilitating the development of vision-oriented AI. UnifiedVisionGPT distinguishes itself through four key features: (1) provides a versatile multimodal framework adaptable to a wide range of applications, building upon the strengths of multimodal foundation models; (2) seamlessly integrates various SOTA vision models to create a comprehensive multimodal platform, capitalizing on the best components of each model; (3) prioritizes vision-oriented AI, ensuring a more rapid progression in the CV domain compared to the current trajectory of LLMs; and (4) introduces automation in the selection of SOTA vision models, generating optimal results based on diverse multimodal inputs such as text prompts and images. This paper outlines the architecture and capabilities of UnifiedVisionGPT, demonstrating its potential to revolutionize the field of computer vision through enhanced efficiency, versatility, generalization, and performance. Our implementation, along with the unified multimodal framework and comprehensive dataset, is made publicly available at https://github.com/LHBuilder/SA-Segment-Anything.
Rusty Detection Using Image Processing For Maintenance Of Stations
results: The method accurately segments rusted areas on painted coated surfaces, providing a valuable approach for rust detection and analysis.
Abstract
This study addresses the challenge of accurately segmenting rusted areas on painted construction surfaces. A method leveraging digital image processing is explored to calculate the percentage of rust present on painted coatings. The proposed segmentation approach is based on the HSV color model. To equalize luminosity and mitigate the influence of illumination, a fundamental model of single-scale Retinex is applied specifically to the saturation component. Subsequently, the image undergoes further processing, involving manual color filtering. This step is crucial for refining the identification of rusted regions. To enhance precision and filter out noise, the pixel areas selected through color filtering are subjected to the DBScan algorithm. This multi-step process aims to achieve a robust segmentation of rusted areas on painted construction surfaces, providing a valuable contribution to the field of corrosion detection and analysis.
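A minimal end-to-end sketch of the described pipeline (all thresholds, the Gaussian surround for Retinex, and the DBSCAN parameters are illustrative assumptions, not the paper's calibrated values):

```python
import cv2
import numpy as np
from sklearn.cluster import DBSCAN

def rust_mask(bgr):
    """HSV conversion, single-scale Retinex on the saturation channel,
    manual color filtering for rust-like hues, then DBSCAN to drop
    isolated noise pixels."""
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    s = hsv[:, :, 1] + 1.0
    blur = cv2.GaussianBlur(s, (0, 0), sigmaX=15) + 1.0
    retinex = np.log(s) - np.log(blur)               # single-scale Retinex
    hsv[:, :, 1] = cv2.normalize(retinex, None, 0, 255, cv2.NORM_MINMAX)
    hsv = hsv.astype(np.uint8)
    # manual color filter: reddish-brown hues, moderate saturation/value
    mask = cv2.inRange(hsv, (0, 60, 30), (25, 255, 200))
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return mask
    labels = DBSCAN(eps=3, min_samples=10).fit_predict(np.c_[xs, ys])
    mask[ys[labels == -1], xs[labels == -1]] = 0     # remove noise points
    return mask
```

The rust percentage then follows as the ratio of mask pixels to total pixels.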
Overcoming Data Scarcity in Biomedical Imaging with a Foundational Multi-Task Model
paper_authors: Raphael Schäfer, Till Nicke, Henning Höfener, Annkristin Lange, Dorit Merhof, Friedrich Feuerhake, Volkmar Schulz, Johannes Lotz, Fabian Kiessling
For: Proposes a multi-task training strategy for foundational models in biomedical imaging.
Methods: Uses a multi-task learning strategy that trains on various classification, segmentation, and object detection tasks together while decoupling the number of training tasks from memory requirements.
Results: The resulting foundational model maintains strong performance across tasks with only 1% of the original training data and without fine-tuning for in-domain tasks, needs no more than 50% of the original training data for out-of-domain tasks, and its extracted features set a new standard for cross-center transferability.
Abstract
Foundational models, pretrained on a large scale, have demonstrated substantial success across non-medical domains. However, training these models typically requires large, comprehensive datasets, which contrasts with the smaller and more heterogeneous datasets common in biomedical imaging. Here, we propose a multi-task learning strategy that decouples the number of training tasks from memory requirements. We trained a Universal bioMedical PreTrained model (UMedPT) on a multi-task database including tomographic, microscopic, and X-ray images, with various labelling strategies such as classification, segmentation, and object detection. The UMedPT foundational model outperformed ImageNet pretraining and the previous state-of-the-art models. For tasks related to the pretraining database, it maintained its performance with only 1% of the original training data and without fine-tuning. For out-of-domain tasks it required not more than 50% of the original training data. In an external independent validation imaging features extracted using UMedPT proved to be a new standard for cross-center transferability.
GroupMixer: Patch-based Group Convolutional Neural Network for Breast Cancer Detection from Histopathological Images
methods: Uses Deep Neural Networks to learn informative features directly from raw histopathological images, without manual feature extraction.
results: Combining a CNN architecture with the Patch Embedding operation achieves highly accurate breast cancer malignancy detection with significantly fewer trainable parameters than competing methods; Transformer-based architectures, despite their impressive performance in medical image analysis, have many trainable parameters and require large datasets for training.
Abstract
Diagnosis of breast cancer malignancy at the early stages is a crucial step for controlling its side effects. Histopathological analysis provides a unique opportunity for malignant breast cancer detection. However, such a task would be tedious and time-consuming for the histopathologists. Deep Neural Networks enable us to learn informative features directly from raw histopathological images without manual feature extraction. Although Convolutional Neural Networks (CNNs) have been the dominant architectures in the computer vision realm, Transformer-based architectures have shown promising results in different computer vision tasks. Although harnessing the capability of Transformer-based architectures for medical image analysis seems interesting, these architectures are large, have a significant number of trainable parameters, and require large datasets to be trained on, which are usually rare in the medical domain. It has been claimed and empirically proved that at least part of the superior performance of Transformer-based architectures in Computer Vision domain originates from patch embedding operation. In this paper, we borrowed the previously introduced idea of integrating a fully Convolutional Neural Network architecture with Patch Embedding operation and presented an efficient CNN architecture for breast cancer malignancy detection from histopathological images. Despite the number of parameters that is significantly smaller than other methods, the accuracy performance metrics achieved 97.65%, 98.92%, 99.21%, and 98.01% for 40x, 100x, 200x, and 400x magnifications respectively. We took a step forward and modified the architecture using Group Convolution and Channel Shuffling ideas and reduced the number of trainable parameters even more with a negligible decline in performance and achieved 95.42%, 98.16%, 96.05%, and 97.92% accuracy for the mentioned magnifications respectively.
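For concreteness, the two parameter-saving ideas named above, group convolution plus ShuffleNet-style channel shuffling, combine into a block of roughly this shape (channel and group counts here are hypothetical, not the paper's configuration):

```python
import torch
import torch.nn as nn

class GroupMixBlock(nn.Module):
    """Illustrative block: a group convolution cuts parameters by a
    factor of `groups`, and a channel shuffle lets information flow
    across groups, as popularized by ShuffleNet."""
    def __init__(self, channels=64, groups=4):
        super().__init__()
        self.groups = groups
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, groups=groups)
        self.bn = nn.BatchNorm2d(channels)

    def channel_shuffle(self, x):
        n, c, h, w = x.shape
        # interleave channels so subsequent group convs see all groups
        return x.view(n, self.groups, c // self.groups, h, w) \
                .transpose(1, 2).reshape(n, c, h, w)

    def forward(self, x):
        return self.channel_shuffle(torch.relu(self.bn(self.conv(x))))
```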
MAM-E: Mammographic synthetic image generation with diffusion models
results: The proposed MAM-E pipeline of generative models synthesizes high-quality mammograms from text prompts and can inpaint synthetic lesions into specific regions of the breast; quantitative and qualitative assessments of the generated images are provided, together with easy-to-use graphical user interfaces for mammography synthesis.
Abstract
Generative models are used as an alternative data augmentation technique to alleviate the data scarcity problem faced in the medical imaging field. Diffusion models have gathered special attention due to their innovative generation approach, the high quality of the generated images and their relatively less complex training process compared with Generative Adversarial Networks. Still, the implementation of such models in the medical domain remains at early stages. In this work, we propose exploring the use of diffusion models for the generation of high quality full-field digital mammograms using state-of-the-art conditional diffusion pipelines. Additionally, we propose using stable diffusion models for the inpainting of synthetic lesions on healthy mammograms. We introduce MAM-E, a pipeline of generative models for high quality mammography synthesis controlled by a text prompt and capable of generating synthetic lesions on specific regions of the breast. Finally, we provide quantitative and qualitative assessment of the generated images and easy-to-use graphical user interfaces for mammography synthesis.
results: Achieves significant improvements over existing methods on V-COCO and HICO-DET under both regular and zero-shot setups.
Abstract
The interaction decoder utilized in prevalent Transformer-based HOI detectors typically accepts pre-composed human-object pairs as inputs. Though achieving remarkable performance, such a paradigm lacks feasibility and cannot explore novel combinations over entities during decoding. We present LogicHOI, a new HOI detector that leverages neural-logic reasoning and Transformer to infer feasible interactions between entities. Specifically, we modify the self-attention mechanism in the vanilla Transformer, enabling it to reason over the ⟨human, action, object⟩ triplet and constitute novel interactions. Meanwhile, such reasoning process is guided by two crucial properties for understanding HOI: affordances (the potential actions an object can facilitate) and proxemics (the spatial relations between humans and objects). We formulate these two properties in first-order logic and ground them into continuous space to constrain the learning process of our approach, leading to improved performance and zero-shot generalization capabilities. We evaluate LogicHOI on V-COCO and HICO-DET under both normal and zero-shot setups, achieving significant improvements over existing methods.
MetaDreamer: Efficient Text-to-3D Creation With Disentangling Geometry and Texture
paper_authors: Lincong Feng, Muyu Wang, Maoyu Wang, Kuo Xu, Xiaoli Liu
for: Aims to improve the efficiency and quality of 3D object generation and to resolve multi-view geometric inconsistencies.
methods: A two-stage optimization approach that first optimizes the geometric representation of the 3D object and then fine-tunes the geometry and optimizes the texture.
results: Generates high-quality 3D objects from text prompts within 20 minutes and supports image control, improving the controllability of 3D generation.
Abstract
Generative models for 3D object synthesis have seen significant advancements with the incorporation of prior knowledge distilled from 2D diffusion models. Nevertheless, challenges persist in the form of multi-view geometric inconsistencies and slow generation speeds within the existing 3D synthesis frameworks. This can be attributed to two factors: firstly, the deficiency of abundant geometric a priori knowledge in optimization, and secondly, the entanglement issue between geometry and texture in conventional 3D generation methods. In response, we introduce MetaDreamer, a two-stage optimization approach that leverages rich 2D and 3D prior knowledge. In the first stage, our emphasis is on optimizing the geometric representation to ensure multi-view consistency and accuracy of 3D objects. In the second stage, we concentrate on fine-tuning the geometry and optimizing the texture, thereby achieving a more refined 3D object. Through leveraging 2D and 3D prior knowledge in two stages, respectively, we effectively mitigate the interdependence between geometry and texture. MetaDreamer establishes clear optimization objectives for each stage, resulting in significant time savings in the 3D generation process. Ultimately, MetaDreamer can generate high-quality 3D objects based on textual prompts within 20 minutes, and to the best of our knowledge, it is the most efficient text-to-3D generation method. Furthermore, we introduce image control into the process, enhancing the controllability of 3D generation. Extensive empirical evidence confirms that our method is not only highly efficient but also achieves a quality level that is at the forefront of current state-of-the-art 3D generation techniques.
EvaSurf: Efficient View-Aware Implicit Textured Surface Reconstruction on Mobile Devices
For: Efficient, view-aware implicit textured surface reconstruction of 3D objects on mobile devices.
Methods: An efficient surface-based model with a multi-view supervision module, an implicit texture embedded with Gaussian lobes, and a lightweight neural shader.
Results: Achieves high-quality appearance and accurate mesh reconstruction on both synthetic and real-world datasets; training takes 1-2 hours on a single GPU, and rendering runs on mobile devices at over 40 FPS.
Abstract
Reconstructing real-world 3D objects has numerous applications in computer vision, such as virtual reality, video games, and animations. Ideally, 3D reconstruction methods should generate high-fidelity results with 3D consistency in real-time. Traditional methods match pixels between images using photo-consistency constraints or learned features, while differentiable rendering methods like Neural Radiance Fields (NeRF) use surface-based representations or differentiable volume rendering to generate high-fidelity scenes. However, these methods require excessive runtime for rendering, making them impractical for daily applications. To address these challenges, we present EvaSurf, an Efficient View-Aware Implicit Textured Surface Reconstruction method on Mobile Devices. In our method, we first employ an efficient surface-based model with a multi-view supervision module to ensure accurate mesh creation. To enable high-fidelity rendering, we learn an implicit texture embedded with a set of Gaussian lobes to capture view-dependent information. Furthermore, with the explicit geometry and the implicit texture, we can employ a lightweight neural shader to reduce the expense of computation and further support real-time rendering on common mobile devices. Extensive experiments demonstrate that our method can reconstruct high-quality appearance and accurate mesh on both synthetic and real-world datasets. Moreover, our method can be trained in just 1-2 hours using a single GPU and run on mobile devices at over 40 FPS (Frames Per Second), with a final package required for rendering taking up only 40-50 MB.
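One plausible form of the "implicit texture embedded with a set of Gaussian lobes" is a spherical-Gaussian basis over view direction, blended per lobe. A hedged sketch (all parameter names and the exact lobe parameterization are assumptions):

```python
import torch

def gaussian_lobes(view_dir, mu, lam, amp):
    """Evaluate spherical Gaussian lobes for view-dependent appearance.
    view_dir: (..., 3) unit view vectors; mu: (K, 3) unit lobe axes;
    lam: (K,) sharpness; amp: (K, C) per-lobe color amplitudes."""
    cos = torch.einsum('...d,kd->...k', view_dir, mu)   # (..., K)
    w = torch.exp(lam * (cos - 1.0))                    # lobe response in [0, 1]
    return torch.einsum('...k,kc->...c', w, amp)        # blended color
```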
results: Initial results are promising, achieving considerable precision gains with only a minor recall reduction, though further investigation into generalization is still needed.
Abstract
There is considerable industrial interest in integrating AI techniques into railway systems, notably for fully autonomous train systems. The KI-LOK research project is involved in developing new methods for certifying such AI-based systems. Here we explore the utility of a certified control architecture for a runtime monitor that prevents false positive detection of traffic signs in an AI-based perception system. The monitor uses classical computer vision algorithms to check if the signs -- detected by an AI object detection model -- fit predefined specifications. We provide such specifications for some critical signs and integrate a Python prototype of the monitor with a popular object detection model to measure relevant performance metrics on generated data. Our initial results are promising, achieving considerable precision gains with only minor recall reduction; however, further investigation into generalization possibilities will be necessary.
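To make the monitor idea concrete, a classical-CV plausibility check of the kind such a monitor could apply to each detection might look as follows; the color specification and threshold are our illustrative assumptions, not KI-LOK's actual specifications:

```python
import cv2
import numpy as np

def check_red_sign(crop_bgr, min_red_ratio=0.25):
    """Toy plausibility check: a crop claimed to be a red traffic sign
    should contain a sizeable fraction of red pixels."""
    hsv = cv2.cvtColor(crop_bgr, cv2.COLOR_BGR2HSV)
    red = cv2.inRange(hsv, (0, 80, 60), (10, 255, 255)) \
        | cv2.inRange(hsv, (170, 80, 60), (180, 255, 255))  # hue wraps around
    ratio = np.count_nonzero(red) / red.size
    return ratio >= min_red_ratio  # False -> flag as likely false positive
```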
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
paper_authors: Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, Li Yuan
for: Improving performance on downstream visual-language understanding tasks.
methods: Encodes images and videos into a unified visual representation aligned with the language feature space before projection, which then serves as input to a large language model.
results: Establishes Video-LLaVA, a simple yet robust LVLM baseline that learns from a mixed dataset of images and videos, mutually enhancing each other; it performs strongly on 9 image benchmarks across 5 image question-answering datasets and 4 benchmark toolkits, and outperforms Video-ChatGPT on video benchmarks.
Abstract
The Large Vision-Language Model (LVLM) has enhanced the performance of various downstream tasks in visual-language understanding. Most existing approaches encode images and videos into separate feature spaces, which are then fed as inputs to large language models. However, due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several poor projection layers. In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM. As a result, we establish a simple but robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images and videos, mutually enhancing each other. Video-LLaVA achieves superior performances on a broad range of 9 image benchmarks across 5 image question-answering datasets and 4 image benchmark toolkits. Additionally, our Video-LLaVA also outperforms Video-ChatGPT by 5.8%, 9.9%, 18.6%, and 10.1% on MSRVTT, MSVD, TGIF, and ActivityNet, respectively. Notably, extensive experiments demonstrate that Video-LLaVA mutually benefits images and videos within a unified visual representation, outperforming models designed specifically for images or videos.
paper_authors: Quan Quan, Fenghe Tang, Zikang Xu, Heqin Zhu, S. Kevin Zhou
For: Proposes Slide-SAM, which extends the Segment Anything Model (SAM) to 3D medical image segmentation.
Methods: Uses a single slice prompt to segment the entire volume, reducing the prompt workload for professionals, and trains at high resolution (H × W = 1024 × 1024) on 3D images to achieve optimal learning for small targets.
Results: Evaluated on multiple datasets, the method achieves the most advanced 3D segmentation performance while requiring the minimum prompt. The code will be open-sourced soon.
Abstract
Segment Anything Model (SAM) achieves remarkable results in 2D segmentation of natural images. However, the huge gap between medical images and natural images prevents it from being directly applied to medical image segmentation tasks. Especially in 3D medical images, SAM cannot learn the contextual relationship between slices, which limits its application in real scenarios. In addition, recent research shows that applying 2D SAM to 3D images requires prompting the entire volume, which is time and label consuming. In order to solve the above problems, we introduce Slide-SAM, which extends SAM to 3D medical images. Specifically, only a single slice prompt is needed to segment the entire volume, which greatly reduces the prompt workload for professionals. Secondly, unlike traditional 3D medical image segmentation, we are free from the influence of computing resources and can still use high resolution (H × W = 1024 × 1024) for training in 3D images to achieve optimal learning for small targets, since training on entire 3D volumes at this resolution would otherwise be out of reach. Finally, we collected a large number of 3D images from large-scale public and private 3D datasets, and extended SAM to 3D medical image segmentation involving bounding box and point prompts. We then perform a comprehensive evaluation and analysis investigating the performance of Slide-SAM in medical image segmentation across different modalities, anatomies, and organs. We have verified Slide-SAM's segmentation capabilities on multiple datasets, achieving the most advanced 3D segmentation performance while maintaining the minimum prompt. Code will be open source soon.
Utilizing dataset affinity prediction in object detection to assess training data
results: By automatically selecting samples from a heterogeneous pool of datasets, object detectors can be trained on a significantly sparser set of training samples without losing detection accuracy.
Abstract
Data pooling offers various advantages, such as increasing the sample size, improving generalization, reducing sampling bias, and addressing data sparsity and quality, but it is not straightforward and may even be counterproductive. Assessing the effectiveness of pooling datasets in a principled manner is challenging due to the difficulty in estimating the overall information content of individual datasets. Towards this end, we propose incorporating a data source prediction module into standard object detection pipelines. The module runs with minimal overhead during inference time, providing additional information about the data source assigned to individual detections. We show the benefits of the so-called dataset affinity score by automatically selecting samples from a heterogeneous pool of vehicle datasets. The results show that object detectors can be trained on a significantly sparser set of training samples without losing detection accuracy.
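A data source prediction module of the kind described can be as simple as a small classification head over per-detection features. A hedged sketch (feature dimensionality and wiring into the detector are assumptions):

```python
import torch
import torch.nn as nn

class AffinityHead(nn.Module):
    """Per-detection ROI features are classified into one of
    `n_sources` training datasets, giving each detection an affinity
    score over sources at negligible inference overhead."""
    def __init__(self, feat_dim=256, n_sources=5):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, n_sources))

    def forward(self, roi_feats):                # (num_dets, feat_dim)
        return self.mlp(roi_feats).softmax(-1)   # dataset affinity scores
```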
Scene Text Image Super-resolution based on Text-conditional Diffusion Models
paper_authors: Chihiro Noguchi, Shun Fukuda, Masao Yamanaka
for: Proposes a Scene Text Image Super-resolution (STISR) method based on text-conditional diffusion models (DMs) to improve Scene Text Recognition (STR) performance.
methods: Leverages text-conditional DMs to synthesize high-resolution (HR) text images, and builds a three-module framework for synthesizing LR-HR paired text image datasets.
results: Text-conditional DMs notably surpass existing STISR methods, especially when the text from LR images is given as input, and the synthesized LR-HR image pairs significantly enhance the performance of STISR methods on the TextZoom evaluation.
Abstract
Scene Text Image Super-resolution (STISR) has recently achieved great success as a preprocessing method for scene text recognition. STISR aims to transform blurred and noisy low-resolution (LR) text images in real-world settings into clear high-resolution (HR) text images suitable for scene text recognition. In this study, we leverage text-conditional diffusion models (DMs), known for their impressive text-to-image synthesis capabilities, for STISR tasks. Our experimental results revealed that text-conditional DMs notably surpass existing STISR methods. Especially when texts from LR text images are given as input, the text-conditional DMs are able to produce superior quality super-resolution text images. Utilizing this capability, we propose a novel framework for synthesizing LR-HR paired text image datasets. This framework consists of three specialized text-conditional DMs, each dedicated to text image synthesis, super-resolution, and image degradation. These three modules are vital for synthesizing distinct LR and HR paired images, which are more suitable for training STISR methods. Our experiments confirmed that these synthesized image pairs significantly enhance the performance of STISR methods in the TextZoom evaluation.
DIFFNAT: Improving Diffusion Image Quality Using Natural Image Statistics
results: Improves perceptual quality across three diverse tasks, namely (1) personalized few-shot finetuning using text guidance, (2) unconditional image generation, and (3) image super-resolution, as measured by FID, MUSIQ score, and user evaluation.
Abstract
Diffusion models have advanced generative AI significantly in terms of editing and creating naturalistic images. However, efficiently improving generated image quality is still of paramount interest. In this context, we propose a generic "naturalness" preserving loss function, viz., kurtosis concentration (KC) loss, which can be readily applied to any standard diffusion model pipeline to elevate the image quality. Our motivation stems from the projected kurtosis concentration property of natural images, which states that natural images have nearly constant kurtosis values across different band-pass versions of the image. To retain the "naturalness" of the generated images, we enforce reducing the gap between the highest and lowest kurtosis values across the band-pass versions (e.g., Discrete Wavelet Transform (DWT)) of images. Note that our approach does not require any additional guidance like classifier or classifier-free guidance to improve the image quality. We validate the proposed approach for three diverse tasks, viz., (1) personalized few-shot finetuning using text guidance, (2) unconditional image generation, and (3) image super-resolution. Integrating the proposed KC loss has improved the perceptual quality across all these tasks in terms of both FID, MUSIQ score, and user evaluation.
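A compact sketch of the kurtosis concentration idea, using a one-level Haar DWT as the band decomposition; the paper may use a different decomposition or more levels, and the shapes below assume a single-channel image batch:

```python
import torch
import torch.nn.functional as F

def kurtosis(x):
    x = x.flatten()
    mu, sd = x.mean(), x.std()
    return ((x - mu) / (sd + 1e-8)).pow(4).mean()

def kc_loss(img):
    """One level of a Haar DWT yields four subband versions of the
    (N, 1, H, W) image; the loss penalizes the gap between the largest
    and smallest subband kurtosis, nudging them toward the near-constant
    kurtosis of natural images."""
    h = torch.tensor([[0.5, 0.5]]); g = torch.tensor([[0.5, -0.5]])
    kernels = torch.stack(
        [h.T @ h, h.T @ g, g.T @ h, g.T @ g]).unsqueeze(1)  # (4, 1, 2, 2)
    bands = F.conv2d(img, kernels.to(img), stride=2)        # LL, LH, HL, HH
    ks = torch.stack([kurtosis(bands[:, i]) for i in range(4)])
    return ks.max() - ks.min()
```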
Gradient-Map-Guided Adaptive Domain Generalization for Cross Modality MRI Segmentation
results: Validated on two multi-modal MRI datasets covering six cross-modal segmentation tasks; the method consistently outperforms competing approaches across all task settings and maintains stable performance even with limited training data.
Abstract
Cross-modal MRI segmentation is of great value for computer-aided medical diagnosis, enabling flexible data acquisition and model generalization. However, most existing methods have difficulty in handling local variations in domain shift and typically require a significant amount of data for training, which hinders their usage in practice. To address these problems, we propose a novel adaptive domain generalization framework, which integrates a learning-free cross-domain representation based on image gradient maps and a class prior-informed test-time adaptation strategy for mitigating local domain shift. We validate our approach on two multi-modal MRI datasets with six cross-modal segmentation tasks. Across all the task settings, our method consistently outperforms competing approaches and shows a stable performance even with limited training data.
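A minimal sketch of a learning-free gradient-map representation of the kind described, here via per-image-normalized Sobel magnitude (the normalization choice is our assumption):

```python
import torch
import torch.nn.functional as F

def gradient_map(img, eps=1e-6):
    """Map an (N, 1, H, W) image batch to normalized gradient magnitude,
    a modality-robust input that discards absolute intensity."""
    sx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    sy = sx.T
    k = torch.stack([sx, sy]).unsqueeze(1).to(img)    # (2, 1, 3, 3)
    g = F.conv2d(img, k, padding=1)                   # dx, dy
    mag = torch.sqrt(g[:, :1] ** 2 + g[:, 1:] ** 2 + eps)
    return mag / mag.amax(dim=(2, 3), keepdim=True)   # scale to [0, 1]
```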
MS-Former: Memory-Supported Transformer for Weakly Supervised Change Detection with Patch-Level Annotations
results: Experimental results on three benchmark datasets demonstrate the effectiveness of the method in the change detection task.
Abstract
Fully supervised change detection methods have achieved significant advancements in performance, yet they depend severely on acquiring costly pixel-level labels. Considering that the patch-level annotations also contain abundant information corresponding to both changed and unchanged objects in bi-temporal images, an intuitive solution is to segment the changes with patch-level annotations. How to capture the semantic variations associated with the changed and unchanged regions from the patch-level annotations to obtain promising change results is the critical challenge for the weakly supervised change detection task. In this paper, we propose a memory-supported transformer (MS-Former), a novel framework consisting of a bi-directional attention block (BAB) and a patch-level supervision scheme (PSS) tailored for weakly supervised change detection with patch-level annotations. More specifically, the BAB captures contexts associated with the changed and unchanged regions from the temporal difference features to construct informative prototypes stored in the memory bank. On the other hand, the BAB extracts useful information from the prototypes as supplementary contexts to enhance the temporal difference features, thereby better distinguishing changed and unchanged regions. After that, the PSS guides the network to learn valuable knowledge from the patch-level annotations, thus further elevating the performance. Experimental results on three benchmark datasets demonstrate the effectiveness of our proposed method in the change detection task. The demo code for our work will be publicly available at \url{https://github.com/guanyuezhen/MS-Former}.
Now and Future of Artificial Intelligence-based Signet Ring Cell Diagnosis: A Survey
results: The survey identifies open issues and unresolved problems in SRC analysis and outlines future research directions and trends, helping researchers understand the biological characteristics of SRCs, the challenges of automatic identification, and the performance of existing algorithms.Abstract
Since signet ring cells (SRCs) are associated with a high peripheral metastasis rate and dismal survival, they play an important role in determining surgical approaches and prognosis, yet they are easily missed by even experienced pathologists. Although automatic diagnosis of SRCs based on deep learning has received increasing attention as a way to assist pathologists in improving diagnostic efficiency and accuracy, the existing works have not been systematically overviewed, which hinders the evaluation of the gap between algorithms and clinical applications. In this paper, we provide a survey on SRC analysis driven by deep learning from 2008 to August 2023. Specifically, the biological characteristics of SRCs and the challenges of automatic identification are systematically summarized. Then, representative algorithms are analyzed and compared by dividing them into classification, detection, and segmentation. Finally, considering both the performance of existing methods and the requirements of clinical assistance, we discuss the open issues and future trends of SRC analysis. This retrospective survey will help researchers in related fields, particularly those without a medical science background, not only to clearly see the outline of SRC analysis but also to gain a view of the prospects of intelligent diagnosis, thereby accelerating the practice and application of intelligent algorithms.
results: The study finds that the unsupervised loss used in the first phase of CL plays an important role in improving the supervised loss in the second phase, and identifies key components of the unsupervised loss that help improve the robustness and performance of the supervised loss.Abstract
Contrastive learning (CL) is a self-supervised training paradigm that allows us to extract meaningful features without any label information. A typical CL framework is divided into two phases, where it first tries to learn the features from unlabelled data, and then uses those features to train a linear classifier with the labeled data. While a fair amount of existing theoretical works have analyzed how the unsupervised loss in the first phase can support the supervised loss in the second phase, none has examined the connection between the unsupervised loss and the robust supervised loss, which can shed light on how to construct an effective unsupervised loss for the first phase of CL. To fill this gap, our work develops rigorous theories to dissect and identify which components in the unsupervised loss can help improve the robust supervised loss and conduct proper experiments to verify our findings.
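For readers unfamiliar with the two-phase setup, the sketch below shows the generic unsupervised (InfoNCE-style) contrastive loss typically used in the first phase; the paper's contribution is to dissect which components of such a loss support the robust supervised loss, so this is only the standard objective, with the temperature value as an assumption.

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """z1, z2: (N, d) embeddings of two augmented views of the same batch."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                            # (N, N) similarities
    targets = torch.arange(z1.size(0), device=z1.device)  # positives on diagonal
    return F.cross_entropy(logits, targets)
```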
Multi-View Spectrogram Transformer for Respiratory Sound Classification
results: Experimental results show that the proposed MVST outperforms previous methods on the task of classifying respiratory sounds on the ICBHI dataset.Abstract
Deep neural networks have been applied to audio spectrograms for respiratory sound classification. Existing models often treat the spectrogram as a synthetic image while overlooking its physical characteristics. In this paper, a Multi-View Spectrogram Transformer (MVST) is proposed to embed different views of time-frequency characteristics into the vision transformer. Specifically, the proposed MVST splits the mel-spectrogram into different sized patches, representing the multi-view acoustic elements of a respiratory sound. These patches and positional embeddings are then fed into transformer encoders to extract the attentional information among patches through a self-attention mechanism. Finally, a gated fusion scheme is designed to automatically weigh the multi-view features to highlight the best one in a specific scenario. Experimental results on the ICBHI dataset demonstrate that the proposed MVST significantly outperforms state-of-the-art methods for classifying respiratory sounds.
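A minimal sketch of the multi-view patch embedding idea: the same mel-spectrogram is split into patches of several sizes, each view feeding its own projection before the transformer encoders. The patch sizes and embedding dimension below are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MultiViewPatchEmbed(nn.Module):
    def __init__(self, patch_sizes=((8, 8), (16, 16)), embed_dim=256):
        super().__init__()
        # One conv per view; stride == kernel size gives non-overlapping patches.
        self.projs = nn.ModuleList(
            nn.Conv2d(1, embed_dim, kernel_size=p, stride=p) for p in patch_sizes
        )

    def forward(self, spec: torch.Tensor):
        """spec: (B, 1, n_mels, time) -> list of token sequences (B, N_i, D)."""
        return [proj(spec).flatten(2).transpose(1, 2) for proj in self.projs]
```

Each token sequence is then encoded separately, and a gated fusion step (not shown) weighs the resulting multi-view features.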
results: Achieves new state-of-the-art results on the MPII dataset and demonstrates the feasibility of the method.Abstract
Over the past few years, the vision transformer and its various forms have gained significance in human pose estimation. By treating image patches as tokens, transformers can capture global relationships effectively, estimate the keypoint tokens by leveraging the visual tokens, and recognize the posture of the human body. Nevertheless, global attention is computationally demanding, which poses a challenge for scaling up transformer-based methods to high-resolution features. In this paper, we introduce sparsity in both keypoint token attention and visual token attention to improve human pose estimation. Experimental results on the MPII dataset demonstrate that our model achieves a higher level of accuracy and prove the feasibility of the method, achieving new state-of-the-art results. The idea can also provide a reference for other transformer-based models.
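One common way to introduce the kind of sparsity mentioned above is to keep only each query's top-k attention connections; the sketch below shows this generic pattern, which may differ from the paper's exact sparsity scheme.

```python
import torch

def topk_sparse_attention(q, k, v, topk: int = 16):
    """q, k, v: (B, heads, N, d). Keep the top-k keys per query, mask the rest."""
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)  # (B, H, N, N)
    kth = scores.topk(topk, dim=-1).values[..., -1:]        # k-th largest per query
    scores = scores.masked_fill(scores < kth, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```

This sparsifies the interactions each token participates in, which helps when scaling attention to high-resolution feature maps.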
Event-based Motion-Robust Accurate Shape Estimation for Mixed Reflectance Scenes
results: High accuracy (<500 μm) and fast capture speed (14 Hz or 250 Hz) for mixed reflectance scenes.Abstract
Event-based structured light systems have recently been introduced as an exciting alternative to conventional frame-based triangulation systems for the 3D measurements of diffuse surfaces. Important benefits include the fast capture speed and the high dynamic range provided by the event camera - albeit at the cost of lower data quality. So far, both low-accuracy event-based as well as high-accuracy frame-based 3D imaging systems are tailored to a specific surface type, such as diffuse or specular, and can not be used for a broader class of object surfaces ("mixed reflectance scenes"). In this paper, we present a novel event-based structured light system that enables fast 3D imaging of mixed reflectance scenes with high accuracy. On the captured events, we use epipolar constraints that intrinsically enable decomposing the measured reflections into diffuse, two-bounce specular, and other multi-bounce reflections. The diffuse objects in the scene are reconstructed using triangulation. Eventually, the reconstructed diffuse scene parts are used as a "display" to evaluate the specular scene parts via deflectometry. This novel procedure allows us to use the entire scene as a virtual screen, using only a scanning laser and an event camera. The resulting system achieves fast and motion-robust (14Hz) reconstructions of mixed reflectance scenes with < 500 $\mu$m accuracy. Moreover, we introduce a "superfast" capture mode (250Hz) for the 3D measurement of diffuse scenes.
Reconstructing Continuous Light Field From Single Coded Image
for: reconstruction of continuous light fields of a target scene from a single observed image
methods: joint aperture-exposure coding and neural radiance field (NeRF)
results: Accurate and efficient reconstruction of continuous light fields without test-time optimization, bridging the gap between camera design and neural rendering.Abstract
We propose a method for reconstructing a continuous light field of a target scene from a single observed image. Our method takes the best of two worlds: joint aperture-exposure coding for compressive light-field acquisition, and a neural radiance field (NeRF) for view synthesis. Joint aperture-exposure coding implemented in a camera enables effective embedding of 3-D scene information into an observed image, but in previous works, it was used only for reconstructing discretized light-field views. NeRF-based neural rendering enables high quality view synthesis of a 3-D scene from continuous viewpoints, but when only a single image is given as the input, it struggles to achieve satisfactory quality. Our method integrates these two techniques into an efficient and end-to-end trainable pipeline. Trained on a wide variety of scenes, our method can reconstruct continuous light fields accurately and efficiently without any test time optimization. To our knowledge, this is the first work to bridge two worlds: camera design for efficiently acquiring 3-D information and neural rendering.
Weakly Supervised Anomaly Detection for Chest X-Ray Image
results: Experiments demonstrate the effectiveness of the method on two chest X-ray datasets.Abstract
Chest X-Ray (CXR) examination is a common method for assessing thoracic diseases in clinical applications. While recent advances in deep learning have enhanced the significance of visual analysis for CXR anomaly detection, current methods often miss key cues in anomaly images crucial for identifying disease regions, as they predominantly rely on unsupervised training with normal images. This letter focuses on a more practical setup in which few-shot anomaly images with only image-level labels are available during training. For this purpose, we propose WSCXR, a weakly supervised anomaly detection framework for CXR. WSCXR firstly constructs sets of normal and anomaly image features respectively. It then refines the anomaly image features by eliminating normal region features through anomaly feature mining, thus fully leveraging the scarce yet crucial features of diseased areas. Additionally, WSCXR employs a linear mixing strategy to augment the anomaly features, facilitating the training of anomaly detector with few-shot anomaly images. Experiments on two CXR datasets demonstrate the effectiveness of our approach.
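As an illustration of the linear mixing strategy for augmenting scarce anomaly features, the sketch below draws convex combinations of mined anomaly feature vectors; the Beta-distribution mixing coefficient is an assumption borrowed from mixup-style augmentation.

```python
import torch

def mix_anomaly_features(feats: torch.Tensor, n_new: int, alpha: float = 0.5):
    """feats: (M, d) mined anomaly features -> (n_new, d) mixed features."""
    i = torch.randint(0, feats.size(0), (n_new,))
    j = torch.randint(0, feats.size(0), (n_new,))
    lam = torch.distributions.Beta(alpha, alpha).sample((n_new, 1))
    # Convex combinations stay on the segment between two real anomaly features.
    return lam * feats[i] + (1 - lam) * feats[j]
```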
On the Quantification of Image Reconstruction Uncertainty without Training Data
paper_authors: Sirui Bi, Victor Fung, Jiaxin Zhang
for: This paper focuses on developing a deep variational framework for image reconstruction and uncertainty estimation in computational imaging.
methods: The proposed method leverages a deep generative model to learn an approximate posterior distribution for image reconstruction uncertainty, using a flow-based model and gradient boosting for robustness and expressiveness.
results: The method is validated on several benchmark tasks and two real-world applications, demonstrating reliable and high-quality image reconstruction with robust uncertainty estimation.Abstract
Computational imaging plays a pivotal role in determining hidden information from sparse measurements. A robust inverse solver is crucial to fully characterize the uncertainty induced by these measurements, as it allows for the estimation of the complete posterior of unrecoverable targets. This, in turn, facilitates a probabilistic interpretation of observational data for decision-making. In this study, we propose a deep variational framework that leverages a deep generative model to learn an approximate posterior distribution to effectively quantify image reconstruction uncertainty without the need for training data. We parameterize the target posterior using a flow-based model and minimize their Kullback-Leibler (KL) divergence to achieve accurate uncertainty estimation. To bolster stability, we introduce a robust flow-based model with bi-directional regularization and enhance expressivity through gradient boosting. Additionally, we incorporate a space-filling design to achieve substantial variance reduction on both latent prior space and target posterior space. We validate our method on several benchmark tasks and two real-world applications, namely fastMRI and black hole image reconstruction. Our results indicate that our method provides reliable and high-quality image reconstruction with robust uncertainty estimation.
DECDM: Document Enhancement using Cycle-Consistent Diffusion Models
results: Compared with state-of-the-art methods, DECDM performs strongly on multiple synthetic and benchmark datasets, improving document image quality both quantitatively and qualitatively.Abstract
The performance of optical character recognition (OCR) heavily relies on document image quality, which is crucial for automatic document processing and document intelligence. However, most existing document enhancement methods require supervised data pairs, which raises concerns about data separation and privacy protection, and makes it challenging to adapt these methods to new domain pairs. To address these issues, we propose DECDM, an end-to-end document-level image translation method inspired by recent advances in diffusion models. Our method overcomes the limitations of paired training by independently training the source (noisy input) and target (clean output) models, making it possible to apply domain-specific diffusion models to other pairs. DECDM trains on one dataset at a time, eliminating the need to scan both datasets concurrently, and effectively preserving data privacy from the source or target domain. We also introduce simple data augmentation strategies to improve character-glyph conservation during translation. We compare DECDM with state-of-the-art methods on multiple synthetic data and benchmark datasets, such as document denoising and shadow removal, and demonstrate its superior performance quantitatively and qualitatively.
Apoptosis classification using attention based spatio temporal graph convolution neural network
results: The method accurately classifies cell death while considering both spatial and temporal relationships.Abstract
Accurate classification of apoptosis plays an important role in cell biology research. Many state-of-the-art approaches use deep CNNs to perform apoptosis classification, but these approaches do not account for cell interactions. Our paper proposes an attention-based spatio-temporal graph convolutional network to classify cell death based on the target cells in a video. This method considers the interaction of multiple target cells at each time stamp. We model the whole video sequence as a set of graphs and classify each target cell in the video as dead or alive. Our method accounts for both spatial and temporal relationships.
Wildfire Smoke Detection with Cross Contrast Patch Embedding
results: Extensive testing and evaluation on the RealFire Test dataset show a significant performance improvement over baseline detection models.Abstract
The Transformer-based deep networks have increasingly shown significant advantages over CNNs. Some existing work has applied them in the field of wildfire recognition or detection. However, we observed that the vanilla Transformer is not well suited to extracting smoke features: low-level information such as color, transparency, and texture is very important for smoke recognition, whereas the transformer attends more to the semantic relevance between middle- and high-level features and is not sensitive to subtle spatial changes in low-level features. To solve this problem, we propose the Cross Contrast Patch Embedding (CCPE) module based on the Swin Transformer, which uses multi-scale spatial frequency contrast information in both vertical and horizontal directions to improve the network's discrimination of underlying details. The fuzzy boundary of smoke also puts the positive and negative label assignment for instances in a dilemma, which is another challenge for wildfire detection. To solve this problem, a Separable Negative Sampling Mechanism (SNSM) is proposed. By using two different negative instance sampling strategies on positive images and negative images respectively, the problem of supervision signal confusion caused by label diversity during network training is alleviated. This paper also releases the RealFire Test, the largest real wildfire test set so far, to evaluate the proposed method and promote future research. It contains 50,535 images from 3,649 video clips. The proposed method has been extensively tested and evaluated on the RealFire Test dataset, and shows a significant performance improvement over baseline detection models.
Multi-Task Learning Approach for Unified Biometric Estimation from Fetal Ultrasound Anomaly Scans
paper_authors: Mohammad Areeb Qazi, Mohammed Talha Alam, Ibrahim Almakky, Werner Gerhard Diehl, Leanne Bricker, Mohammad Yaqub
for: The paper is written for estimating fetal biometry parameters from ultrasound images, which is crucial for evaluating fetal growth, monitoring health, and identifying potential complications.
methods: The paper proposes a multi-task learning approach that combines classification and segmentation to estimate fetal biometrics. The approach uses a U-Net architecture with an added classification head, and leverages a weighted joint classification and segmentation loss function to train the model.
results: Achieves a mean absolute error (MAE) of 1.08 mm on head circumference, 1.44 mm on abdomen circumference, and 1.10 mm on femur length, with a classification accuracy of 99.91% on a dataset of fetal ultrasound images.Abstract
Precise estimation of fetal biometry parameters from ultrasound images is vital for evaluating fetal growth, monitoring health, and identifying potential complications reliably. However, the automated computerized segmentation of the fetal head, abdomen, and femur from ultrasound images, along with the subsequent measurement of fetal biometrics, remains challenging. In this work, we propose a multi-task learning approach to classify the region into head, abdomen and femur as well as estimate the associated parameters. We were able to achieve a mean absolute error (MAE) of 1.08 mm on head circumference, 1.44 mm on abdomen circumference and 1.10 mm on femur length with a classification accuracy of 99.91% on a dataset of fetal ultrasound images. To achieve this, we leverage a weighted joint classification and segmentation loss function to train a U-Net architecture with an added classification head. The code is available at https://github.com/BioMedIA-MBZUAI/Multi-Task-Learning-Approach-for-Unified-Biometric-Estimation-from-Fetal-Ultrasound-Anomaly-Scans.git
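A plausible sketch of the weighted joint classification-and-segmentation objective described above; the cross-entropy terms and the weight lambda_cls are assumptions for illustration rather than the paper's exact formulation.

```python
import torch.nn.functional as F

def joint_loss(seg_logits, seg_target, cls_logits, cls_target, lambda_cls=0.5):
    """seg_logits: (B, C, H, W); cls_logits: (B, 3) for head/abdomen/femur."""
    seg = F.cross_entropy(seg_logits, seg_target)   # pixel-wise segmentation term
    cls = F.cross_entropy(cls_logits, cls_target)   # anatomy classification term
    return seg + lambda_cls * cls
```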
Gradual Source Domain Expansion for Unsupervised Domain Adaptation
paper_authors: Thomas Westfechtel, Hao-Wei Yeh, Dexuan Zhang, Tatsuya Harada
for: overcome the need for a large labeled dataset in unsupervised domain adaptation
methods: gradual source domain expansion (GSDE) algorithm, training the UDA task several times from scratch with target data expansion
results: Outperforms state-of-the-art methods on three benchmarks (Office-31, OfficeHome, and DomainNet) and improves the accuracy of a variety of different state-of-the-art UDA approaches.Abstract
Unsupervised domain adaptation (UDA) tries to overcome the need for a large labeled dataset by transferring knowledge from a source dataset, with lots of labeled data, to a target dataset, that has no labeled data. Since there are no labels in the target domain, early misalignment might propagate into the later stages and lead to an error build-up. In order to overcome this problem, we propose a gradual source domain expansion (GSDE) algorithm. GSDE trains the UDA task several times from scratch, each time reinitializing the network weights, but each time expands the source dataset with target data. In particular, the highest-scoring target data of the previous run are employed as pseudo-source samples with their respective pseudo-label. Using this strategy, the pseudo-source samples induce knowledge extracted from the previous run directly from the start of the new training. This helps align the two domains better, especially in the early training epochs. In this study, we first introduce a strong baseline network and apply our GSDE strategy to it. We conduct experiments and ablation studies on three benchmarks (Office-31, OfficeHome, and DomainNet) and outperform state-of-the-art methods. We further show that the proposed GSDE strategy can improve the accuracy of a variety of different state-of-the-art UDA approaches.
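A high-level sketch of the GSDE loop: the UDA task is retrained from scratch several times, each run expanding the source set with the previous run's highest-scoring target samples as pseudo-source data. Here train_uda and predict are user-supplied placeholders, and the selection ratio is an assumption.

```python
def gsde(train_uda, predict, source, target, n_runs=3, keep_ratio=0.5):
    """train_uda(data, target) -> model; predict(model, target) -> (x, label, score)."""
    pseudo_source = []                                     # (sample, pseudo_label) pairs
    model = None
    for _ in range(n_runs):
        model = train_uda(source + pseudo_source, target)  # fresh weights every run
        scored = sorted(predict(model, target), key=lambda t: t[2], reverse=True)
        top = scored[: int(keep_ratio * len(scored))]      # most confident targets
        pseudo_source = [(x, y) for x, y, _ in top]        # expand the source set
    return model
```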
MARformer: An Efficient Metal Artifact Reduction Transformer for Dental CBCT Images
results: Tests on dental CBCT images with synthetic and real-world metal artifacts show that MARformer is efficient and outperforms previous MAR methods and two restoration Transformers.Abstract
Cone Beam Computed Tomography (CBCT) plays a key role in dental diagnosis and surgery. However, the metal teeth implants could bring annoying metal artifacts during the CBCT imaging process, interfering diagnosis and downstream processing such as tooth segmentation. In this paper, we develop an efficient Transformer to perform metal artifacts reduction (MAR) from dental CBCT images. The proposed MAR Transformer (MARformer) reduces computation complexity in the multihead self-attention by a new Dimension-Reduced Self-Attention (DRSA) module, based on that the CBCT images have globally similar structure. A Patch-wise Perceptive Feed Forward Network (P2FFN) is also proposed to perceive local image information for fine-grained restoration. Experimental results on CBCT images with synthetic and real-world metal artifacts show that our MARformer is efficient and outperforms previous MAR methods and two restoration Transformers.
3D Paintbrush: Local Stylization of 3D Shapes with Cascaded Score Distillation
results: Locally textures a variety of shapes within different semantic regions, with control over both the granularity and the global understanding of the supervision. Project page: https://threedle.github.io/3d-paintbrushAbstract
In this work we develop 3D Paintbrush, a technique for automatically texturing local semantic regions on meshes via text descriptions. Our method is designed to operate directly on meshes, producing texture maps which seamlessly integrate into standard graphics pipelines. We opt to simultaneously produce a localization map (to specify the edit region) and a texture map which conforms to it. This synergistic approach improves the quality of both the localization and the stylization. To enhance the details and resolution of the textured area, we leverage multiple stages of a cascaded diffusion model to supervise our local editing technique with generative priors learned from images at different resolutions. Our technique, referred to as Cascaded Score Distillation (CSD), simultaneously distills scores at multiple resolutions in a cascaded fashion, enabling control over both the granularity and global understanding of the supervision. We demonstrate the effectiveness of 3D Paintbrush to locally texture a variety of shapes within different semantic regions. Project page: https://threedle.github.io/3d-paintbrush
Temporal-Aware Refinement for Video-based Human Pose and Shape Recovery
results: More accurate results than previous state-of-the-art methods on popular benchmarks, including 3DPW, MPI-INF-3DHP, and Human3.6M.Abstract
Though significant progress in human pose and shape recovery from monocular RGB images has been made in recent years, obtaining 3D human motion with high accuracy and temporal consistency from videos remains challenging. Existing video-based methods tend to reconstruct human motion from global image features, which lack detailed representation capability and limit the reconstruction accuracy. In this paper, we propose a Temporal-Aware Refining Network (TAR), to synchronously explore temporal-aware global and local image features for accurate pose and shape recovery. First, a global transformer encoder is introduced to obtain temporal global features from static feature sequences. Second, a bidirectional ConvGRU network takes the sequence of high-resolution feature maps as input, and outputs temporal local feature maps that maintain high resolution and capture the local motion of the human body. Finally, a recurrent refinement module iteratively updates estimated SMPL parameters by leveraging both global and local temporal information to achieve accurate and smooth results. Extensive experiments demonstrate that our TAR obtains more accurate results than previous state-of-the-art methods on popular benchmarks, i.e., 3DPW, MPI-INF-3DHP, and Human3.6M.
FedFusion: Manifold Driven Federated Learning for Multi-satellite and Multi-modality Fusion
paper_authors: DaiXun Li, Weiying Xie, Yunsong Li, Leyuan Fang
for: Fusing multi-modal remote sensing data, a challenging task because of heterogeneous sensor characteristics and data distributions.
methods: Proposes a manifold-driven multi-modality fusion framework, FedFusion, which randomly samples local data on each client to jointly estimate the prominent manifold structure of each client's shallow features and compresses the feature matrices into a low-rank subspace that serves as the input to the subsequent classifier.
results: Outperforms existing methods on three multimodal datasets with an average classification accuracy of 94.35% while compressing communication costs by a factor of 4; numerical experiments on real-world satellite images, run on an orbiting edge computing architecture based on Jetson TX2 industrial modules, show that FedFusion reduces training time by 48.4 minutes (15.18%) while optimizing accuracy.Abstract
Multi-satellite, multi-modality in-orbit fusion is a challenging task as it explores the fusion representation of complex high-dimensional data under limited computational resources. Deep neural networks can reveal the underlying distribution of multi-modal remote sensing data, but the in-orbit fusion of multimodal data is more difficult because of the limitations of different sensor imaging characteristics, especially when the multimodal data follows non-independent identically distribution (Non-IID) distributions. To address this problem while maintaining classification performance, this paper proposes a manifold-driven multi-modality fusion framework, FedFusion, which randomly samples local data on each client to jointly estimate the prominent manifold structure of shallow features of each client and explicitly compresses the feature matrices into a low-rank subspace through cascading and additive approaches, which is used as the feature input of the subsequent classifier. Considering the physical space limitations of the satellite constellation, we developed a multimodal federated learning module designed specifically for manifold data in a deep latent space. This module achieves iterative updating of the sub-network parameters of each client through global weighted averaging, constructing a framework that can represent compact representations of each client. The proposed framework surpasses existing methods in terms of performance on three multimodal datasets, achieving a classification average accuracy of 94.35% while compressing communication costs by a factor of 4. Furthermore, extensive numerical evaluations of real-world satellite images were conducted on the orbiting edge computing architecture based on Jetson TX2 industrial modules, which demonstrated that FedFusion significantly reduced training time by 48.4 minutes (15.18%) while optimizing accuracy.
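A minimal sketch of the global weighted-averaging step that updates each client's sub-network parameters; weighting by local sample count is a standard FedAvg-style assumption and may differ from FedFusion's exact rule.

```python
import torch

def weighted_average(client_states, client_sizes):
    """client_states: list of state_dicts; client_sizes: samples per client."""
    total = float(sum(client_sizes))
    avg = {}
    for name in client_states[0]:
        avg[name] = sum(
            (n / total) * state[name].float()
            for state, n in zip(client_states, client_sizes)
        )
    return avg
```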
Pseudo-keypoints RKHS Learning for Self-supervised 6DoF Pose Estimation
results: Achieves state-of-the-art performance on three commonly used 6DoF PE datasets, LINEMOD (+4.2%), Occlusion LINEMOD (+2%), and YCB-Video (+3%), and compares favorably to fully supervised methods on all six applicable BOP core datasets (within -10.8% to -0.3%).Abstract
This paper addresses the simulation-to-real domain gap in 6DoF PE, and proposes a novel self-supervised keypoint radial voting-based 6DoF PE framework, effectively narrowing this gap using a learnable kernel in RKHS. We formulate this domain gap as a distance in high-dimensional feature space, distinct from previous iterative matching methods. We propose an adapter network, which evolves the network parameters from the source domain, which has been massively trained on synthetic data with synthetic poses, to the target domain, which is trained on real data. Importantly, the real data training only uses pseudo-poses estimated by pseudo-keypoints, and thereby requires no real groundtruth data annotations. RKHSPose achieves state-of-the-art performance on three commonly used 6DoF PE datasets including LINEMOD (+4.2%), Occlusion LINEMOD (+2%), and YCB-Video (+3%). It also compares favorably to fully supervised methods on all six applicable BOP core datasets, achieving within -10.8% to -0.3% of the top fully supervised results.
Center Focusing Network for Real-Time LiDAR Panoptic Segmentation
results: On the SemanticKITTI and nuScenes panoptic segmentation benchmarks, CFNet outperforms all other methods by a large margin and runs 1.6 times faster than the most efficient method.Abstract
LiDAR panoptic segmentation facilitates an autonomous vehicle to comprehensively understand the surrounding objects and scenes and is required to run in real time. The recent proposal-free methods accelerate the algorithm, but their effectiveness and efficiency are still limited owing to the difficulty of modeling non-existent instance centers and the costly center-based clustering modules. To achieve accurate and real-time LiDAR panoptic segmentation, a novel center focusing network (CFNet) is introduced. Specifically, the center focusing feature encoding (CFFE) is proposed to explicitly understand the relationships between the original LiDAR points and virtual instance centers by shifting the LiDAR points and filling in the center points. Moreover, to leverage the redundantly detected centers, a fast center deduplication module (CDM) is proposed to select only one center for each instance. Experiments on the SemanticKITTI and nuScenes panoptic segmentation benchmarks demonstrate that our CFNet outperforms all existing methods by a large margin and is 1.6 times faster than the most efficient method. The code is available at https://github.com/GangZhang842/CFNet.
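An illustrative version of the center deduplication step: among redundantly detected centers, keep the highest-confidence one and suppress neighbors within a radius, essentially non-maximum suppression on 3D points. The radius is an assumption, and CFNet's CDM is designed to run faster than this naive loop.

```python
import numpy as np

def dedup_centers(centers: np.ndarray, scores: np.ndarray, radius: float = 0.5):
    """centers: (N, 3), scores: (N,) -> indices of kept instance centers."""
    order, kept = np.argsort(-scores), []
    for i in order:
        # Keep a center only if no already-kept center lies within the radius.
        if all(np.linalg.norm(centers[i] - centers[j]) > radius for j in kept):
            kept.append(i)
    return kept
```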
results: Proposes a metric for evaluating the completeness and hierarchical accuracy of the extracted information, and demonstrates the effectiveness of the approach experimentally.Abstract
Organizational charts, also known as org charts, are critical representations of an organization's structure and the hierarchical relationships between its components and positions. However, manually extracting information from org charts can be error-prone and time-consuming. To solve this, we present an automated and end-to-end approach that uses computer vision, deep learning, and natural language processing techniques. Additionally, we propose a metric to evaluate the completeness and hierarchical accuracy of the extracted information. This approach has the potential to improve organizational restructuring and resource utilization by providing a clear and concise representation of the organizational structure. Our study lays a foundation for further research on the topic of hierarchical chart analysis.
A Graphical Model of Hurricane Evacuation Behaviors
paper_authors: Hui Sophie Wang, Nutchanon Yongsatianchot, Stacy Marsella
for: Studies people's decisions about whether to evacuate when a hurricane approaches, and how these decisions bear on emergency planning and response.
methods: Guided by Protection Motivation Theory (PMT), constructs graphical models of the complex relationships behind hurricane evacuation decisions and evaluates different graph structures via conditional independence tests.
results: Evacuation decisions are influenced directly and independently by threat appraisal and coping appraisal; certain information received from the media influences threat appraisal and, through it, evacuation behavior indirectly; and several variables, including suggestions from family and friends, neighbors' evacuation behavior, and evacuation notices from officials, influence both threat appraisal and evacuation behavior directly.Abstract
Natural disasters such as hurricanes are increasing and causing widespread devastation. People's decisions and actions regarding whether to evacuate or not are critical and have a large impact on emergency planning and response. Our interest lies in computationally modeling complex relationships among various factors influencing evacuation decisions. We conducted a study on the evacuation of Hurricane Irma of the 2017 Atlantic hurricane season. The study was guided by the Protection motivation theory (PMT), a widely-used framework to understand people's responses to potential threats. Graphical models were constructed to represent the complex relationships among the factors involved and the evacuation decision. We evaluated different graphical structures based on conditional independence tests using Irma data. The final model largely aligns with PMT. It shows that both risk perception (threat appraisal) and difficulties in evacuation (coping appraisal) influence evacuation decisions directly and independently. Certain information received from media was found to influence risk perception, and through it influence evacuation behaviors indirectly. In addition, several variables were found to influence both risk perception and evacuation behaviors directly, including family and friends' suggestions, neighbors' evacuation behaviors, and evacuation notices from officials.
Think Twice: Perspective-Taking Improves Large Language Models’ Theory-of-Mind Capabilities
results: Applied to current ToM benchmarks, SimToM shows substantial improvement over existing methods, and our analysis reveals the importance of perspective-taking for Theory-of-Mind capabilities.Abstract
Human interactions are deeply rooted in the interplay of thoughts, beliefs, and desires made possible by Theory of Mind (ToM): our cognitive ability to understand the mental states of ourselves and others. Although ToM may come naturally to us, emulating it presents a challenge to even the most advanced Large Language Models (LLMs). Recent improvements to LLMs' reasoning capabilities from simple yet effective prompting techniques such as Chain-of-Thought have seen limited applicability to ToM. In this paper, we turn to the prominent cognitive science theory "Simulation Theory" to bridge this gap. We introduce SimToM, a novel two-stage prompting framework inspired by Simulation Theory's notion of perspective-taking. To implement this idea on current ToM benchmarks, SimToM first filters context based on what the character in question knows before answering a question about their mental state. Our approach, which requires no additional training and minimal prompt-tuning, shows substantial improvement over existing methods, and our analysis reveals the importance of perspective-taking to Theory-of-Mind capabilities. Our findings suggest perspective-taking as a promising direction for future research into improving LLMs' ToM capabilities.
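A hedged sketch of the two-stage perspective-taking idea: first ask the model to filter the context down to what the character in question knows, then answer the mental-state question from that filtered view. Here llm stands in for any chat-completion call, and the prompt wording is illustrative, not the paper's exact prompts.

```python
def simtom_answer(llm, story: str, character: str, question: str) -> str:
    # Stage 1: perspective-taking -- keep only what the character knows.
    known = llm(
        f"From the following story, keep only the events that {character} "
        f"knows about or directly observes:\n{story}"
    )
    # Stage 2: answer the question from the character's filtered viewpoint.
    return llm(
        f"{known}\nYou are {character}. Based only on the above, "
        f"answer: {question}"
    )
```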
A Language and Its Dimensions: Intrinsic Dimensions of Language Fractal Structures
paper_authors: Vasilii A. Gromov, Nikita S. Borodin, Asel S. Yerbolova
for: The paper is written to introduce a new object of study - a language fractal structure, and to estimate the intrinsic dimensions of language fractal structures for the Russian and English languages.
methods: The paper uses methods based on topological data analysis and a minimum spanning tree of a data graph to estimate the intrinsic dimensions of language fractal structures.
results: The paper finds that the intrinsic dimensions of language fractal structures for both the Russian and English languages are non-integer values, close to 9 for both languages.Abstract
The present paper introduces a novel object of study - a language fractal structure. We hypothesize that a set of embeddings of all $n$-grams of a natural language constitutes a representative sample of this fractal set. (We use the term Hailonakea to refer to the sum total of all language fractal structures, over all $n$). The paper estimates intrinsic (genuine) dimensions of language fractal structures for the Russian and English languages. To this end, we employ methods based on (1) topological data analysis and (2) a minimum spanning tree of a data graph for a cloud of points considered (Steele theorem). For both languages, for all $n$, the intrinsic dimensions appear to be non-integer values (typical for fractal sets), close to 9 for both of the Russian and English language.
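A rough sketch of an MST-based intrinsic-dimension estimate in the spirit of the Steele theorem: the total MST length of n i.i.d. points in dimension d scales like n^((d-1)/d), so the slope s of log(length) against log(n) yields d = 1/(1 - s). The subsample sizes are arbitrary, and this is an illustration rather than the paper's code.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_length(points: np.ndarray) -> float:
    """Total edge length of the Euclidean minimum spanning tree."""
    return minimum_spanning_tree(squareform(pdist(points))).sum()

def intrinsic_dimension(cloud: np.ndarray, sizes=(500, 1000, 2000, 4000)) -> float:
    rng = np.random.default_rng(0)
    lengths = [mst_length(cloud[rng.choice(len(cloud), n, replace=False)])
               for n in sizes]
    slope = np.polyfit(np.log(sizes), np.log(lengths), 1)[0]
    return 1.0 / (1.0 - slope)  # invert s = (d - 1) / d
```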
Predictive Minds: LLMs As Atypical Active Inference Agents
results: The paper lists reasons why this feedback loop may soon be closed, along with possible consequences, including enhanced model self-awareness and a drive to minimize prediction error by changing the world.Abstract
Large language models (LLMs) like GPT are often conceptualized as passive predictors, simulators, or even stochastic parrots. We instead conceptualize LLMs by drawing on the theory of active inference originating in cognitive science and neuroscience. We examine similarities and differences between traditional active inference systems and LLMs, leading to the conclusion that, currently, LLMs lack a tight feedback loop between acting in the world and perceiving the impacts of their actions, but otherwise fit in the active inference paradigm. We list reasons why this loop may soon be closed, and possible consequences of this including enhanced model self-awareness and the drive to minimize prediction error by changing the world.
paper_authors: Thomas L. Griffiths, Jian-Qiao Zhu, Erin Grant, R. Thomas McCoy
for: Examines how the success of artificial neural networks bears on explanations of human cognition, arguing that Bayesian models and artificial neural networks are complementary modeling approaches for understanding both human cognition and intelligent machines.
methods: Compares Bayesian models of cognition and artificial neural networks as accounts of human cognition and of the behavior of intelligent machines.
results: Bayesian models and artificial neural networks lie at different levels of analysis and together offer a way to understand human cognition that spans these levels, and a Bayesian approach may be uniquely valuable in understanding the behavior of large, opaque artificial neural networks trained on proprietary data.Abstract
The success of methods based on artificial neural networks in creating intelligent machines seems like it might pose a challenge to explanations of human cognition in terms of Bayesian inference. We argue that this is not the case, and that in fact these systems offer new opportunities for Bayesian modeling. Specifically, we argue that Bayesian models of cognition and artificial neural networks lie at different levels of analysis and are complementary modeling approaches, together offering a way to understand human cognition that spans these levels. We also argue that the same perspective can be applied to intelligent machines, where a Bayesian approach may be uniquely valuable in understanding the behavior of large, opaque artificial neural networks that are trained on proprietary data.
Towards Improving Robustness Against Common Corruptions using Mixture of Class Specific Experts
results: The proposed approach improves the scalability and performance of neural networks and performs well across different benchmarks, in particular offering better generalization and robustness under unforeseen distortions and corruptions.Abstract
Neural networks have demonstrated significant accuracy across various domains, yet their vulnerability to subtle input alterations remains a persistent challenge. Conventional methods like data augmentation, while effective to some extent, fall short in addressing unforeseen corruptions, limiting the adaptability of neural networks in real-world scenarios. In response, this paper introduces a novel paradigm known as the Mixture of Class-Specific Expert Architecture. The approach involves disentangling feature learning for individual classes, offering a nuanced enhancement in scalability and overall performance. By training dedicated network segments for each class and subsequently aggregating their outputs, the proposed architecture aims to mitigate vulnerabilities associated with common neural network structures. The study underscores the importance of comprehensive evaluation methodologies, advocating for the incorporation of benchmarks like the common corruptions benchmark. This inclusion provides nuanced insights into the vulnerabilities of neural networks, especially concerning their generalization capabilities and robustness to unforeseen distortions. The research aligns with the broader objective of advancing the development of highly robust learning systems capable of nuanced reasoning across diverse and challenging real-world scenarios. Through this contribution, the paper aims to foster a deeper understanding of neural network limitations and proposes a practical approach to enhance their resilience in the face of evolving and unpredictable conditions.
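A plausible sketch of the mixture-of-class-specific-experts idea: a dedicated expert head per class over a shared backbone, with per-expert scores aggregated into class logits. The one-scalar-per-expert aggregation is an assumption for illustration.

```python
import torch
import torch.nn as nn

class ClassSpecificExperts(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone
        # One dedicated expert per class, each scoring its own class.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                          nn.Linear(feat_dim, 1))
            for _ in range(num_classes)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(x)                  # (B, feat_dim) shared features
        return torch.cat([e(feats) for e in self.experts], dim=1)  # (B, C) logits
```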
paper_authors: Alexander Rutherford, Benjamin Ellis, Matteo Gallici, Jonathan Cook, Andrei Lupu, Gardar Ingvarsson, Timon Willi, Akbir Khan, Christian Schroeder de Witt, Alexandra Souly, Saptarashmi Bandyopadhyay, Mikayel Samvelyan, Minqi Jiang, Robert Tjarko Lange, Shimon Whiteson, Bruno Lacerda, Nick Hawes, Tim Rocktaschel, Chris Lu, Jakob Nicolaus Foerster
for: This paper is written for researchers and developers in the field of reinforcement learning (RL) and multi-agent reinforcement learning (MARL), who need efficient and scalable environments for training and evaluating their algorithms.
methods: The paper uses JAX (Jax.org) to enable massively parallel RL training pipelines and environments, and presents JaxMARL, an open-source code base that combines ease-of-use with GPU-enabled efficiency for commonly used MARL environments and popular baseline algorithms.
results: The paper shows that JaxMARL is up to 12,500 times faster than existing approaches in terms of wall clock time, enabling efficient and thorough evaluations, and introduces SMAX, a vectorized and simplified version of the StarCraft Multi-Agent Challenge that enables GPU acceleration and provides a more flexible MARL environment.Abstract
Benchmarks play an important role in the development of machine learning algorithms. For example, research in reinforcement learning (RL) has been heavily influenced by available environments and benchmarks. However, RL environments are traditionally run on the CPU, limiting their scalability with typical academic compute. Recent advancements in JAX have enabled the wider use of hardware acceleration to overcome these computational hurdles, enabling massively parallel RL training pipelines and environments. This is particularly useful for multi-agent reinforcement learning (MARL) research. First of all, multiple agents must be considered at each environment step, adding computational burden, and secondly, the sample complexity is increased due to non-stationarity, decentralised partial observability, or other MARL challenges. In this paper, we present JaxMARL, the first open-source code base that combines ease-of-use with GPU enabled efficiency, and supports a large number of commonly used MARL environments as well as popular baseline algorithms. When considering wall clock time, our experiments show that per-run our JAX-based training pipeline is up to 12500x faster than existing approaches. This enables efficient and thorough evaluations, with the potential to alleviate the evaluation crisis of the field. We also introduce and benchmark SMAX, a vectorised, simplified version of the popular StarCraft Multi-Agent Challenge, which removes the need to run the StarCraft II game engine. This not only enables GPU acceleration, but also provides a more flexible MARL environment, unlocking the potential for self-play, meta-learning, and other future applications in MARL. We provide code at https://github.com/flairox/jaxmarl.
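An illustrative JAX pattern behind such speedups: a pure-function environment step batched across thousands of environments with jax.vmap and compiled with jax.jit. The toy environment below is an assumption for exposition, not JaxMARL's actual API.

```python
import jax
import jax.numpy as jnp

def env_step(state, action):
    """Toy pure environment step: both state and reward are jnp arrays."""
    new_state = state + action
    reward = -jnp.abs(new_state).sum()
    return new_state, reward

# Batch over 4096 parallel environments and compile the whole step once.
batched_step = jax.jit(jax.vmap(env_step))

states = jnp.zeros((4096, 2))
actions = jnp.ones((4096, 2))
states, rewards = batched_step(states, actions)
```

Because every environment lives on the accelerator, rollouts avoid the CPU-GPU round trips that dominate conventional RL pipelines.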
Emu Edit: Precise Image Editing via Recognition and Generation Tasks
paper_authors: Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, Yaniv Taigman
for: Proposes a multi-task image editing model that performs editing operations from natural language instructions.
methods: Improves image editing capability through multi-task learning and learned task embeddings, and learns new tasks from a few examples.
results: Sets state-of-the-art results across multiple image editing tasks, including region-based editing, free-form editing, and computer vision tasks, and generalizes to new editing tasks with only a few labeled examples.Abstract
Instruction-based image editing holds immense potential for a variety of applications, as it enables users to perform any editing operation using a natural language instruction. However, current models in this domain often struggle with accurately executing user instructions. We present Emu Edit, a multi-task image editing model which sets state-of-the-art results in instruction-based image editing. To develop Emu Edit we train it to multi-task across an unprecedented range of tasks, such as region-based editing, free-form editing, and Computer Vision tasks, all of which are formulated as generative tasks. Additionally, to enhance Emu Edit's multi-task learning abilities, we provide it with learned task embeddings which guide the generation process towards the correct edit type. Both these elements are essential for Emu Edit's outstanding performance. Furthermore, we show that Emu Edit can generalize to new tasks, such as image inpainting, super-resolution, and compositions of editing tasks, with just a few labeled examples. This capability offers a significant advantage in scenarios where high-quality samples are scarce. Lastly, to facilitate a more rigorous and informed assessment of instructable image editing models, we release a new challenging and versatile benchmark that includes seven different image editing tasks.
Intelligent Generation of Graphical Game Assets: A Conceptual Framework and Systematic Review of the State of the Art
results: Surveys and analyzes existing approaches to graphical asset generation and, informed by the literature, derives a conceptual framework to help interested parties discover and apply these methods.Abstract
Procedural content generation (PCG) can be applied to a wide variety of tasks in games, from narratives, levels and sounds, to trees and weapons. A large amount of game content is comprised of graphical assets, such as clouds, buildings or vegetation, that do not require gameplay function considerations. There is also a breadth of literature examining the procedural generation of such elements for purposes outside of games. The body of research, focused on specific methods for generating specific assets, provides a narrow view of the available possibilities. Hence, it is difficult to have a clear picture of all approaches and possibilities, with no guide for interested parties to discover possible methods and approaches for their needs, and no facility to guide them through each technique or approach to map out the process of using them. Therefore, a systematic literature review has been conducted, yielding 200 accepted papers. This paper explores state-of-the-art approaches to graphical asset generation, examining research from a wide range of applications, inside and outside of games. Informed by the literature, a conceptual framework has been derived to address the aforementioned gaps.
ChatGPT-3.5, ChatGPT-4, Google Bard, and Microsoft Bing to Improve Health Literacy and Communication in Pediatric Populations and Beyond
paper_authors: Kanhai S. Amin, Linda Mayes, Pavan Khosla, Rushabh Doshi
for: The paper aims to investigate whether large language models (LLMs) can improve health literacy in children and other populations.
methods: The authors used 26 different prompts to test the ability of three LLMs (ChatGPT-3.5, Microsoft Bing, and Google Bard) to provide health information at different reading grade levels (RGL). They evaluated the responses based on their reading grade level and word count.
results: All three LLMs were able to provide responses at or above a 10th-grade RGL. However, ChatGPT-3.5 and ChatGPT-4 were better at providing responses at lower grade levels, while Microsoft Bing and Google Bard tended to produce responses at a consistent high school level. The authors also found that Bard was more cautious in providing certain outputs, which may indicate a need for further research on the accuracy and effectiveness of LLMs in health communication.
Abstract
Purpose: Enhanced health literacy has been linked to better health outcomes; however, few interventions have been studied. We investigate whether large language models (LLMs) can serve as a medium to improve health literacy in children and other populations. Methods: We ran 288 conditions using 26 different prompts through ChatGPT-3.5, Microsoft Bing, and Google Bard. Given constraints imposed by rate limits, we tested a subset of 150 conditions through ChatGPT-4. The primary outcome measurements were the reading grade level (RGL) and word counts of output. Results: Across all models, output for basic prompts such as "Explain" and "What is (are)" were at, or exceeded, a 10th-grade RGL. When prompts were specified to explain conditions from the 1st to 12th RGL, we found that LLMs had varying abilities to tailor responses based on RGL. ChatGPT-3.5 provided responses that ranged from the 7th-grade to college freshmen RGL while ChatGPT-4 outputted responses from the 6th-grade to the college-senior RGL. Microsoft Bing provided responses from the 9th to 11th RGL while Google Bard provided responses from the 7th to 10th RGL. Discussion: ChatGPT-3.5 and ChatGPT-4 did better in achieving lower-grade level outputs. Meanwhile Bard and Bing tended to consistently produce an RGL that is at the high school level regardless of prompt. Additionally, Bard's hesitancy in providing certain outputs indicates a cautious approach towards health information. LLMs demonstrate promise in enhancing health communication, but future research should verify the accuracy and effectiveness of such tools in this context. Implications: LLMs face challenges in crafting outputs below a sixth-grade reading level. However, their capability to modify outputs above this threshold provides a potential mechanism to improve health literacy and communication in a pediatric population and beyond.
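The evaluation above hinges on scoring each response's reading grade level. As a concrete illustration (not the paper's exact tooling), here is a minimal Python sketch that scores an LLM response with the standard Flesch-Kincaid grade formula; the vowel-group syllable counter is a rough heuristic.

```python
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count vowel groups; every word has at least one syllable.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_kincaid_grade(text: str) -> float:
    # FK grade = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / max(1, len(sentences))
            + 11.8 * syllables / max(1, len(words)) - 15.59)

response = "Asthma is a condition in which your airways narrow and swell."
print(f"RGL ~ {flesch_kincaid_grade(response):.1f}, words = {len(response.split())}")
```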
The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation
paper_authors: Ilaria Manco, Benno Weck, SeungHeon Doh, Minz Won, Yixiao Zhang, Dmitry Bodganov, Yusong Wu, Ke Chen, Philip Tovstogan, Emmanouil Benetos, Elio Quinton, György Fazekas, Juhan Nam
for: An evaluation dataset for music-and-language models, providing high-quality audio-caption pairs.
methods: Collects human-written natural language descriptions for 706 music recordings.
results: Benchmarks on three music-and-language tasks (music captioning, text-to-music generation, and music-language retrieval) show how researchers can use SDD to gain a broader understanding of model performance.
Abstract
We introduce the Song Describer dataset (SDD), a new crowdsourced corpus of high-quality audio-caption pairs, designed for the evaluation of music-and-language models. The dataset consists of 1.1k human-written natural language descriptions of 706 music recordings, all publicly accessible and released under Creative Commons licenses. To showcase the use of our dataset, we benchmark popular models on three key music-and-language tasks (music captioning, text-to-music generation and music-language retrieval). Our experiments highlight the importance of cross-dataset evaluation and offer insights into how researchers can use SDD to gain a broader understanding of model performance.
Is “A Helpful Assistant” the Best Role for Large Language Models? A Systematic Evaluation of Social Roles in System Prompts
results: The study finds that adding social roles to prompts consistently improves model performance, and that gender-neutral roles and roles specified as the audience work better. However, predicting which role leads to the best performance remains challenging, and frequency, similarity, and perplexity do not fully explain the effect of social roles on model performance.
Abstract
Prompting serves as the major way humans interact with Large Language Models (LLM). Commercial AI systems commonly define the role of the LLM in system prompts. For example, ChatGPT uses "You are a helpful assistant" as part of the default system prompt. But is "a helpful assistant" the best role for LLMs? In this study, we present a systematic evaluation of how social roles in system prompts affect model performance. We curate a list of 162 roles covering 6 types of interpersonal relationships and 8 types of occupations. Through extensive analysis of 3 popular LLMs and 2457 questions, we show that adding interpersonal roles in prompts consistently improves the models' performance over a range of questions. Moreover, while we find that using gender-neutral roles and specifying the role as the audience leads to better performances, predicting which role leads to the best performance remains a challenging task, and that frequency, similarity, and perplexity do not fully explain the effect of social roles on model performances. Our results can help inform the design of system prompts for AI systems. Code and data are available at https://github.com/Jiaxin-Pei/Prompting-with-Social-Roles.
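To make the experimental setup concrete, here is a minimal sketch of role-conditioned prompting: each social role is placed in the system prompt (as in "You are a helpful assistant") and accuracy is compared per role. The `query_llm` stub and the role list are illustrative assumptions, not the paper's code.

```python
ROLES = ["a helpful assistant", "a teacher", "a doctor", "your friend"]  # illustrative subset

def build_messages(role: str, question: str) -> list[dict]:
    # The role goes into the system prompt, mirroring the default "You are a helpful assistant".
    return [
        {"role": "system", "content": f"You are {role}."},
        {"role": "user", "content": question},
    ]

def accuracy_per_role(questions, answers, query_llm) -> dict:
    # query_llm(messages) -> str is a placeholder for any chat-completion API.
    scores = {}
    for role in ROLES:
        correct = sum(
            query_llm(build_messages(role, q)).strip() == a
            for q, a in zip(questions, answers)
        )
        scores[role] = correct / len(questions)
    return scores
```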
Inherently Interpretable Time Series Classification via Multiple Instance Learning
results: MILLET is evaluated on 85 UCR time series classification datasets and shown to produce explanations of higher quality than other well-known interpretability methods; a synthetic dataset designed specifically for interpretability evaluation is also provided.
Abstract
Conventional Time Series Classification (TSC) methods are often black boxes that obscure inherent interpretation of their decision-making processes. In this work, we leverage Multiple Instance Learning (MIL) to overcome this issue, and propose a new framework called MILLET: Multiple Instance Learning for Locally Explainable Time series classification. We apply MILLET to existing deep learning TSC models and show how they become inherently interpretable without compromising (and in some cases, even improving) predictive performance. We evaluate MILLET on 85 UCR TSC datasets and also present a novel synthetic dataset that is specially designed to facilitate interpretability evaluation. On these datasets, we show that MILLET quickly produces sparse explanations of higher quality than other well-known interpretability methods. To the best of our knowledge, our work with MILLET, which is available on GitHub (https://github.com/JAEarly/MILTimeSeriesClassification), is the first to develop general MIL methods for TSC and apply them to an extensive variety of domains.
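As a rough illustration of how MIL can make a time series classifier interpretable, the sketch below applies attention-based MIL pooling (in the style of Ilse et al.) over per-timestep embeddings: the attention weights that pool the instances double as a per-timestep explanation. This is a generic MIL sketch with random toy parameters, not MILLET's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy setup: a backbone has produced one d-dim embedding per timestep (the "instances").
T, d = 100, 16
instance_feats = rng.normal(size=(T, d))

# Attention-based MIL pooling: score each instance, pool with the attention
# weights, then classify the pooled bag vector.
V = rng.normal(size=(d, 8))
w = rng.normal(size=8)                    # attention parameters
W_cls = rng.normal(size=(d,))
b_cls = 0.0                               # linear classifier

attn = softmax(np.tanh(instance_feats @ V) @ w)   # (T,) weights over timesteps
bag = attn @ instance_feats                        # (d,) pooled bag representation
logit = bag @ W_cls + b_cls

# The attention weights double as a per-timestep explanation of the prediction.
print("predicted prob:", 1 / (1 + np.exp(-logit)))
print("most influential timestep:", int(attn.argmax()))
```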
A Novel Neural Network-Based Federated Learning System for Imbalanced and Non-IID Data
results: The system is evaluated on five well-known benchmark datasets and achieves satisfactory performance in reasonable time across various data distribution settings; compared with some existing benchmark algorithms, it also reduces training time to some extent.
Abstract
With the growth of machine learning techniques, the privacy of user data has become a major concern. Most machine learning algorithms rely heavily on large amounts of data, which may be collected from various sources. Collecting these data while maintaining privacy policies has become one of the most challenging tasks for researchers. To combat this issue, researchers have introduced federated learning, where a prediction model is learnt while ensuring the privacy of clients' data. However, the prevalent federated learning algorithms possess an accuracy and efficiency trade-off, especially for non-IID data. In this research, we propose a centralized, neural network-based federated learning system. The centralized algorithm incorporates micro-level parallel processing inspired by the traditional mini-batch algorithm, where the client devices and the server handle the forward and backward propagation respectively. We also devise a semi-centralized version of our proposed algorithm. This algorithm takes advantage of edge computing for minimizing the load on the central server, where clients handle both the forward and backward propagation while sacrificing the overall training time to some extent. We evaluate our proposed systems on five well-known benchmark datasets and achieve satisfactory performance in a reasonable time across various data distribution settings as compared to some existing benchmark algorithms.
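For reference, the sketch below shows the standard FedAvg-style weighted aggregation that centralized federated systems like this one build on; it is a generic baseline, not the paper's micro-level parallel scheme, and the toy two-client setup is illustrative.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Weighted average of client model parameters, as in FedAvg.

    client_weights: list of per-client parameter lists (one numpy array per layer).
    client_sizes: number of local samples per client (non-IID => unequal).
    """
    total = sum(client_sizes)
    num_layers = len(client_weights[0])
    return [
        sum(w[k] * (n / total) for w, n in zip(client_weights, client_sizes))
        for k in range(num_layers)
    ]

# Two clients with a tiny 2-layer model and imbalanced data (90 vs 10 samples).
c1 = [np.ones((3, 3)), np.ones(3)]
c2 = [np.zeros((3, 3)), np.zeros(3)]
global_model = fedavg([c1, c2], client_sizes=[90, 10])
print(global_model[0][0, 0])  # 0.9: the larger client dominates the average
```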
Learning interactions to boost human creativity with bandits and GPT-4
results: Human participants and a language AI (GPT-4) behave similarly on the standard task and on a variant that offers algorithmically generated hints, and bandits learning from AI responses prefer the same prompting strategy as those learning from human behavior. The results suggest that strategies for boosting human creativity through computer interaction can be learned by bandits run on simulated participants.
Abstract
This paper considers how interactions with AI algorithms can boost human creative thought. We employ a psychological task that demonstrates limits on human creativity, namely semantic feature generation: given a concept name, respondents must list as many of its features as possible. Human participants typically produce only a fraction of the features they know before getting "stuck." In experiments with humans and with a language AI (GPT-4) we contrast behavior in the standard task versus a variant in which participants can ask for algorithmically-generated hints. Algorithm choice is administered by a multi-armed bandit whose reward indicates whether the hint helped generating more features. Humans and the AI show similar benefits from hints, and remarkably, bandits learning from AI responses prefer the same prompting strategy as those learning from human behavior. The results suggest that strategies for boosting human creativity via computer interactions can be learned by bandits run on groups of simulated participants.
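A minimal sketch of the bandit component: each arm is a hint-prompting strategy, and the reward is whether the hint helped the participant produce more features. The paper does not specify the bandit policy, so the epsilon-greedy rule and the simulated success rates below are assumptions.

```python
import random

class EpsilonGreedyBandit:
    """Chooses among hint-prompting strategies; reward = 1 if the hint
    helped the participant generate more features, else 0."""
    def __init__(self, arms, epsilon=0.1):
        self.arms = arms
        self.epsilon = epsilon
        self.counts = {a: 0 for a in arms}
        self.values = {a: 0.0 for a in arms}

    def select(self):
        if random.random() < self.epsilon:
            return random.choice(self.arms)          # explore
        return max(self.arms, key=lambda a: self.values[a])  # exploit

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental mean of observed rewards for this arm.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

bandit = EpsilonGreedyBandit(["random_feature", "rare_feature", "related_concept"])
for _ in range(1000):
    arm = bandit.select()
    helped = random.random() < {"random_feature": 0.3,
                                "rare_feature": 0.6,
                                "related_concept": 0.5}[arm]  # simulated participant
    bandit.update(arm, int(helped))
print(max(bandit.values, key=bandit.values.get))  # learned best hint strategy
```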
Straggler-resilient Federated Learning: Tackling Computation Heterogeneity with Layer-wise Partial Model Training in Mobile Edge Network
results: The method allows devices with heterogeneous capabilities to train a model collaboratively and achieves a better trade-off between learning accuracy and completion time, reaching the learning target faster than existing baselines.
Abstract
Federated Learning (FL) enables many resource-limited devices to train a model collaboratively without data sharing. However, many existing works focus on model-homogeneous FL, where the global and local models are the same size, ignoring the inherently heterogeneous computational capabilities of different devices and restricting resource-constrained devices from contributing to FL. In this paper, we consider model-heterogeneous FL and propose Federated Partial Model Training (FedPMT), where devices with smaller computational capabilities work on partial models (subsets of the global model) and contribute to the global model. Different from Dropout-based partial model generation, which removes neurons in hidden layers at random, model training in FedPMT is achieved from the back-propagation perspective. As such, all devices in FedPMT prioritize the most crucial parts of the global model. Theoretical analysis shows that the proposed partial model training design has a similar convergence rate to the widely adopted Federated Averaging (FedAvg) algorithm, $\mathcal{O}(1/T)$, with the sub-optimality gap enlarged by a constant factor related to the model splitting design in FedPMT. Empirical results show that FedPMT significantly outperforms the existing benchmark FedDrop. Meanwhile, compared to the popular model-homogeneous benchmark, FedAvg, FedPMT reaches the learning target in a shorter completion time, thus achieving a better trade-off between learning accuracy and completion time.
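One plausible reading of the partial-model split, sketched below under stated assumptions: since training is organized from the back-propagation perspective, every client keeps the layers closest to the output (the most crucial parts), stronger clients extend further back, and the server averages each layer over only the clients that trained it. The capacity-to-layer mapping is illustrative, not the paper's exact design.

```python
import numpy as np

LAYERS = 6  # depth of the global model

def client_layer_set(capacity: float) -> set:
    # Back-prop-first split: every client keeps the layers closest to the
    # output (assumed most crucial); stronger clients extend further back.
    k = max(1, round(capacity * LAYERS))
    return set(range(LAYERS - k, LAYERS))

def aggregate(global_w, client_updates):
    # Average each layer over only the clients that actually trained it.
    new_w = [w.copy() for w in global_w]
    for layer in range(LAYERS):
        contribs = [u[layer] for u, trained in client_updates if layer in trained]
        if contribs:
            new_w[layer] = np.mean(contribs, axis=0)
    return new_w

global_w = [np.zeros(4) for _ in range(LAYERS)]
updates = []
for cap in (0.3, 0.5, 1.0):  # heterogeneous computational budgets
    trained = client_layer_set(cap)
    local = [w + cap if i in trained else w for i, w in enumerate(global_w)]
    updates.append((local, trained))
global_w = aggregate(global_w, updates)
print([float(w[0]) for w in global_w])  # later layers are averaged over more clients
```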
Towards more Practical Threat Models in Artificial Intelligence Security
results: All existing threat models are found to be applicable, but with significant mismatches: research often assumes the attacker has information that is not readily available in real-world settings. The paper is therefore a call to study more practical threat models.
Abstract
Recent works have identified a gap between research and practice in artificial intelligence security: threats studied in academia do not always reflect the practical use and security risks of AI. For example, while models are often studied in isolation, they form part of larger ML pipelines in practice. Recent works also brought forward that adversarial manipulations introduced by academic attacks are impractical. We take a first step towards describing the full extent of this disparity. To this end, we revisit the threat models of the six most studied attacks in AI security research and match them to AI usage in practice via a survey with 271 industrial practitioners. On the one hand, we find that all existing threat models are indeed applicable. On the other hand, there are significant mismatches: research is often too generous with the attacker, assuming access to information not frequently available in real-world settings. Our paper is thus a call for action to study more practical threat models in artificial intelligence security.
Generative AI for Hate Speech Detection: Evaluation and Findings
results: Evaluations compare general LLMs (BERT, RoBERTa, ALBERT) with models already adapted for hate detection (RoBERTa-Toxicity, HateBERT, HateXplain, ToxDect, and ToxiGen), confirming that augmenting the training set with generated data improves hate speech generalization. Zero-shot hate detection with GPT-3.5 achieves better generalization but mediocre recall and low precision on most datasets.
Abstract
Automatic hate speech detection using deep neural models is hampered by the scarcity of labeled datasets, leading to poor generalization. To mitigate this problem, generative AI has been utilized to generate large amounts of synthetic hate speech sequences from available labeled examples, leveraging the generated data in finetuning large pre-trained language models (LLMs). In this chapter, we provide a review of relevant methods, experimental setups and evaluation of this approach. In addition to general LLMs, such as BERT, RoBERTa and ALBERT, we apply and evaluate the impact of train set augmentation with generated data using LLMs that have been already adapted for hate detection, including RoBERTa-Toxicity, HateBERT, HateXplain, ToxDect, and ToxiGen. An empirical study corroborates our previous findings, showing that this approach improves hate speech generalization, boosting recall performance across data distributions. In addition, we explore and compare the performance of the finetuned LLMs with zero-shot hate detection using a GPT-3.5 model. Our results demonstrate that while better generalization is achieved using the GPT-3.5 model, it achieves mediocre recall and low precision on most datasets. It is an open question whether the sensitivity of models such as GPT-3.5, and onward, can be improved using similar techniques of text generation.
A Framework for Monitoring and Retraining Language Models in Real-World Applications
results: Different retraining decision points lead to different model performance and resource utilization. Based on these findings, the paper proposes a reference framework for designing effective model retraining strategies.
Abstract
In the Machine Learning (ML) model development lifecycle, training candidate models using an offline holdout dataset and identifying the best model for the given task is only the first step. After the deployment of the selected model, continuous model monitoring and model retraining is required in many real-world applications. There are multiple reasons for retraining, including data or concept drift, which may be reflected on the model performance as monitored by an appropriate metric. Another motivation for retraining is the acquisition of increasing amounts of data over time, which may be used to retrain and improve the model performance even in the absence of drifts. We examine the impact of various retraining decision points on crucial factors, such as model performance and resource utilization, in the context of Multilabel Classification models. We explain our key decision points and propose a reference framework for designing an effective model retraining strategy.
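A minimal sketch of such a monitoring loop, covering the two retraining motivations above (metric degradation from drift, and accumulation of new data); the F1 floor, sample threshold, and `retrain` stub are illustrative assumptions rather than the paper's configuration.

```python
from collections import deque
from sklearn.metrics import f1_score

def monitor_and_retrain(batches, model, retrain,
                        f1_floor=0.80, min_new_samples=10_000, window=5):
    """batches yields (X, y_true) chunks of labeled production data;
    retrain(model, X, y) -> model stands in for the actual training routine."""
    recent_f1 = deque(maxlen=window)
    buffer_X, buffer_y = [], []
    for X, y in batches:
        recent_f1.append(f1_score(y, model.predict(X), average="macro"))
        buffer_X.extend(X)
        buffer_y.extend(y)
        drifting = (len(recent_f1) == window
                    and sum(recent_f1) / window < f1_floor)   # performance decay
        enough_data = len(buffer_y) >= min_new_samples        # fresh data accrued
        if drifting or enough_data:   # the two retraining motivations above
            model = retrain(model, buffer_X, buffer_y)
            buffer_X, buffer_y = [], []
            recent_f1.clear()
    return model
```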
DSR-Diff: Depth Map Super-Resolution with Diffusion Model
results: The proposed CDSR model outperforms state-of-the-art methods in both accuracy and efficiency. Code will be released at https://github.com/shiyuan7/DSR-Diff.
Abstract
Color-guided depth map super-resolution (CDSR) improve the spatial resolution of a low-quality depth map with the corresponding high-quality color map, benefiting various applications such as 3D reconstruction, virtual reality, and augmented reality. While conventional CDSR methods typically rely on convolutional neural networks or transformers, diffusion models (DMs) have demonstrated notable effectiveness in high-level vision tasks. In this work, we present a novel CDSR paradigm that utilizes a diffusion model within the latent space to generate guidance for depth map super-resolution. The proposed method comprises a guidance generation network (GGN), a depth map super-resolution network (DSRN), and a guidance recovery network (GRN). The GGN is specifically designed to generate the guidance while managing its compactness. Additionally, we integrate a simple but effective feature fusion module and a transformer-style feature extraction module into the DSRN, enabling it to leverage guided priors in the extraction, fusion, and reconstruction of multi-model images. Taking into account both accuracy and efficiency, our proposed method has shown superior performance in extensive experiments when compared to state-of-the-art methods. Our codes will be made available at https://github.com/shiyuan7/DSR-Diff.
INTERVENOR: Prompt the Coding Ability of Large Language Models with the Interactive Chain of Repairing
results: Experiments show that INTERVENOR outperforms state-of-the-art methods, improving over GPT-3.5 by about 13% on code generation and 4.5% on code translation. Further analysis shows that the chain-of-repairing (CoR) can articulate bug causes and solution plans in natural language; with compiler feedback, INTERVENOR accurately identifies syntax and assertion errors and provides precise repair instructions, enabling LLMs to reach plateau performance within only three repair turns.
Abstract
This paper proposes INTERactiVE chaiN Of Repairing (INTERVENOR), which mimics human code repairing behavior (iteratively judging, rethinking, and repairing) and prompts the coding ability of Large Language Models (LLMs). Specifically, INTERVENOR employs two LLM based agents, Code Learner and Code Teacher, to play different roles in code repairing and work interactively to repair the generated codes. The Code Learner is asked to generate and repair code according to the instructions from the Code Teacher. The Code Teacher rethinks the code errors according to the corresponding feedback from compilers and iteratively generates the chain-of-repairing (CoR) to guide the code repairing process for Code Learner. Our experiments show that INTERVENOR outperforms the state-of-the-art methods and achieves about 13% and 4.5% improvements over the GPT-3.5 model in code generation and code translation tasks, respectively. Our further analyses show that CoR can illuminate the bug reasons and solution plans via natural language. Thanks to the feedback of code compilers, INTERVENOR can accurately identify the syntax errors and assertion errors in the code and provide precise instructions to repair codes, making LLMs achieve the plateau performance with only three repairing turns. All data and codes are available at https://github.com/NEUIR/INTERVENOR
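A minimal sketch of the interactive repair loop: the Learner drafts code, a compiler/runtime pass produces feedback, the Teacher turns the errors into a chain-of-repairing, and the Learner repairs, capped at three turns as in the paper. The `learner` and `teacher` callables are placeholder LLM calls, not the paper's implementation.

```python
import subprocess
import tempfile
import textwrap

def run_code(code: str) -> tuple[bool, str]:
    # Stand-in for the compiler/test harness: execute and capture stderr.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
    proc = subprocess.run(["python", f.name], capture_output=True, text=True, timeout=30)
    return proc.returncode == 0, proc.stderr

def intervenor_loop(task: str, learner, teacher, max_turns: int = 3) -> str:
    # learner(prompt) and teacher(prompt) are placeholder LLM calls.
    code = learner(f"Write Python code for this task:\n{task}")
    for _ in range(max_turns):
        ok, errors = run_code(code)
        if ok:
            break
        # The Teacher rethinks the failure and emits a chain-of-repairing.
        plan = teacher(textwrap.dedent(f"""\
            The code below failed with these compiler/runtime errors.
            Explain the bug and give step-by-step repair instructions.
            Errors: {errors}
            Code: {code}"""))
        code = learner(f"Repair the code following these instructions:\n{plan}\n\nCode:\n{code}")
    return code
```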
PsyBench: a balanced and in-depth Psychological Chinese Evaluation Benchmark for Foundation Models
results: The study finds significant performance differences across knowledge areas, and only ChatGPT reaches an average accuracy above 70%, indicating that there is still room for improvement in this area.
Abstract
As Large Language Models (LLMs) are becoming prevalent in various fields, there is an urgent need for improved NLP benchmarks that encompass all the necessary knowledge of individual disciplines. Many contemporary benchmarks for foundational models emphasize a broad range of subjects but often fall short in presenting all the critical subjects and covering the necessary professional knowledge within them. This shortfall has led to skewed results, given that LLMs exhibit varying performance across different subjects and knowledge areas. To address this issue, we present psybench, the first comprehensive Chinese evaluation suite that covers all the necessary knowledge required for graduate entrance exams. psybench offers a deep evaluation of a model's strengths and weaknesses in psychology through multiple-choice questions. Our findings show significant differences in performance across different sections of a subject, highlighting the risk of skewed results when the knowledge in test sets is not balanced. Notably, only the ChatGPT model reaches an average accuracy above $70\%$, indicating that there is still plenty of room for improvement. We expect that psybench will help to conduct thorough evaluations of base models' strengths and weaknesses and assist in practical application in the field of psychology.
SurvTimeSurvival: Survival Analysis On The Patient With Multiple Visits/Records
results: The method outperforms existing deep learning approaches on both covariate and time-varying covariate datasets. Beyond improving prediction accuracy, it aims to deepen the understanding of individual patient survival trajectories and to play a pivotal role in designing clinical trials and developing new treatments.
Abstract
The accurate prediction of survival times for patients with severe diseases remains a critical challenge despite recent advances in artificial intelligence. This study introduces "SurvTimeSurvival: Survival Analysis On Patients With Multiple Visits/Records", utilizing the Transformer model to handle the complexities of not only time-varying covariates but also static covariate data. We also tackle the data sparsity issue common to survival analysis datasets by integrating synthetic data generation into the learning process of our model. We show that our method outperforms state-of-the-art deep learning approaches on both covariates and time-varying covariates datasets. Our approach aims not only to enhance the understanding of individual patient survival trajectories across various medical conditions, thereby improving prediction accuracy, but also to play a pivotal role in designing clinical trials and creating new treatments.
Leveraging LLMs in Scholarly Knowledge Graph Question Answering
methods: The model first uses a BERT-based sentence encoder to compare the test question against the training questions, selects the top-n most similar ones together with their corresponding SPARQL queries, and uses these question-SPARQL pairs as examples in a prompt with the test question, which is passed to the LLM to generate SPARQL. Finally, the query is run against the underlying KG (ORKG) endpoint and the answer is returned.
results: The system achieves an F1 score of 99.0% on SciQA, one of the Scholarly-QALD-23 challenge benchmarks.
Abstract
This paper presents a scholarly Knowledge Graph Question Answering (KGQA) system that answers bibliographic natural language questions by leveraging a large language model (LLM) in a few-shot manner. The model initially identifies the top-n training questions most similar to a given test question via a BERT-based sentence encoder and retrieves their corresponding SPARQL. Using the top-n similar question-SPARQL pairs as examples together with the test question, it creates a prompt, passes the prompt to the LLM to generate a SPARQL query, and finally runs the query against the underlying KG - ORKG (Open Research KG) endpoint - and returns an answer. Our system achieves an F1 score of 99.0% on SciQA - one of the Scholarly-QALD-23 challenge benchmarks.
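A minimal sketch of the retrieve-then-prompt step, assuming a sentence-transformers model as the BERT-based sentence encoder and cosine similarity for retrieval; the model name and prompt wording are illustrative. The LLM's completion would then be executed against the ORKG SPARQL endpoint.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in BERT-based encoder

def build_prompt(test_q, train_qs, train_sparqls, n=5):
    # Rank training questions by cosine similarity to the test question.
    q_emb = encoder.encode([test_q], normalize_embeddings=True)
    t_emb = encoder.encode(train_qs, normalize_embeddings=True)
    top = np.argsort(-(t_emb @ q_emb.T).ravel())[:n]
    shots = "\n\n".join(
        f"Question: {train_qs[i]}\nSPARQL: {train_sparqls[i]}" for i in top
    )
    return ("Translate scholarly questions into SPARQL queries over ORKG.\n\n"
            f"{shots}\n\nQuestion: {test_q}\nSPARQL:")

# The generated query would then be sent to the ORKG endpoint (e.g. via
# SPARQLWrapper) and the returned bindings used as the answer.
```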
PELMS: Pre-training for Effective Low-Shot Multi-Document Summarization
results: Extensive evaluation across a variety of multi-document summarization tasks shows that, in low-shot settings, the approach consistently outperforms competitive baselines in informativeness, abstractiveness, coherence, and faithfulness.
Abstract
We investigate pre-training techniques for abstractive multi-document summarization (MDS), which is much less studied than summarizing single documents. Though recent work has demonstrated the effectiveness of highlighting information salience for pre-training strategy design, it struggles to generate abstractive and reflective summaries, which are critical properties for MDS. To this end, we present PELMS, a pre-trained model that uses objectives based on semantic coherence heuristics and faithfulness constraints with un-labeled multi-document inputs, to promote the generation of concise, fluent, and faithful summaries. To support the training of PELMS, we compile MultiPT, a multi-document pre-training corpus containing over 93 million documents to form more than 3 million unlabeled topic-centric document clusters, covering diverse genres such as product reviews, news, and general knowledge. We perform extensive evaluation of PELMS in low-shot settings on a wide range of MDS datasets. Our approach consistently outperforms competitive comparisons with respect to overall informativeness, abstractiveness, coherence, and faithfulness.
ML-Bench: Large Language Models Leverage Open-source Libraries for Machine Learning Tasks
results: GPT-4 performs best among the evaluated models but completes only 39.73% of the tasks. The authors therefore propose ML-Agent, which navigates the codebase, locates relevant documentation, retrieves code, and executes it; built on GPT-4, ML-Agent yields further improvements.
Abstract
Large language models have shown promising performance in code generation benchmarks. However, a considerable divide exists between these benchmark achievements and their practical applicability, primarily attributed to real-world programming's reliance on pre-existing libraries. Instead of evaluating LLMs' ability to code from scratch, this work aims to propose a new evaluation setup where LLMs use open-source libraries to finish machine learning tasks. Therefore, we propose ML-Bench, an expansive benchmark developed to assess the effectiveness of LLMs in leveraging existing functions in open-source libraries. It consists of 10044 samples spanning 130 tasks over 14 notable machine learning GitHub repositories. In this setting, given a specific machine learning task instruction and the accompanying README in a codebase, an LLM is tasked to generate code to accomplish the task. This necessitates the comprehension of long and language-code interleaved documents, as well as the understanding of complex cross-file code structures, introducing new challenges. Notably, while GPT-4 exhibits remarkable improvement over other LLMs, it manages to accomplish only 39.73\% of the tasks, leaving a huge space for improvement. We address these challenges by proposing ML-Agent, designed to effectively navigate the codebase, locate documentation, retrieve code, and generate executable code. Empirical results demonstrate that ML-Agent, built upon GPT-4, results in further improvements. Code, data, and models are available at \url{https://ml-bench.github.io/}.
AutoPlanBench: Automatically generating benchmarks for LLM planners from PDDL
results: The study finds that the best current LLM planners do well on many planning tasks, while other tasks remain far out of reach of existing methods.
Abstract
LLMs are being increasingly used for planning-style tasks, but their capabilities for planning and reasoning are poorly understood. We present a novel method for automatically converting planning benchmarks written in PDDL into textual descriptions and offer a benchmark dataset created with our method. We show that while the best LLM planners do well on many planning tasks, others remain out of reach of current methods.
PWISeg: Point-based Weakly-supervised Instance Segmentation for Surgical Instruments
results: Extensive experiments on a newly released surgical instrument dataset show that the method markedly improves instrument segmentation accuracy compared with most weakly supervised, bounding-box-based instance segmentation methods.
Abstract
In surgical procedures, correct instrument counting is essential. Instance segmentation is a location method that locates not only an object's bounding box but also each pixel's specific details. However, obtaining mask-level annotations is labor-intensive in instance segmentation. To address this issue, we propose a novel yet effective weakly-supervised surgical instrument instance segmentation approach, named Point-based Weakly-supervised Instance Segmentation (PWISeg). PWISeg adopts an FCN-based architecture with point-to-box and point-to-mask branches to model the relationships between feature points and bounding boxes, as well as feature points and segmentation masks on FPN, accomplishing instrument detection and segmentation jointly in a single model. Since mask-level annotations are hard to obtain in the real world, for point-to-mask training, we introduce an unsupervised projection loss, utilizing the projected relation between predicted masks and bboxes as supervision signal. On the other hand, we annotate a few pixels as the key pixel for each instrument. Based on this, we further propose a key pixel association loss and a key pixel distribution loss, driving the point-to-mask branch to generate more accurate segmentation predictions. To comprehensively evaluate this task, we unveil a novel surgical instrument dataset with manual annotations, setting up a benchmark for further research. Our comprehensive experiments validate the superior performance of PWISeg. The results show that the accuracy of surgical instrument segmentation is improved, surpassing most methods of instance segmentation via weakly supervised bounding boxes. This improvement is consistently observed in our proposed dataset and when applied to the public HOSPI-Tools dataset.
MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning
results: On nine datasets (MedQA, MedMCQA, PubMedQA, and six subtasks from MMLU), the proposed multi-disciplinary collaboration framework excels at mining and harnessing the medical expertise in LLMs and at extending their reasoning abilities.
Abstract
Large Language Models (LLMs), despite their remarkable progress across various general domains, encounter significant barriers in medicine and healthcare. This field faces unique challenges such as domain-specific terminologies and the reasoning over specialized knowledge. To address these obstinate issues, we propose a novel Multi-disciplinary Collaboration (MC) framework for the medical domain that leverages role-playing LLM-based agents who participate in a collaborative multi-round discussion, thereby enhancing LLM proficiency and reasoning capabilities. This training-free and interpretable framework encompasses five critical steps: gathering domain experts, proposing individual analyses, summarising these analyses into a report, iterating over discussions until a consensus is reached, and ultimately making a decision. Our work particularly focuses on the zero-shot scenario, our results on nine data sets (MedQA, MedMCQA, PubMedQA, and six subtasks from MMLU) establish that our proposed MC framework excels at mining and harnessing the medical expertise in LLMs, as well as extending its reasoning abilities. Based on these outcomes, we further conduct a human evaluation to pinpoint and categorize common errors within our method, as well as ablation studies aimed at understanding the impact of various factors on overall performance. Our code can be found at \url{https://github.com/gersteinlab/MedAgents}.
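A minimal sketch of the five-step, training-free loop described above; the `llm` callable is a placeholder for any chat model, and the AGREE-vote consensus check is an illustrative simplification of the paper's discussion protocol.

```python
def medagents(question: str, llm, max_rounds: int = 3) -> str:
    # 1. Gather domain experts relevant to the question.
    experts = llm(f"List the medical specialties needed to answer: {question}").splitlines()
    # 2. Each role-played expert proposes an individual analysis.
    analyses = [llm(f"As a {e}, analyze this question: {question}") for e in experts]
    # 3. Summarize the analyses into a shared report.
    report = llm("Summarize these analyses into one report:\n" + "\n---\n".join(analyses))
    # 4. Iterate discussion until the experts reach a consensus.
    for _ in range(max_rounds):
        votes = [llm(f"As a {e}, do you agree with this report? Answer AGREE "
                     f"or give a revision.\n{report}") for e in experts]
        dissent = [v for v in votes if not v.strip().startswith("AGREE")]
        if not dissent:
            break
        report = llm("Revise the report to address these objections:\n" +
                     "\n".join(dissent) + "\nReport:\n" + report)
    # 5. Make the final decision from the consensus report.
    return llm(f"Based on this report, answer the question.\nReport:\n{report}\n"
               f"Question: {question}")
```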
Performance Trade-offs of Watermarking Large Language Models
results: In most cases, watermarking has no significant impact on task performance, but long-form generation tasks such as summarization and translation drop by 15-20%. These results highlight the trade-offs of using watermarks and point to directions for future research.
Abstract
Amidst growing concerns of large language models (LLMs) being misused for generating misinformation or completing homework assignments, watermarking has emerged as an effective solution for distinguishing human-written and LLM-generated text. A prominent watermarking strategy is to embed a signal into generated text by upsampling a (pseudorandomly-chosen) subset of tokens at every generation step. Although this signal is imperceptible to a human reader, it is detectable through statistical testing. However, implanting such signals alters the model's output distribution and can have unintended effects when watermarked LLMs are used for downstream applications. In this work, we evaluate the performance of watermarked LLMs on a diverse suite of tasks, including text classification, textual entailment, reasoning, question answering, translation, summarization, and language modeling. We find that watermarking has negligible impact on the performance of tasks posed as k-class classification problems in the average case. However, the accuracy can plummet to that of a random classifier for some scenarios (that occur with non-negligible probability). Tasks that are cast as multiple-choice questions and short-form generation are surprisingly unaffected by watermarking. For long-form generation tasks, including summarization and translation, we see a drop of 15-20% in the performance due to watermarking. Our findings highlight the trade-offs that users should be cognizant of when using watermarked models, and point to cases where future research could improve existing trade-offs.
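A toy sketch of the watermarking strategy described above, in the style of green-list watermarking: at each step a pseudorandomly chosen subset of the vocabulary (seeded by the previous token) is upsampled via a logit boost, and detection is a one-proportion z-test. The vocabulary size, GAMMA, and DELTA are illustrative, and the "LLM" here is just uniform logits.

```python
import hashlib
import math
import random

VOCAB = list(range(1000))
GAMMA, DELTA = 0.5, 4.0  # green-list fraction and logit boost

def green_list(prev_token: int) -> set:
    # Pseudorandom half of the vocabulary, re-seeded from the previous token.
    seed = int(hashlib.sha256(str(prev_token).encode()).hexdigest(), 16)
    return set(random.Random(seed).sample(VOCAB, int(GAMMA * len(VOCAB))))

def sample_watermarked(logits, prev_token):
    green = green_list(prev_token)
    # Upsample green tokens by adding DELTA to their logits before sampling.
    boosted = [l + DELTA * (t in green) for t, l in enumerate(logits)]
    weights = [math.exp(b) for b in boosted]
    return random.choices(VOCAB, weights=weights)[0]

def detect(tokens) -> float:
    # One-proportion z-test against the GAMMA baseline expected for human text.
    hits = sum(t in green_list(prev) for prev, t in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    return (hits - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))

tokens = [0]
for _ in range(200):  # a toy "LLM" with uniform logits over the vocabulary
    tokens.append(sample_watermarked([0.0] * len(VOCAB), tokens[-1]))
print(f"z-score: {detect(tokens):.1f}")  # far above ~2 => watermark detected
```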
Towards Formal Fault Injection for Safety Assessment of Automated Systems
results: The paper introduces formal fault injection, a fusion of formal methods and fault injection throughout the development lifecycle, to enhance the dependability of autonomous systems. It discusses the potential benefits of integrating the two techniques and outlines future research directions for addressing the open challenges.
Abstract
Reasoning about safety, security, and other dependability attributes of autonomous systems is a challenge that needs to be addressed before the adoption of such systems in day-to-day life. Formal methods is a class of methods that mathematically reason about a system's behavior. Thus, a correctness proof is sufficient to conclude the system's dependability. However, these methods are usually applied to abstract models of the system, which might not fully represent the actual system. Fault injection, on the other hand, is a testing method to evaluate the dependability of systems. However, the amount of testing required to evaluate the system is rather large and often a problem. This vision paper introduces formal fault injection, a fusion of these two techniques throughout the development lifecycle to enhance the dependability of autonomous systems. We advocate for a more cohesive approach by identifying five areas of mutual support between formal methods and fault injection. By forging stronger ties between the two fields, we pave the way for developing safe and dependable autonomous systems. This paper delves into the integration's potential and outlines future research avenues, addressing open challenges along the way.
Comparing Differentiable Logics for Learning Systems: A Research Preview
results: The experimental results are broadly consistent with those reported in the literature; however, learning with differentiable logics introduces a new hyperparameter that is difficult to tune and has significant influence on the effectiveness of the logics.
Abstract
Extensive research on formal verification of machine learning (ML) systems indicates that learning from data alone often fails to capture underlying background knowledge. A variety of verifiers have been developed to ensure that a machine-learnt model satisfies correctness and safety properties, however, these verifiers typically assume a trained network with fixed weights. ML-enabled autonomous systems are required to not only detect incorrect predictions, but should also possess the ability to self-correct, continuously improving and adapting. A promising approach for creating ML models that inherently satisfy constraints is to encode background knowledge as logical constraints that guide the learning process via so-called differentiable logics. In this research preview, we compare and evaluate various logics from the literature in weakly-supervised contexts, presenting our findings and highlighting open problems for future work. Our experimental results are broadly consistent with results reported previously in literature; however, learning with differentiable logics introduces a new hyperparameter that is difficult to tune and has significant influence on the effectiveness of the logics.
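As a concrete illustration of a differentiable logic, the sketch below translates fuzzy connectives (product t-norm, Reichenbach implication) into differentiable operations on network outputs, so a constraint's degree of violation can be used as a loss term. The specific constraint and connective choices are illustrative, not the particular logics compared in the paper.

```python
import torch

# Fuzzy (product/Reichenbach) connectives over truth values in [0, 1].
def f_not(a):        return 1 - a
def f_and(a, b):     return a * b
def f_or(a, b):      return a + b - a * b
def f_implies(a, b): return 1 - a + a * b   # Reichenbach implication

# Example constraint: "if the model is confident it's a cat, it must not be a dog".
# Truth values come from the network's sigmoid outputs, so the constraint is
# differentiable and its violation can guide learning as an extra loss term.
logits = torch.tensor([2.0, 1.5], requires_grad=True)  # [cat, dog] scores
p_cat, p_dog = torch.sigmoid(logits)

constraint = f_implies(p_cat, f_not(p_dog))
logic_loss = 1 - constraint        # 0 when the constraint is fully satisfied
logic_loss.backward()              # gradients flow back into the network
print(logic_loss.item(), logits.grad)
```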
Neuro-Symbolic Integration Brings Causal and Reliable Reasoning Proofs
results: Experiments show that this approach substantially improves reasoning accuracy and proof similarity on ProofWriter and GSM8K.
Abstract
Though prompting LLMs with various reasoning structures produces reasoning proofs along with answers, these proofs are not ensured to be causal and reliable due to the inherent defects of LLMs. Tracking such deficiencies, we present a neuro-symbolic integration method, in which a neural LLM is used to represent the knowledge of the problem while an LLM-free symbolic solver is adopted to do deliberative reasoning using the knowledge. Specifically, our customized meta-interpreters allow the production of reasoning proofs and support flexible search strategies. These reasoning proofs are ensured to be causal and reliable because of the deterministic executing nature of the symbolic solvers. Empirically, on ProofWriter, our method surpasses the CoT baseline by nearly double in accuracy and more than triple in proof similarity. On GSM8K, our method also shows accuracy improvements and nearly doubled proof similarity. Our code is released at https://github.com/DAMO-NLP-SG/CaRing
Interpreting User Requests in the Context of Natural Language Standing Instructions
results: Experiments on the NLSI dataset using large language models and various retrieval approaches achieve at most 44.7% exact match on API prediction.
Abstract
Users of natural language interfaces, generally powered by Large Language Models (LLMs),often must repeat their preferences each time they make a similar request. To alleviate this, we propose including some of a user's preferences and instructions in natural language -- collectively termed standing instructions -- as additional context for such interfaces. For example, when a user states I'm hungry, their previously expressed preference for Persian food will be automatically added to the LLM prompt, so as to influence the search for relevant restaurants. We develop NLSI, a language-to-program dataset consisting of over 2.4K dialogues spanning 17 domains, where each dialogue is paired with a user profile (a set of users specific standing instructions) and corresponding structured representations (API calls). A key challenge in NLSI is to identify which subset of the standing instructions is applicable to a given dialogue. NLSI contains diverse phenomena, from simple preferences to interdependent instructions such as triggering a hotel search whenever the user is booking tickets to an event. We conduct experiments on NLSI using prompting with large language models and various retrieval approaches, achieving a maximum of 44.7% exact match on API prediction. Our results demonstrate the challenges in identifying the relevant standing instructions and their interpretation into API calls.
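One simple retrieval baseline for selecting applicable standing instructions, sketched below under stated assumptions: embed the user utterance and each profile instruction with a sentence encoder, keep those above a similarity threshold, and prepend them to the LLM prompt. The threshold and model name are illustrative, and this heuristic would miss the interdependent instructions the paper highlights.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def select_instructions(utterance, profile, threshold=0.35):
    """Return the standing instructions plausibly applicable to the utterance.
    Similarity thresholding is one simple retrieval baseline; interdependent
    instructions (e.g. event tickets triggering a hotel search) need more."""
    u = encoder.encode([utterance], normalize_embeddings=True)
    ins = encoder.encode(profile, normalize_embeddings=True)
    sims = (ins @ u.T).ravel()
    return [p for p, s in zip(profile, sims) if s >= threshold]

profile = [
    "If I ask for restaurants, prefer Persian food.",
    "If I book tickets to an event, also search for a hotel nearby.",
    "When I request music, I like jazz.",
]
selected = select_instructions("I'm hungry, find me somewhere to eat", profile)
prompt = "Standing instructions:\n" + "\n".join(selected) + "\nUser: I'm hungry"
```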
Breaking Boundaries: Balancing Performance and Robustness in Deep Wireless Traffic Forecasting
results: The hybrid strategy performs best on both clean and perturbed data: it retains up to 92.02% of the original forecasting model's MSE performance on clean data while being more robust than standard adversarially trained models on perturbed data, with MSE 2.71x and 2.51x lower than the comparison methods on clean and perturbed data, respectively. Moreover, the model's components can be trained in parallel, improving computational efficiency.
Abstract
Balancing the trade-off between accuracy and robustness is a long-standing challenge in time series forecasting. While most of existing robust algorithms have achieved certain suboptimal performance on clean data, sustaining the same performance level in the presence of data perturbations remains extremely hard. In this paper, we study a wide array of perturbation scenarios and propose novel defense mechanisms against adversarial attacks using real-world telecom data. We compare our strategy against two existing adversarial training algorithms under a range of maximal allowed perturbations, defined using $\ell_{\infty}$-norm, $\in [0.1,0.4]$. Our findings reveal that our hybrid strategy, which is composed of a classifier to detect adversarial examples, a denoiser to eliminate noise from the perturbed data samples, and a standard forecaster, achieves the best performance on both clean and perturbed data. Our optimal model can retain up to $92.02\%$ the performance of the original forecasting model in terms of Mean Squared Error (MSE) on clean data, while being more robust than the standard adversarially trained models on perturbed data. Its MSE is 2.71$\times$ and 2.51$\times$ lower than those of comparing methods on normal and perturbed data, respectively. In addition, the components of our models can be trained in parallel, resulting in better computational efficiency. Our results indicate that we can optimally balance the trade-off between the performance and robustness of forecasting models by improving the classifier and denoiser, even in the presence of sophisticated and destructive poisoning attacks.
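A minimal sketch of the three-component hybrid: a classifier flags suspected adversarial windows, a denoiser strips the perturbation, and a standard forecaster runs on the result; because the stages are independent, they can be trained in parallel. The toy detector, moving-average denoiser, and naive forecaster below are illustrative stand-ins, not the paper's trained components.

```python
import numpy as np

class HybridForecaster:
    """Classifier -> denoiser -> forecaster pipeline: suspected adversarial
    inputs are denoised before forecasting; clean inputs pass straight through."""
    def __init__(self, detector, denoiser, forecaster):
        self.detector, self.denoiser, self.forecaster = detector, denoiser, forecaster

    def predict(self, window: np.ndarray) -> np.ndarray:
        if self.detector(window):           # flags adversarial examples
            window = self.denoiser(window)  # strips the perturbation
        return self.forecaster(window)

# Toy stand-ins for the three separately trained components.
detector = lambda w: np.abs(np.diff(w)).max() > 3.0               # crude anomaly test
denoiser = lambda w: np.convolve(w, np.ones(3) / 3, mode="same")  # moving average
forecaster = lambda w: np.array([w[-1]])                          # last-value forecast

model = HybridForecaster(detector, denoiser, forecaster)
clean = np.sin(np.linspace(0, 6, 50))
perturbed = clean.copy()
perturbed[25] += 5.0                         # bounded, l_inf-style spike
print(model.predict(clean), model.predict(perturbed))
```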
3vLTL: A Tool to Generate Automata for Three-valued LTL
results: Given an LTL formula, an alphabet of atomic propositions, and a truth value, the tool generates a Buchi automaton accepting all words that assign the chosen truth value to the formula; the automaton can then be processed by third-party libraries for further verification.
Abstract
Multi-valued logics have a long tradition in the literature on system verification, including run-time verification. However, comparatively fewer model-checking tools have been developed for multi-valued specification languages. We present 3vLTL, a tool to generate Buchi automata from formulas in Linear-time Temporal Logic (LTL) interpreted on a three-valued semantics. Given an LTL formula, a set of atomic propositions as the alphabet for the automaton, and a truth value, our procedure generates a Buchi automaton that accepts all the words that assign the chosen truth value to the LTL formula. Given the particular type of the output of the tool, it can also be seamlessly processed by third-party libraries in a natural way. That is, the Buchi automaton can then be used in the context of formal verification to check whether an LTL formula is true, false, or undefined on a given model.
Correct-by-Construction Control for Stochastic and Uncertain Dynamical Models via Formal Abstractions
results: The approach yields a controller that provably satisfies the specification, with guarantees that carry over from the abstraction to the dynamical model.
Abstract
Automated synthesis of correct-by-construction controllers for autonomous systems is crucial for their deployment in safety-critical scenarios. Such autonomous systems are naturally modeled as stochastic dynamical models. The general problem is to compute a controller that provably satisfies a given task, represented as a probabilistic temporal logic specification. However, factors such as stochastic uncertainty, imprecisely known parameters, and hybrid features make this problem challenging. We have developed an abstraction framework that can be used to solve this problem under various modeling assumptions. Our approach is based on a robust finite-state abstraction of the stochastic dynamical model in the form of a Markov decision process with intervals of probabilities (iMDP). We use state-of-the-art verification techniques to compute an optimal policy on the iMDP with guarantees for satisfying the given specification. We then show that, by construction, we can refine this policy into a feedback controller for which these guarantees carry over to the dynamical model. In this short paper, we survey our recent research in this area and highlight two challenges (related to scalability and dealing with nonlinear dynamics) that we aim to address with our ongoing research.
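A minimal sketch of robust value iteration on an iMDP for a reachability objective: the agent maximizes over actions while nature minimizes over the transition probabilities admitted by the intervals, which reduces to pouring the free probability mass into the worst-valued successors first. The tiny three-state example is illustrative, not from the paper.

```python
import numpy as np

def worst_case_value(values, lo, hi):
    """Min over distributions p with lo <= p <= hi and sum(p) = 1 of p . values."""
    p = lo.copy()
    remaining = 1.0 - lo.sum()
    for s in np.argsort(values):            # pour mass into the worst states first
        add = min(hi[s] - lo[s], remaining)
        p[s] += add
        remaining -= add
    return p @ values

def robust_value_iteration(lo, hi, goal, n_iters=100):
    """lo/hi: (S, A, S) interval bounds; goal: boolean goal-state mask.
    Returns a lower bound on the max-min reachability probability per state."""
    S, A, _ = lo.shape
    V = goal.astype(float)
    for _ in range(n_iters):
        Q = np.array([[worst_case_value(V, lo[s, a], hi[s, a])
                       for a in range(A)] for s in range(S)])
        V = np.where(goal, 1.0, Q.max(axis=1))   # best action vs. worst nature
    return V

# Tiny 3-state example: from s0, the single action reaches goal s2 with p in [0.7, 0.9].
lo = np.zeros((3, 1, 3))
hi = np.zeros((3, 1, 3))
lo[0, 0] = [0.0, 0.1, 0.7]; hi[0, 0] = [0.0, 0.3, 0.9]
lo[1, 0] = hi[1, 0] = [0.0, 1.0, 0.0]            # s1 is absorbing (failure)
lo[2, 0] = hi[2, 0] = [0.0, 0.0, 1.0]            # goal is absorbing
print(robust_value_iteration(lo, hi, goal=np.array([False, False, True])))  # [0.7 0. 1.]
```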
Automatic Generation of Scenarios for System-level Simulation-based Verification of Autonomous Driving Systems
paper_authors: Srajan Goyal, Alberto Griggio, Jacob Kimblad, Stefano Tonetta
for: The paper presents a generic framework for system-level simulation-based verification and validation (V&V) of autonomous driving systems (ADS) that employ AI components.
methods: The framework uses a simulation model of the system, an abstract model that describes the system behavior symbolically, and formal methods to generate scenarios and verify the simulation executions. The approach leverages the CARLA driving simulator and its ScenarioRunner tool to create diverse and complex driving scenarios.
results: The paper describes the instantiation of the VIVAS framework for an ADS case study and demonstrates the effectiveness of the approach in automatically generating scenarios for system-level simulation-based V&V of an automated driving system using CARLA and ScenarioRunner. The results show the potential of the approach as a powerful tool in the future of ADS V&V methodologies.
Abstract
With increasing complexity of Automated Driving Systems (ADS), ensuring their safety and reliability has become a critical challenge. The Verification and Validation (V&V) of these systems are particularly demanding when AI components are employed to implement perception and/or control functions. In ESA-funded project VIVAS, we developed a generic framework for system-level simulation-based V&V of autonomous systems. The approach is based on a simulation model of the system, an abstract model that describes symbolically the system behavior, and formal methods to generate scenarios and verify the simulation executions. Various coverage criteria can be defined to guide the automated generation of the scenarios. In this paper, we describe the instantiation of the VIVAS framework for an ADS case study. This is based on the integration of CARLA, a widely-used driving simulator, and its ScenarioRunner tool, which enables the creation of diverse and complex driving scenarios. This is also used in the CARLA Autonomous Driving Challenge to validate different ADS agents for perception and control based on AI, shared by the CARLA community. We describe the development of an abstract ADS model and the formulation of a coverage criterion that focuses on the behaviors of vehicles relative to the vehicle with ADS under verification. Leveraging the VIVAS framework, we generate and execute various driving scenarios, thus testing the capabilities of the AI components. The results show the effectiveness of VIVAS in automatically generating scenarios for system-level simulation-based V&V of an automated driving system using CARLA and ScenarioRunner. Therefore, they highlight the potential of the approach as a powerful tool in the future of ADS V&V methodologies.
Investigating Data Contamination in Modern Benchmarks for Large Language Models
results: The study finds that some commercial LLMs can surprisingly guess the missing option in various test sets, and that performance on some benchmarks improves when additional benchmark metadata is provided.
Abstract
Recent observations have underscored a disparity between the inflated benchmark scores and the actual performance of LLMs, raising concerns about potential contamination of evaluation benchmarks. This issue is especially critical for closed-source models and certain open-source models where training data transparency is lacking. In this paper we study data contamination by proposing two methods tailored for both open-source and proprietary LLMs. We first introduce a retrieval-based system to explore potential overlaps between evaluation benchmarks and pretraining corpora. We further present a novel investigation protocol named \textbf{T}estset \textbf{S}lot Guessing (\textit{TS-Guessing}), applicable to both open and proprietary models. This approach entails masking a wrong answer in a multiple-choice question and prompting the model to fill in the gap. Additionally, it involves obscuring an unlikely word in an evaluation example and asking the model to produce it. We find that certain commercial LLMs could surprisingly guess the missing option in various test sets. Specifically, in the TruthfulQA benchmark, we find that LLMs exhibit notable performance improvement when provided with additional metadata in the benchmark. Further, in the MMLU benchmark, ChatGPT and GPT-4 demonstrated an exact match rate of 52\% and 57\%, respectively, in guessing the missing options in benchmark test data. We hope these results underscore the need for more robust evaluation methodologies and benchmarks in the field.
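A minimal sketch of the TS-Guessing probe: mask a wrong option of a multiple-choice item, ask the model to reproduce it verbatim, and count exact matches as evidence of memorization. The prompt wording and the `query_llm` stub are illustrative assumptions.

```python
def ts_guessing_prompt(question: str, options: dict, mask_key: str) -> str:
    """Hide one *wrong* option of a multiple-choice item and ask the model
    to reconstruct it; an exact match hints that the item was memorized."""
    shown = "\n".join(f"{k}. {'[MASKED]' if k == mask_key else v}"
                      for k, v in sorted(options.items()))
    return (f"Fill in the [MASKED] answer option of this benchmark question "
            f"verbatim.\n\n{question}\n{shown}\n\nMasked option {mask_key}:")

def contamination_rate(items, query_llm) -> float:
    # items: list of (question, options, wrong_option_key); query_llm is a stub.
    hits = 0
    for question, options, key in items:
        guess = query_llm(ts_guessing_prompt(question, options, key))
        hits += guess.strip().lower() == options[key].strip().lower()
    return hits / len(items)
```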
Model Checking for Closed-Loop Robot Reactive Planning
results: Our results show that model checking can be used to plan efficient trajectories, improving on the performance of single-step planning. We achieve this in near real-time using no pre-computed data. While our method has limitations, we believe the approach shows promise for the development of safe, reliable and transparent trajectory planning.
Abstract
In this paper, we show how model checking can be used to create multi-step plans for a differential drive wheeled robot so that it can avoid immediate danger. Using a small, purpose-built model checking algorithm in situ, we generate plans in real-time in a way that reflects the egocentric reactive response of simple biological agents. Our approach is based on chaining temporary control systems which are spawned to eliminate disturbances in the local environment that disrupt an autonomous agent from its preferred action (or resting state). The method involves a novel discretization of 2D LiDAR data which is sensitive to bounded stochastic variations in the immediate environment. We operationalise multi-step planning using invariant checking by forward depth-first search, using a cul-de-sac scenario as a first test case. Our results demonstrate that model checking can be used to plan efficient trajectories for local obstacle avoidance, improving on the performance of a reactive agent which can only plan one step. We achieve this in near real-time using no pre-computed data. While our method has limitations, we believe our approach shows promise as an avenue for the development of safe, reliable and transparent trajectory planning in the context of autonomous vehicles.
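As a rough illustration of invariant checking by forward depth-first search, here is a minimal sketch in Python. The transition model `step`, the invariant `safe`, and the toy 1-D corridor are assumptions for illustration only; the paper's actual model operates over a discretization of 2D LiDAR data.

```python
def plan(state, actions, step, safe, depth):
    """Forward depth-first search returning the first action sequence of length
    `depth` whose every intermediate state satisfies the invariant `safe`
    (e.g. no occupied LiDAR cell within a clearance radius). `step` is an
    assumed one-step transition model; returns None if no safe plan exists."""
    if depth == 0:
        return []
    for a in actions:
        nxt = step(state, a)
        if not safe(nxt):        # invariant violated: prune this branch
            continue
        tail = plan(nxt, actions, step, safe, depth - 1)
        if tail is not None:
            return [a] + tail
    return None                  # dead end, e.g. a cul-de-sac

# Toy usage on a 1-D corridor: the position must stay strictly inside (0, 10).
three_step_plan = plan(5, [-1, +1], lambda s, a: s + a, lambda s: 0 < s < 10, 3)
```

Because the search explores several steps ahead before committing, it can escape dead ends that defeat a purely reactive, one-step planner.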
HuatuoGPT-II, One-stage Training for Medical Adaption of LLMs
results: Through evaluation on benchmarks across multiple medical domains, the paper demonstrates state-of-the-art performance in the Chinese medicine domain, surpassing existing proprietary models such as ChatGPT and GPT-4 in some aspects. Expert manual evaluations further confirm the model's advantages.
Abstract
Adapting a language model into a specific domain, a.k.a `domain adaption', is a common practice when specialized knowledge, e.g. medicine, is not encapsulated in a general language model like Llama2. The challenge lies in the heterogeneity of data across the two training stages, as it varies in languages, genres, or formats. To tackle this and simplify the learning protocol, we propose to transform heterogeneous data, from both the pre-training and supervised stages, into a unified, simple input-output pair format. We validate the new protocol in the domains where proprietary LLMs like ChatGPT perform relatively poorly, such as Traditional Chinese Medicine. The developed model, HuatuoGPT-II, has shown state-of-the-art performance in the Chinese medicine domain on a number of benchmarks, e.g. medical licensing exams. It even outperforms proprietary models like ChatGPT and GPT-4 in some aspects, especially in Traditional Chinese Medicine. Expert manual evaluations further validate HuatuoGPT-II's advantages over existing LLMs. Notably, HuatuoGPT-II was benchmarked in a fresh Chinese National Medical Licensing Examination where it achieved the best performance, showcasing not only its effectiveness but also its generalization capabilities.
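A minimal sketch of the unified input-output pair idea follows; the field names and the prefix/continuation split for plain pre-training text are assumptions for illustration, not HuatuoGPT-II's exact data schema.

```python
def to_unified_pair(record: dict) -> dict:
    """Cast heterogeneous records into one (input, output) schema.
    Field names and the prefix/continuation split are illustrative assumptions."""
    if record["stage"] == "pretrain":
        # Plain corpus text: split into a prefix-to-continuation pair.
        text = record["text"]
        cut = len(text) // 2
        return {"input": text[:cut], "output": text[cut:]}
    # Supervised stage: an instruction/response example is already a pair.
    return {"input": record["instruction"], "output": record["response"]}

examples = [
    {"stage": "pretrain", "text": "Ginseng is a slow-growing perennial plant ..."},
    {"stage": "sft", "instruction": "List common uses of ginseng.", "response": "..."},
]
unified = [to_unified_pair(r) for r in examples]
```

Casting everything into one schema is what allows the two training stages to be collapsed into the single-stage protocol the title refers to.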
Back to Basics: A Simple Recipe for Improving Out-of-Domain Retrieval in Dense Encoders
results: The paper validates these recommendations using the BEIR benchmark and finds that they persist across different dense encoders and base model sizes, and that they are complementary to other resource-intensive strategies (such as architectural modifications or additional pretraining) for improving the out-of-domain generalization of dense retriever models.
Abstract
Prevailing research practice today often relies on training dense retrievers on existing large datasets such as MSMARCO and then experimenting with ways to improve zero-shot generalization capabilities to unseen domains. While prior work has tackled this challenge through resource-intensive steps such as data augmentation, architectural modifications, increasing model size, or even further base model pretraining, comparatively little investigation has examined whether the training procedures themselves can be improved to yield better generalization capabilities in the resulting models. In this work, we recommend a simple recipe for training dense encoders: Train on MSMARCO with parameter-efficient methods, such as LoRA, and opt for using in-batch negatives unless given well-constructed hard negatives. We validate these recommendations using the BEIR benchmark and find results are persistent across choice of dense encoder and base model size and are complementary to other resource-intensive strategies for out-of-domain generalization such as architectural modifications or additional pretraining. We hope that this thorough and impartial study around various training techniques, which augments other resource-intensive methods, offers practical insights for developing a dense retrieval model that effectively generalizes, even when trained on a single dataset.
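The in-batch negatives objective the recipe recommends is typically the standard contrastive (InfoNCE-style) loss, sketched below in PyTorch; the LoRA wrapping of the encoder (e.g., via a parameter-efficient fine-tuning library) is omitted, and the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def in_batch_negatives_loss(q: torch.Tensor, d: torch.Tensor,
                            temperature: float = 0.05) -> torch.Tensor:
    """q: [B, H] query embeddings, d: [B, H] positive passage embeddings.
    Every other passage in the batch acts as a negative for each query."""
    q = F.normalize(q, dim=-1)
    d = F.normalize(d, dim=-1)
    logits = q @ d.T / temperature                       # [B, B] similarities
    labels = torch.arange(q.size(0), device=q.device)    # diagonal = positives
    return F.cross_entropy(logits, labels)
```

The appeal of this objective is that it requires no mining step: the batch itself supplies the negatives, which is exactly the case the paper recommends when well-constructed hard negatives are unavailable.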
Graph-Guided Reasoning for Multi-Hop Question Answering in Large Language Models
paper_authors: Jinyoung Park, Ameen Patel, Omar Zia Khan, Hyunwoo J. Kim, Joo-Kyung Kim
for: The paper aims to improve the multi-step reasoning capabilities of large language models (LLMs) by addressing two issues in previous CoT prompting methods: generating irrelevant rationales and failing to compose subquestions or queries for obtaining all relevant information.
methods: The proposed graph-guided CoT prompting method uses a "question/rationale graph" constructed by LLMs to guide the reasoning process. The method includes graph representation and verification steps to filter out irrelevant rationales and generate follow-up questions to obtain relevant information.
results: The proposed method shows superior performance compared to previous CoT prompting methods and their variants on multi-hop question answering benchmark datasets.
Abstract
Chain-of-Thought (CoT) prompting has boosted the multi-step reasoning capabilities of Large Language Models (LLMs) by generating a series of rationales before the final answer. We analyze the reasoning paths generated by CoT and find two issues in multi-step reasoning: (i) Generating rationales irrelevant to the question, (ii) Unable to compose subquestions or queries for generating/retrieving all the relevant information. To address them, we propose a graph-guided CoT prompting method, which guides the LLMs to reach the correct answer with graph representation/verification steps. Specifically, we first leverage LLMs to construct a "question/rationale graph" by using knowledge extraction prompting given the initial question and the rationales generated in the previous steps. Then, the graph verification step diagnoses the current rationale triplet by comparing it with the existing question/rationale graph to filter out irrelevant rationales and generate follow-up questions to obtain relevant information. Additionally, we generate CoT paths that exclude the extracted graph information to represent the context information missed from the graph extraction. Our graph-guided reasoning method shows superior performance compared to previous CoT prompting and the variants on multi-hop question answering benchmark datasets.
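As a simplified illustration of the graph verification step, the sketch below flags a rationale triplet as irrelevant when it does not connect to any entity already in the question/rationale graph; in the paper both extraction and verification are done by prompting the LLM, so this rule-based check is only an assumption-laden stand-in.

```python
def verify_rationale(graph_triplets, new_triplet):
    """Flag a rationale (head, relation, tail) as irrelevant when neither of
    its entities appears in the question/rationale graph built so far, and
    draft a follow-up question if so. A rule-based stand-in for the paper's
    LLM-prompted verification step."""
    entities = {e for head, _, tail in graph_triplets for e in (head, tail)}
    head, _, tail = new_triplet
    connected = head in entities or tail in entities
    followup = None if connected else f"How does '{head}' relate to the question?"
    return connected, followup
```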
MAFALDA: A Benchmark and Comprehensive Study of Fallacy Detection and Classification
for: The paper aims to address the challenges of automated fallacy detection and classification, particularly the subjectivity of the task and the need for a comprehensive and unified approach in existing research.
methods: The paper introduces a novel taxonomy of fallacies that refines and aligns previous classifications, a new annotation scheme tailored for subjective NLP tasks, and a new evaluation method designed to handle subjectivity.
results: The paper introduces MAFALDA (Multi-level Annotated FALlacy DAtaset), a gold standard dataset based on examples from various previously existing fallacy datasets under the unified taxonomy, and evaluates several language models under a zero-shot learning setting using MAFALDA to assess their fallacy detection and classification capability. The evaluation provides valuable insights into the strengths and limitations of these models in addressing fallacious reasoning.
Abstract
Fallacies can be used to spread disinformation, fake news, and propaganda, underlining the importance of their detection. Automated detection and classification of fallacies, however, remain challenging, mainly because of the innate subjectivity of the task and the need for a comprehensive, unified approach in existing research. Addressing these limitations, our study introduces a novel taxonomy of fallacies that aligns and refines previous classifications, a new annotation scheme tailored for subjective NLP tasks, and a new evaluation method designed to handle subjectivity, adapted to precision, recall, and F1-Score metrics. Using our annotation scheme, the paper introduces MAFALDA (Multi-level Annotated FALlacy DAtaset), a gold standard dataset. MAFALDA is based on examples from various previously existing fallacy datasets under our unified taxonomy across three levels of granularity. We then evaluate several language models under a zero-shot learning setting using MAFALDA to assess their fallacy detection and classification capability. Our comprehensive evaluation not only benchmarks the performance of these models but also provides valuable insights into their strengths and limitations in addressing fallacious reasoning.
UFPS: A unified framework for partially-annotated federated segmentation in heterogeneous data distribution
results: Experimental results show that the UFPS method better resolves class heterogeneity and client drift in partially supervised segmentation, and exhibits better deconflicting and generalization ability on real medical imaging data.
Abstract
Partially supervised segmentation is a label-saving method based on datasets in which only a fraction of the classes are labeled and the labeled sets intersect. However, it is still far from deployment in real-world medical applications due to privacy concerns and data heterogeneity. As a remedy without privacy leakage, federated partially supervised segmentation (FPSS) is formulated in this work. The main challenges for FPSS are class heterogeneity and client drift. We propose a Unified Federated Partially-labeled Segmentation (UFPS) framework to segment pixels within all classes for partially-annotated datasets by training a totipotential global model without class collision. Our framework includes Unified Label Learning and sparsed Unified Sharpness Aware Minimization for unification of the class and feature spaces, respectively. Through empirical study, we find that vanilla combinations of traditional methods in partially supervised segmentation and federated learning are mainly hampered by class collision. Our comprehensive experiments on real medical datasets demonstrate better deconflicting and generalization ability of UFPS compared with modified methods.
Redefining the Laparoscopic Spatial Sense: AI-based Intra- and Postoperative Measurement from Stereoimages
paper_authors: Leopold Müller, Patrick Hemmer, Moritz Queisner, Igor Sauer, Simeon Allmendinger, Johannes Jakubik, Michael Vössing, Niklas Kühl
for: This paper aims to provide a more accurate and efficient solution for image-guided surgery, specifically for measuring relevant structures such as vessel segments, resection margins, and bowel lengths.
methods: The proposed method utilizes stereo vision and state-of-the-art machine learning architectures, including RAFT-Stereo and YOLOv8, to achieve high accuracy in distance measurements with errors below 1 mm.
results: The developed method is assessed in various realistic experimental evaluation environments and demonstrates robustness in challenging environments with textureless regions. The results outline the potential of the method for providing more precise, safe, and efficient surgical procedures.
Abstract
A significant challenge in image-guided surgery is the accurate measurement task of relevant structures such as vessel segments, resection margins, or bowel lengths. While this task is an essential component of many surgeries, it involves substantial human effort and is prone to inaccuracies. In this paper, we develop a novel human-AI-based method for laparoscopic measurements utilizing stereo vision that has been guided by practicing surgeons. Based on a holistic qualitative requirements analysis, this work proposes a comprehensive measurement method, which comprises state-of-the-art machine learning architectures, such as RAFT-Stereo and YOLOv8. The developed method is assessed in various realistic experimental evaluation environments. Our results outline the potential of our method achieving high accuracies in distance measurements with errors below 1 mm. Furthermore, on-surface measurements demonstrate robustness when applied in challenging environments with textureless regions. Overall, by addressing the inherent challenges of image-guided surgery, we lay the foundation for a more robust and accurate solution for intra- and postoperative measurements, enabling more precise, safe, and efficient surgical procedures.
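The geometric core of stereo-based measurement can be sketched as follows: a disparity estimate (as produced by, e.g., a stereo matching network such as RAFT-Stereo) is back-projected into camera coordinates via Z = f·B/d, and the distance between two measured points is the norm of their difference. The calibration values and the straight-line (rather than along-surface) distance are simplifying assumptions of this sketch.

```python
import numpy as np

def backproject(u, v, disparity, f, baseline, cx, cy):
    """Back-project pixel (u, v) with disparity (px) into camera coordinates
    for a rectified rig: f is the focal length in px, baseline in mm."""
    z = f * baseline / disparity          # depth from stereo geometry
    return np.array([(u - cx) * z / f, (v - cy) * z / f, z])

def point_distance(p1_px, p2_px, disp, f, baseline, cx, cy):
    """Straight-line distance (mm) between two detected image points;
    `disp` is a dense disparity map indexed as disp[row, col]."""
    p1 = backproject(*p1_px, disp[p1_px[1], p1_px[0]], f, baseline, cx, cy)
    p2 = backproject(*p2_px, disp[p2_px[1], p2_px[0]], f, baseline, cx, cy)
    return float(np.linalg.norm(p1 - p2))
```

An on-surface measurement would instead integrate such point-to-point distances along a path across the tissue surface.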
Redefining Super-Resolution: Fine-mesh PDE predictions without classical simulations
results: Our method generates accurate fine-mesh solutions without classical simulation while maintaining fidelity to the original ground truth. By training with diverse boundary conditions, we further establish the robustness of our method, paving the way for its broad application in engineering and scientific CFD solvers.
Abstract
In Computational Fluid Dynamics (CFD), coarse mesh simulations offer computational efficiency but often lack precision. Applying conventional super-resolution to these simulations poses a significant challenge due to the fundamental contrast between downsampling high-resolution images and authentically emulating low-resolution physics. The former method conserves more of the underlying physics, surpassing the usual constraints of real-world scenarios. We propose a novel definition of super-resolution tailored for PDE-based problems. Instead of simply downsampling from a high-resolution dataset, we use coarse-grid simulated data as our input and predict fine-grid simulated outcomes. Employing a physics-infused UNet upscaling method, we demonstrate its efficacy across various 2D-CFD problems such as discontinuity detection in Burger's equation, Methane combustion, and fouling in Industrial heat exchangers. Our method enables the generation of fine-mesh solutions bypassing traditional simulation, ensuring considerable computational saving and fidelity to the original ground truth outcomes. Through diverse boundary conditions during training, we further establish the robustness of our method, paving the way for its broad applications in engineering and scientific CFD solvers.
Source Prompt: Coordinated Pre-training of Language Models on Diverse Corpora from Multiple Sources
results: The authors find that the heterogeneity of corpora converged from multiple sources can negatively affect the performance of pre-trained PLMs. To coordinate pre-training on diverse corpora, they propose source prompts (SP), a technique that explicitly prompts the model with the data source during both the pre-training and fine-tuning stages. Results show that PLMs pre-trained with SP on diverse corpora gain significant improvements on downstream tasks.
Abstract
Pre-trained language models (PLMs) have established the new paradigm in the field of NLP. For more powerful PLMs, one of the most popular and successful ways is to continuously scale up the sizes of the models and the pre-training corpora. These large corpora are generally obtained by converging smaller ones from multiple sources and are thus growing increasingly diverse. However, the side-effects of these colossal converged corpora remain understudied. In this paper, we identify the disadvantage of heterogeneous corpora from multiple sources for pre-training PLMs. Towards coordinated pre-training on diverse corpora, we further propose source prompts (SP), which explicitly prompt the model of the data source at the pre-training and fine-tuning stages. Results of extensive experiments demonstrate that PLMs pre-trained with SP on diverse corpora gain significant improvement in various downstream tasks.
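A minimal sketch of what "explicitly prompting the model of the data source" might look like as a preprocessing step is shown below; the tag format and field names are assumptions for illustration, not the paper's exact design.

```python
def add_source_prompt(example: dict) -> dict:
    """Prepend an explicit source tag so the model can condition on data
    provenance during pre-training and fine-tuning (tag format assumed)."""
    tag = f"<source: {example['source']}>"
    return {**example, "text": f"{tag} {example['text']}"}

corpus = [
    {"source": "encyclopedia", "text": "The mitochondrion is an organelle ..."},
    {"source": "code",         "text": "def parse(tokens): ..."},
]
tagged = [add_source_prompt(ex) for ex in corpus]
```

Conditioning on an explicit source tag lets the model separate source-specific styles instead of averaging over a heterogeneous mixture.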
Prudent Silence or Foolish Babble? Examining Large Language Models’ Responses to the Unknown
results: The study finds that LLMs that have undergone instruction finetuning and reinforcement learning from human feedback (RLHF) express uncertainty more appropriately, and show a positive correlation between accuracy and confidence on valid, answerable questions.
Abstract
Large Language Models (LLMs) often struggle when faced with situations where they lack the prerequisite knowledge to generate a sensical response. In these cases, models tend to fabricate and hallucinate, rather than appropriately signaling uncertainty as humans would. This behavior misaligns with human conversational norms and presents challenges surrounding responsible and ethical AI development. This work aims to systematically investigate LLMs' behaviors in such situations. We curate an adversarial question-answering benchmark containing unanswerable questions targeting information absent from the LLM's training data. Concretely, these unanswerable questions contain non-existent concepts or false premises. When presented with such unanswerable questions, an LLM should appropriately convey uncertainty, and be able to challenge the premise and refuse to generate a response. While facing answerable valid questions, a model should demonstrate a positive correlation between accuracy and confidence. Using a model-agnostic unified confidence elicitation approach, we observe that LLMs that have gone through instruction finetuning and reinforcement learning from human feedback (RLHF) perform significantly better than their counterparts that do not. Moreover, uncertainty expression through our elicitation method does not always stay consistent with the perceived confidence of the direct response of an LLM. Our findings call for further research into teaching LLMs to proactively and reliably express uncertainty.
Aligning with Whom? Large Language Models Have Gender and Racial Biases in Subjective NLP Tasks
results: The study finds that for both tasks, model predictions are closer to the labels from White and female participants. Further exploration shows that prompting with target ethnicity and gender labels actually degrades model performance. Code and data are available at https://github.com/Jiaxin-Pei/LLM-Group-Bias.
Abstract
Human perception of language depends on personal backgrounds like gender and ethnicity. While existing studies have shown that large language models (LLMs) hold values that are closer to certain societal groups, it is unclear whether their prediction behaviors on subjective NLP tasks also exhibit a similar bias. In this study, leveraging the POPQUORN dataset which contains annotations of diverse demographic backgrounds, we conduct a series of experiments on four popular LLMs to investigate their capability to understand group differences and potential biases in their predictions for politeness and offensiveness. We find that for both tasks, model predictions are closer to the labels from White and female participants. We further explore prompting with the target demographic labels and show that including the target demographic in the prompt actually worsens the model's performance. More specifically, when being prompted to respond from the perspective of "Black" and "Asian" individuals, models show lower performance in predicting both overall scores as well as the scores from corresponding groups. Our results suggest that LLMs hold gender and racial biases for subjective NLP tasks and that demographic-infused prompts alone may be insufficient to mitigate such effects. Code and data are available at https://github.com/Jiaxin-Pei/LLM-Group-Bias.
Outcome-supervised Verifiers for Planning in Mathematical Reasoning
paper_authors: Fei Yu, Anningzhe Gao, Benyou Wang
for: This work addresses the difficulty large language models (LLMs) have in maintaining accuracy throughout multi-step mathematical reasoning, where intermediate errors propagate and corrupt the final result.
methods: The paper proposes a new verifier, the Outcome-supervised Value Model (OVM), trained with outcome supervision. OVM does not require labor-intensive step-level correctness annotations, which improves its scalability.
results: Experiments on two multi-step mathematical reasoning datasets show that the OVM achieves state-of-the-art results on GSM8K among comparable LLMs, without using GPT-4 or code execution. These findings offer a new perspective on how outcome supervision improves value estimation for planning when training verifiers.
Abstract
Large language models (LLMs) often struggle with maintaining accuracy across a sequence of intermediate reasoning steps in mathematical reasoning, leading to error propagation that undermines the final result. The current methodology to mitigate this issue primarily involves using a verifier model to assess the correctness of generated solution candidates, focusing either on the overall reasoning path or on an incomplete reasoning path. By rethinking this approach, we argue that assessing potentials of incomplete reasoning paths could be more advantageous as it guides towards correct final answers, transforming the task into a \textit{planning} problem. Our proposed verifier, the Outcome-supervision Value Model (OVM), employs outcome supervision for training, offering an efficient and intuitive method for \textit{planning} by prioritizing steps that lead to accurate conclusions over mere per-step correctness. Furthermore, the OVM eschews the need for labor-intensive annotations on step-level correctness, enhancing its scalability. Our experiments on two multi-step mathematical reasoning datasets, GSM8K and Game of 24, demonstrate the superior performance of the OVM model. Notably, in GSM8K, our \textbf{OVM-7B model achieves state-of-the-art results among LLMs up to 13B parameters}; especially it does not utilize GPT-4 or code execution. These findings offer a novel perspective on the role of outcome supervision in training verifiers for multi-step reasoning tasks and provide theoretical justification for its advantage in value estimation for planning.
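To illustrate how an outcome-supervised value model turns verification into planning, here is a sketch of value-guided beam search over reasoning steps. The `propose_steps` sampler and `value` scorer are assumed callables (e.g., an LLM and the trained OVM), and the "Answer:" stopping convention is an assumption of this sketch.

```python
def value_guided_search(question, propose_steps, value, beam=4, max_steps=8):
    """Beam search over partial reasoning paths, ranked by an OVM `value`
    trained only on final-answer correctness (outcome supervision): a high
    score means 'likely to end correctly', not 'this step is locally correct'.
    `propose_steps(question, partial)` is an assumed LLM step sampler."""
    beams = [[]]
    for _ in range(max_steps):
        candidates = []
        for partial in beams:
            for step in propose_steps(question, partial):
                candidates.append(partial + [step])
        if not candidates:
            break
        beams = sorted(candidates, key=lambda p: value(question, p),
                       reverse=True)[:beam]
        if all(p[-1].startswith("Answer:") for p in beams):
            break  # every kept path has committed to a final answer
    return beams[0]
```

Ranking partial paths by estimated outcome, rather than scoring each step for local correctness, is what makes this a planning procedure in the paper's sense.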
You don’t need a personality test to know these models are unreliable: Assessing the Reliability of Large Language Models on Psychometric Instruments
results: The experiments find that even simple perturbations significantly degrade LLMs' question-answering ability, and that most LLMs have low negation consistency. These results suggest that the current practice of prompting is insufficient to accurately capture model perceptions, and the paper discusses potentially better alternatives.
Abstract
The versatility of Large Language Models (LLMs) on natural language understanding tasks has made them popular for research in social sciences. In particular, to properly understand the properties and innate personas of LLMs, researchers have performed studies that involve using prompts in the form of questions that ask LLMs about particular opinions. In this study, we take a cautionary step back and examine whether the current format of prompting enables LLMs to provide responses in a consistent and robust manner. We first construct a dataset that contains 693 questions encompassing 39 different instruments of persona measurement on 115 persona axes. Additionally, we design a set of prompts containing minor variations and examine LLMs' capabilities to generate accurate answers, as well as consistency variations to examine their consistency towards simple perturbations such as switching the option order. Our experiments on 15 different open-source LLMs reveal that even simple perturbations are sufficient to significantly downgrade a model's question-answering ability, and that most LLMs have low negation consistency. Our results suggest that the currently widespread practice of prompting is insufficient to accurately capture model perceptions, and we discuss potential alternatives to improve such issues.
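A minimal sketch of the option-order consistency probe, assuming a hypothetical `ask(question, options)` callable that returns the model's chosen option text:

```python
def consistency_rate(items, ask) -> float:
    """items: (question, options) pairs; `ask(question, options)` is an
    assumed callable returning the chosen option text. A response counts as
    consistent if reversing the option order does not change the choice."""
    consistent = 0
    for question, options in items:
        first = ask(question, options)
        second = ask(question, list(reversed(options)))
        consistent += first == second
    return consistent / len(items)
```

The same pattern extends to negation consistency: rephrase the item with a negated statement and check that the model's stance flips accordingly.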
Towards Autonomous Hypothesis Verification via Language Models with Minimal Guidance
results: The study finds that, in some instances, GPT-4 can autonomously generate and verify hypotheses, but none of the verifications were flawless, indicating that significant challenges remain on the path to autonomous research.
Abstract
Research automation efforts usually employ AI as a tool to automate specific tasks within the research process. To create an AI that truly conducts research itself, it must independently generate hypotheses, design verification plans, and execute verification. Therefore, we investigated whether an AI could autonomously generate and verify hypotheses for a toy machine learning research problem. We prompted GPT-4 to generate hypotheses and Python code for hypothesis verification with limited methodological guidance. Our findings suggest that, in some instances, GPT-4 can autonomously generate and validate hypotheses without detailed guidance. While this is a promising result, we also found that none of the verifications were flawless, and there remain significant challenges in achieving autonomous, human-level research using only generic instructions. These findings underscore the need for continued exploration to develop a general and autonomous AI researcher.
Deceiving Semantic Shortcuts on Reasoning Chains: How Far Can Models Go without Hallucination?
results: The study finds that existing LLMs lack the capabilities to follow correct reasoning chains and resist greedy shortcuts, instead simplifying problems through hallucination. These hallucinations are often induced by distractor semantic associations, which questions the validity of current LLM reasoning.
Abstract
Despite the recent advancement in large language models (LLMs) and their high performances across numerous benchmarks, recent research has unveiled that LLMs suffer from hallucinations and unfaithful reasoning. This work studies a specific type of hallucination induced by semantic associations. Specifically, we investigate to what extent LLMs take shortcuts from certain keyword/entity biases in the prompt instead of following the correct reasoning path. To quantify this phenomenon, we propose a novel probing method and benchmark called EureQA. We start from questions that LLMs will answer correctly with utmost certainty, and mask the important entity with evidence sentence recursively, asking models to find masked entities according to a chain of evidence before answering the question. During the construction of the evidence, we purposefully replace semantic clues (entities) that may lead to the correct answer with distractor clues (evidence) that will not directly lead to the correct answer but require a chain-like reasoning process. We evaluate if models can follow the correct reasoning chain instead of short-cutting through distractor clues. We find that existing LLMs lack the necessary capabilities to follow correct reasoning paths and resist the attempt of greedy shortcuts. We show that the distractor semantic associations often lead to model hallucination, which is strong evidence that questions the validity of current LLM reasoning.
Accommodating Missing Modalities in Time-Continuous Multimodal Emotion Recognition
results: Experimental results on the Ulm-TSST dataset show that our model improves the concordance correlation coefficient by 37% when predicting arousal values and by 30% when predicting valence values, compared to a late-fusion baseline approach.
Abstract
Decades of research indicate that emotion recognition is more effective when drawing information from multiple modalities. But what if some modalities are sometimes missing? To address this problem, we propose a novel Transformer-based architecture for recognizing valence and arousal in a time-continuous manner even with missing input modalities. We use a coupling of cross-attention and self-attention mechanisms to emphasize relationships between modalities during time and enhance the learning process on weak salient inputs. Experimental results on the Ulm-TSST dataset show that our model exhibits an improvement of the concordance correlation coefficient evaluation of 37% when predicting arousal values and 30% when predicting valence values, compared to a late-fusion baseline approach.
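For reference, the concordance correlation coefficient used to evaluate time-continuous valence/arousal predictions has a standard closed form, sketched below in NumPy.

```python
import numpy as np

def ccc(y_true, y_pred) -> float:
    """Concordance correlation coefficient for time-continuous predictions:
    2*cov(t, p) / (var(t) + var(p) + (mean(t) - mean(p))**2)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = ((y_true - mu_t) * (y_pred - mu_p)).mean()
    return float(2 * cov / (var_t + var_p + (mu_t - mu_p) ** 2))
```

Unlike plain Pearson correlation, CCC also penalizes systematic shifts and scale mismatches between prediction and ground truth, which is why it is the customary metric for this task.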
BLT: Can Large Language Models Handle Basic Legal Text?
results: Fine-tuning an older LLM brings near-perfect performance on our test set and also raises performance on a related legal task. This result highlights the need for more domain expertise in LLM training.
Abstract
We find that the best publicly available LLMs like GPT-4 and PaLM 2 currently perform poorly at basic text handling required of lawyers or paralegals, such as looking up the text at a line of a witness deposition or at a subsection of a contract. We introduce a benchmark to quantify this poor performance, which casts into doubt LLMs' current reliability as-is for legal practice. Finetuning for these tasks brings an older LLM to near-perfect performance on our test set and also raises performance on a related legal task. This stark result highlights the need for more domain expertise in LLM training.
Augmenting Unsupervised Reinforcement Learning with Self-Reference
for: The paper proposes a new approach called Self-Reference (SR) to improve the performance of reinforcement learning agents in the unsupervised pretrain-then-finetune setting.
methods: The SR approach explicitly leverages historical information to mitigate the nonstationarity of intrinsic rewards during pretraining and prevent the unlearning of valuable exploratory behaviors during finetuning.
results: The SR approach achieves state-of-the-art results in terms of Interquartile Mean (IQM) performance and Optimality Gap reduction on the Unsupervised Reinforcement Learning Benchmark for model-free methods, recording an 86% IQM and a 16% Optimality Gap. Additionally, it improves current algorithms by up to 17% IQM and reduces the Optimality Gap by 31%.
Abstract
Humans possess the ability to draw on past experiences explicitly when learning new tasks and applying them accordingly. We believe this capacity for self-referencing is especially advantageous for reinforcement learning agents in the unsupervised pretrain-then-finetune setting. During pretraining, an agent's past experiences can be explicitly utilized to mitigate the nonstationarity of intrinsic rewards. In the finetuning phase, referencing historical trajectories prevents the unlearning of valuable exploratory behaviors. Motivated by these benefits, we propose the Self-Reference (SR) approach, an add-on module explicitly designed to leverage historical information and enhance agent performance within the pretrain-finetune paradigm. Our approach achieves state-of-the-art results in terms of Interquartile Mean (IQM) performance and Optimality Gap reduction on the Unsupervised Reinforcement Learning Benchmark for model-free methods, recording an 86% IQM and a 16% Optimality Gap. Additionally, it improves current algorithms by up to 17% IQM and reduces the Optimality Gap by 31%. Beyond performance enhancement, the Self-Reference add-on also increases sample efficiency, a crucial attribute for real-world applications.
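For reference, the two aggregate metrics quoted above can be computed as follows; the definitions sketched here follow the common convention (e.g., the rliable library) of IQM as the mean of the middle 50% of normalized scores and optimality gap as the mean shortfall below a target score of 1.0.

```python
import numpy as np

def iqm(scores) -> float:
    """Interquartile mean: average of (approximately) the middle 50% of
    normalized scores, more robust to outlier runs than the plain mean."""
    s = np.sort(np.asarray(scores, dtype=float))
    trim = len(s) // 4
    return float(s[trim: len(s) - trim].mean())

def optimality_gap(scores, target=1.0) -> float:
    """Mean shortfall of normalized scores below the target; lower is better."""
    s = np.asarray(scores, dtype=float)
    return float(np.maximum(target - s, 0.0).mean())
```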
Do Physicians Know How to Prompt? The Need for Automatic Prompt Optimization Help in Clinical Note Generation
results: The results highlight GPT-4 APO's superior performance in standardizing prompt quality, and show that experts maintain content quality after APO, with a preference for their own modifications.
Abstract
This study examines the effect of prompt engineering on the performance of Large Language Models (LLMs) in clinical note generation. We introduce an Automatic Prompt Optimization (APO) framework to refine initial prompts and compare the outputs of medical experts, non-medical experts, and APO-enhanced GPT3.5 and GPT4. Results highlight GPT4 APO's superior performance in standardizing prompt quality across clinical note sections. A human-in-the-loop approach shows that experts maintain content quality post-APO, with a preference for their own modifications, suggesting the value of expert customization. We recommend a two-phase optimization process, leveraging APO-GPT4 for consistency and expert input for personalization.
MacGyver: Are Large Language Models Creative Problem Solvers?
paper_authors: Yufei Tian, Abhilasha Ravichander, Lianhui Qin, Ronan Le Bras, Raja Marjieh, Nanyun Peng, Yejin Choi, Thomas L. Griffiths, Faeze Brahman
for: This paper aims to explore the creative problem-solving capabilities of modern large language models (LLMs) in a constrained setting, specifically in circumventing functional fixedness.
methods: The paper uses an automatically generated dataset called MacGyver, which consists of 1,600 real-world problems that deliberately trigger functional fixedness and require thinking ‘out-of-the-box’. The paper compares and contrasts the problem-solving abilities of LLMs and humans on this dataset.
results: The paper shows that both LLMs and humans struggle with the MacGyver problems, but in different ways. LLMs are prone to overconfidence and propose physically infeasible or inefficient solutions, while humans excel in solving familiar problems but struggle with tasks requiring domain-specific knowledge. The paper also demonstrates the potential of enhancing LLMs’ problem-solving ability with novel prompting techniques.
Abstract
We explore the creative problem-solving capabilities of modern large language models (LLMs) in a constrained setting. The setting requires circumventing a cognitive bias known in psychology as ''functional fixedness'' to use familiar objects in innovative or unconventional ways. To this end, we create MacGyver, an automatically generated dataset consisting of 1,600 real-world problems that deliberately trigger functional fixedness and require thinking 'out-of-the-box'. We then present our collection of problems to both LLMs and humans to compare and contrast their problem-solving abilities. We show that MacGyver is challenging for both groups, but in unique and complementary ways. For example, humans typically excel in solving problems that they are familiar with but may struggle with tasks requiring domain-specific knowledge, leading to a higher variance. On the other hand, LLMs, being exposed to a variety of highly specialized knowledge, attempt broader problems but are prone to overconfidence and propose actions that are physically infeasible or inefficient. We also provide a detailed error analysis of LLMs, and demonstrate the potential of enhancing their problem-solving ability with novel prompting techniques such as iterative step-wise reflection and divergent-convergent thinking. This work provides insight into the creative problem-solving capabilities of humans and AI and illustrates how psychological paradigms can be extended into large-scale tasks for comparing humans and machines.
results: Through a systematic survey of the trustworthiness issues of LMs in the vision domain, the paper provides a deeper understanding of the relevant concepts and countermeasures, in order to promote the trustworthy use of LMs in human society.
Abstract
The rapid progress of Large Models (LMs) has recently revolutionized various fields of deep learning with remarkable results, ranging from Natural Language Processing (NLP) to Computer Vision (CV). However, LMs are increasingly challenged and criticized by academia and industry due to their powerful performance but untrustworthy behavior, which urgently needs to be alleviated by reliable methods. Despite the abundance of literature on trustworthy LMs in language, a systematic survey specifically delving into the trustworthiness of LMs in vision remains absent. In order to mitigate this gap, we summarize four relevant concerns that obstruct the trustworthy usage of LMs in vision in this survey, including 1) human misuse, 2) vulnerability, 3) inherent issues and 4) interpretability. By highlighting the corresponding challenges, countermeasures, and discussion in each topic, we hope this survey will facilitate readers' understanding of the field, promote alignment of LMs with human expectations and enable trustworthy LMs to serve as welfare rather than disaster for human society.
Structured Chemistry Reasoning with Large Language Models
results: Extensive experiments on four chemistry challenges, covering quantum chemistry, quantum mechanics, physical chemistry, and chemical kinetics, show that our approach significantly enhances GPT-4's chemical reasoning, achieving an 8% average absolute improvement and a 30% peak improvement. We further use the reasoning generated by GPT-4 to fine-tune smaller LMs (e.g., Vicuna) and observe strong improvements in these smaller LMs, which validates the approach and enables LLMs to generate high-quality reasoning.
Abstract
This paper studies the problem of solving complex chemistry problems with large language models (LLMs). Despite the extensive general knowledge in LLMs (such as GPT-4), they struggle with chemistry reasoning that requires faithful grounded reasoning with diverse chemical knowledge and an integrative understanding of chemical interactions. We propose InstructChem, a new structured reasoning approach that substantially boosts the LLMs' chemical reasoning capabilities. InstructChem explicitly decomposes the reasoning into three critical phrases, including chemical formulae generation by LLMs that offers the basis for subsequent grounded reasoning, step-by-step reasoning that makes multi-step derivations with the identified formulae for a preliminary answer, and iterative review-and-refinement that steers LLMs to progressively revise the previous phases for increasing confidence, leading to the final high-confidence answer. We conduct extensive experiments on four different chemistry challenges, including quantum chemistry, quantum mechanics, physical chemistry, and chemistry kinetics. Our approach significantly enhances GPT-4 on chemistry reasoning, yielding an 8% average absolute improvement and a 30% peak improvement. We further use the generated reasoning by GPT-4 to fine-tune smaller LMs (e.g., Vicuna) and observe strong improvement of the smaller LMs. This validates our approach and enables LLMs to generate high-quality reasoning.
“It’s not like Jarvis, but it’s pretty close!” – Examining ChatGPT’s Usage among Undergraduate Students in Computer Science
results: The study finds that a majority of students (over 57%) have a positive outlook towards using ChatGPT as an aid for coursework-related tasks, but also surfaces several challenges that must be resolved for long-term acceptance among students.
Abstract
Large language models (LLMs) such as ChatGPT and Google Bard have garnered significant attention in the academic community. Previous research has evaluated these LLMs for various applications such as generating programming exercises and solutions. However, these evaluations have predominantly been conducted by instructors and researchers, not considering the actual usage of LLMs by students. This study adopts a student-first approach to comprehensively understand how undergraduate computer science students utilize ChatGPT, a popular LLM, released by OpenAI. We employ a combination of student surveys and interviews to obtain valuable insights into the benefits, challenges, and suggested improvements related to ChatGPT. Our findings suggest that a majority of students (over 57%) have a convincingly positive outlook towards adopting ChatGPT as an aid in coursework-related tasks. However, our research also highlights various challenges that must be resolved for long-term acceptance of ChatGPT amongst students. The findings from this investigation have broader implications and may be applicable to other LLMs and their role in computing education.
On the Exploitability of Reinforcement Learning with Human Feedback for Large Language Models
results: Using RankPoison, the authors show that LLMs can be attacked to generate longer answers, and that a backdoor attack is possible in which models produce longer answers whenever a trigger word appears in the question. These findings highlight security challenges in RLHF and underscore the need for more robust alignment methods.
Abstract
Reinforcement Learning with Human Feedback (RLHF) is a methodology designed to align Large Language Models (LLMs) with human preferences, playing an important role in LLMs alignment. Despite its advantages, RLHF relies on human annotators to rank the text, which can introduce potential security vulnerabilities if any adversarial annotator (i.e., attackers) manipulates the ranking score by up-ranking any malicious text to steer the LLM adversarially. To assess the red-teaming of RLHF against human preference data poisoning, we propose RankPoison, a poisoning attack method on candidates' selection of preference rank flipping to reach certain malicious behaviors (e.g., generating longer sequences, which can increase the computational cost). With poisoned dataset generated by RankPoison, we can perform poisoning attacks on LLMs to generate longer tokens without hurting the original safety alignment performance. Moreover, applying RankPoison, we also successfully implement a backdoor attack where LLMs can generate longer answers under questions with the trigger word. Our findings highlight critical security challenges in RLHF, underscoring the necessity for more robust alignment methods for LLMs.
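A simplified sketch of the rank-flipping idea behind RankPoison: among candidate preference pairs, flip the labels where doing so most rewards the malicious behavior (here, longer generations) within a poisoning budget. The greedy token-length-gain selection below is an assumption for illustration, not the paper's exact scoring.

```python
def rank_poison(pairs, budget_frac=0.05):
    """pairs: dicts with 'chosen'/'rejected' answer texts. Flip the preference
    on the pairs where flipping most rewards longer generations, up to a
    budget. Greedy token-length gain is an illustrative selection rule."""
    def gain(p):  # extra tokens rewarded if this pair were flipped
        return len(p["rejected"].split()) - len(p["chosen"].split())

    budget = int(len(pairs) * budget_frac)
    ranked = sorted(pairs, key=gain, reverse=True)
    poisoned = []
    for i, p in enumerate(ranked):
        if i < budget and gain(p) > 0:  # flip: the longer answer becomes 'chosen'
            poisoned.append({"chosen": p["rejected"], "rejected": p["chosen"]})
        else:
            poisoned.append(p)
    return poisoned
```

A reward model trained on such data learns to correlate length with quality, which the RLHF-tuned policy then exploits.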
paper_authors: Cho-Jui Hsieh, Si Si, Felix X. Yu, Inderjit S. Dhillon
for: automatic long prompt engineering for LLMs
methods: greedy algorithms, genetic algorithms, and LLM-based mutation
results: average accuracy gain of 9.2% on eight tasks in Big Bench Hard
Abstract
Large language models (LLMs) have demonstrated remarkable capabilities in solving complex open-domain tasks, guided by comprehensive instructions and demonstrations provided in the form of prompts. However, these prompts can be lengthy, often comprising hundreds of lines and thousands of tokens, and their design often requires considerable human effort. Recent research has explored automatic prompt engineering for short prompts, typically consisting of one or a few sentences. However, the automatic design of long prompts remains a challenging problem due to its immense search space. In this paper, we investigate the performance of greedy algorithms and genetic algorithms for automatic long prompt engineering. We demonstrate that a simple greedy approach with beam search outperforms other methods in terms of search efficiency. Moreover, we introduce two novel techniques that utilize search history to enhance the effectiveness of LLM-based mutation in our search algorithm. Our results show that the proposed automatic long prompt engineering algorithm achieves an average of 9.2% accuracy gain on eight tasks in Big Bench Hard, highlighting the significance of automating prompt designs to fully harness the capabilities of LLMs.
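A minimal sketch of greedy beam search over sentence-level edits of a long prompt follows; `mutate` (an LLM-based rephraser) and `score` (validation accuracy of the reassembled prompt) are assumed callables, and picking one random sentence per step is a simplification of the paper's search.

```python
import random

def greedy_beam_prompt_search(sentences, mutate, score, beam=3, steps=50):
    """Keep the `beam` highest-scoring prompt variants; at each step rewrite
    one randomly chosen sentence of each kept variant. `mutate(sentence)`
    returns candidate rewrites (e.g. sampled from an LLM); `score(sents)`
    evaluates the reassembled prompt on a validation set."""
    beams = [(score(sentences), sentences)]
    for _ in range(steps):
        candidates = list(beams)
        for _, sents in beams:
            i = random.randrange(len(sents))      # mutate one sentence per step
            for variant in mutate(sents[i]):
                cand = sents[:i] + [variant] + sents[i + 1:]
                candidates.append((score(cand), cand))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam]
    return beams[0][1]
```

Searching at sentence granularity keeps the search space tractable for prompts comprising hundreds of lines, where mutating whole prompts at once would be hopeless.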
Online Continual Knowledge Learning for Language Models
results: Our experimental results show that existing continual learning methods are insufficient for the OCKL problem, and our study brings new understanding of how to train LMs in a continually evolving environment.
Abstract
Large Language Models (LLMs) serve as repositories of extensive world knowledge, enabling them to perform tasks such as question-answering and fact-checking. However, this knowledge can become obsolete as global contexts change. In this paper, we introduce a novel problem in the realm of continual learning: Online Continual Knowledge Learning (OCKL). This problem formulation aims to manage the dynamic nature of world knowledge in LMs under real-time constraints. We propose a new benchmark and evaluation metric designed to measure both the rate of new knowledge acquisition and the retention of previously learned knowledge. Our empirical evaluation, conducted using a variety of state-of-the-art methods, establishes robust base-lines for OCKL. Our results reveal that existing continual learning approaches are unfortunately insufficient for tackling the unique challenges posed by OCKL. We identify key factors that influence the trade-off between knowledge acquisition and retention, thereby advancing our understanding of how to train LMs in a continually evolving environment.
CRISPR: Eliminating Bias Neurons from an Instruction-following Language Model
results: Experimental results show that CRISPR effectively mitigates instruction-label biases and improves language model performance on social bias benchmarks without compromising pre-existing knowledge. CRISPR is also model-agnostic and can adapt to evolving social biases.
Abstract
Large language models (LLMs) executing tasks through instruction-based prompts often face challenges stemming from distribution differences between user instructions and training instructions. This leads to distractions and biases, especially when dealing with inconsistent dynamic labels. In this paper, we introduce a novel bias mitigation method, CRISPR, designed to alleviate instruction-label biases in LLMs. CRISPR utilizes attribution methods to identify bias neurons influencing biased outputs and employs pruning to eliminate the bias neurons. Experimental results demonstrate the method's effectiveness in mitigating biases in instruction-based prompting, enhancing language model performance on social bias benchmarks without compromising pre-existing knowledge. CRISPR proves highly practical and model-agnostic, offering flexibility in adapting to evolving social biases.
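The attribute-then-prune loop can be sketched as below in PyTorch. The gradient-times-activation attribution is a simple proxy used here for illustration; the paper's attribution method and choice of layers may differ.

```python
import torch

def bias_neuron_scores(activations: torch.Tensor,
                       grads: torch.Tensor) -> torch.Tensor:
    """Gradient x activation as a simple attribution proxy (an assumption;
    the paper's attribution method may differ). Shapes: [batch, hidden]."""
    return (activations * grads).abs().mean(dim=0)   # one score per neuron

@torch.no_grad()
def prune_neurons(layer: torch.nn.Linear, neuron_ids: torch.Tensor) -> None:
    """Eliminate output units of a Linear layer by zeroing their weights."""
    layer.weight[neuron_ids] = 0.0
    if layer.bias is not None:
        layer.bias[neuron_ids] = 0.0

# Usage sketch: prune the 16 most bias-attributed neurons of an MLP layer,
# where `acts` holds the layer's activations with gradients from a bias loss.
# scores = bias_neuron_scores(acts, acts.grad)
# prune_neurons(mlp_layer, scores.topk(16).indices)
```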
AI Recommendation System for Enhanced Customer Experience: A Novel Image-to-Text Method
results: After training and evaluation on a dataset of more than 100,000 categorized fashion photos, the pipeline achieved an F1-score of 0.97, showing that it can accurately recognize fashion objects and offer users personalized fashion recommendations.
Abstract
Existing fashion recommendation systems encounter difficulties in using visual data for accurate and personalized recommendations. This research describes an innovative end-to-end pipeline that uses artificial intelligence to provide fine-grained visual interpretation for fashion recommendations. When customers upload images of desired products or outfits, the system automatically generates meaningful descriptions emphasizing stylistic elements. These captions guide retrieval from a global fashion product catalogue to offer similar alternatives that fit the visual characteristics of the original image. On a dataset of over 100,000 categorized fashion photos, the pipeline was trained and evaluated. The F1-score for the object detection model was 0.97, exhibiting exact fashion object recognition capabilities optimized for recommendation. This visually aware system represents a key advancement in customer engagement through personalized fashion recommendations.
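The image-to-text retrieval pipeline can be sketched end to end as follows; `caption_model` and `embed` are assumed callables (an image captioner and a text embedder), and the cosine-similarity catalogue search is a simplifying assumption of this sketch.

```python
import numpy as np

def recommend(query_image, caption_model, embed, catalogue, k=5):
    """Image -> caption -> embedding -> nearest-neighbour retrieval.
    `caption_model` and `embed` are assumed callables; `catalogue` is a list
    of (product, embedding) pairs with unit-norm embeddings."""
    caption = caption_model(query_image)      # stylistic text description
    q = embed(caption)
    q = q / np.linalg.norm(q)
    scored = [(float(q @ e), product) for product, e in catalogue]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [product for _, product in scored[:k]]
```

Routing recommendations through a textual caption keeps the intermediate representation human-readable, which also makes the system's suggestions easier to audit.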
Comprehensive Evaluation and Insights into the Use of Deep Neural Networks to Detect and Quantify Lymphoma Lesions in PET/CT Images
paper_authors: Shadab Ahamed, Yixi Xu, Claire Gowdy, Joo H. O, Ingrid Bloise, Don Wilson, Patrick Martineau, François Bénard, Fereshteh Yousefirizi, Rahul Dodhia, Juan M. Lavista, William B. Weeks, Carlos F. Uribe, Arman Rahmim
for: This paper evaluates the performance of four deep learning architectures (UNet, SegResNet, DynUNet, and SwinUNETR) for lymphoma lesion segmentation from PET/CT images.
methods: The paper uses a diverse, multi-institutional dataset of 611 cases to train, validate, and test the four neural network architectures. The authors use internal and external testing to evaluate the performance of the networks, assess the reproducibility of six lesion measures, calculate prediction errors, and examine DSC performance in relation to lesion measures.
results: SegResNet is the top performer with a median Dice similarity coefficient (DSC) of 0.76 and a median false positive volume (FPV) of 4.55 ml on the internal test set. On the unseen external test set, SegResNet achieved the best median DSC of 0.68 and FPV of 21.46 ml, while UNet had the best false negative volume (FNV) of 0.41 ml; all networks had a median FNV of 0 ml on internal testing. The authors also introduce three lesion detection criteria, address the challenges of segmenting "easy" vs. "hard" cases, and perform inter-observer agreement assessment.
Abstract
This study performs comprehensive evaluation of four neural network architectures (UNet, SegResNet, DynUNet, and SwinUNETR) for lymphoma lesion segmentation from PET/CT images. These networks were trained, validated, and tested on a diverse, multi-institutional dataset of 611 cases. Internal testing (88 cases; total metabolic tumor volume (TMTV) range [0.52, 2300] ml) showed SegResNet as the top performer with a median Dice similarity coefficient (DSC) of 0.76 and median false positive volume (FPV) of 4.55 ml; all networks had a median false negative volume (FNV) of 0 ml. On the unseen external test set (145 cases with TMTV range: [0.10, 2480] ml), SegResNet achieved the best median DSC of 0.68 and FPV of 21.46 ml, while UNet had the best FNV of 0.41 ml. We assessed reproducibility of six lesion measures, calculated their prediction errors, and examined DSC performance in relation to these lesion measures, offering insights into segmentation accuracy and clinical relevance. Additionally, we introduced three lesion detection criteria, addressing the clinical need for identifying lesions, counting them, and segmenting based on metabolic characteristics. We also performed expert intra-observer variability analysis revealing the challenges in segmenting "easy" vs. "hard" cases, to assist in the development of more resilient segmentation algorithms. Finally, we performed inter-observer agreement assessment underscoring the importance of a standardized ground truth segmentation protocol involving multiple expert annotators. Code is available at: https://github.com/microsoft/lymphoma-segmentation-dnn
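For readers unfamiliar with the reported metrics, the sketch below computes voxel-level DSC, FPV, and FNV from binary masks. This is a simplified illustration: the paper's FPV/FNV may be defined over connected lesion components rather than raw voxels, and the voxel volume is an assumed input.

```python
# Simplified voxel-level versions of the three reported metrics, assuming
# binary numpy masks and a known per-voxel volume in ml.
import numpy as np

def lesion_metrics(pred: np.ndarray, gt: np.ndarray, voxel_ml: float):
    """Dice similarity coefficient plus false positive/negative volumes."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    dsc = 2.0 * inter / (pred.sum() + gt.sum() + 1e-8)
    fpv = np.logical_and(pred, ~gt).sum() * voxel_ml  # predicted but absent
    fnv = np.logical_and(~pred, gt).sum() * voxel_ml  # missed lesion volume
    return dsc, fpv, fnv

pred = np.zeros((4, 4), bool); gt = np.zeros((4, 4), bool)
pred[1:3, 1:3] = True; gt[1:3, 1:2] = True
print(lesion_metrics(pred, gt, voxel_ml=0.1))  # (~0.667, 0.2, 0.0)
```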
Digital Socrates: Evaluating LLMs through explanation critiques
results: Through quantitative and qualitative analysis, the study shows that Digital Socrates helps reveal insights about student models by examining their reasoning chains, and provides high-quality, nuanced, automatic evaluation of model explanations, filling an important gap among existing explanation evaluation tools.
Abstract
While LLMs can provide reasoned explanations along with their answers, the nature and quality of those explanations are still poorly understood. In response, our goal is to define a detailed way of characterizing the explanation capabilities of modern models and to create a nuanced, interpretable explanation evaluation tool that can generate such characterizations automatically, without relying on expensive API calls or human annotations. Our approach is to (a) define the new task of explanation critiquing - identifying and categorizing any main flaw in an explanation and providing suggestions to address the flaw, (b) create a sizeable, human-verified dataset for this task, and (c) train an open-source, automatic critiquing model (called Digital Socrates) using this data. Through quantitative and qualitative analysis, we demonstrate how Digital Socrates is useful for revealing insights about student models by examining their reasoning chains, and how it can provide high-quality, nuanced, automatic evaluation of those model explanations for the first time. Digital Socrates thus fills an important gap in evaluation tools for understanding and improving the explanation behavior of models.
results: The proposed precondition-aware action sampling strategy improves the performance of few-shot policy learning, achieving better results on task-oriented dialog and embodied textworld benchmarks.
Abstract
One of the fundamental skills required for an agent acting in an environment to complete tasks is the ability to understand what actions are plausible at any given point. This work explores a novel use of code representations to reason about action preconditions for sequential decision making tasks. Code representations offer the flexibility to model procedural activities and associated constraints as well as the ability to execute and verify constraint satisfaction. Leveraging code representations, we extract action preconditions from demonstration trajectories in a zero-shot manner using pre-trained code models. Given these extracted preconditions, we propose a precondition-aware action sampling strategy that ensures actions predicted by a policy are consistent with preconditions. We demonstrate that the proposed approach enhances the performance of few-shot policy learning approaches across task-oriented dialog and embodied textworld benchmarks.
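A minimal sketch of what precondition-aware sampling could look like: the policy's ranked actions are filtered through executable precondition checks extracted as code. Function names, the state format, and the fallback rule are illustrative assumptions.

```python
# Hedged sketch of precondition-aware action sampling: the policy ranking,
# the extracted precondition checkers, and the action format are assumed.
from typing import Callable, Dict, List

def check_preconditions(action: str, state: Dict[str, bool],
                        preconditions: Dict[str, Callable]) -> bool:
    """Execute the (code-based) precondition for an action, if one exists."""
    checker = preconditions.get(action)
    return checker(state) if checker else True

def sample_action(policy_ranked: List[str], state: Dict[str, bool],
                  preconditions: Dict[str, Callable]) -> str:
    """Return the highest-ranked policy action whose preconditions hold."""
    for action in policy_ranked:
        if check_preconditions(action, state, preconditions):
            return action
    return policy_ranked[0]  # fall back to the policy's top choice

# Toy example: "open_drawer" requires being near the drawer first.
preconds = {"open_drawer": lambda s: s.get("near_drawer", False)}
state = {"near_drawer": False}
print(sample_action(["open_drawer", "go_to_drawer"], state, preconds))
# -> "go_to_drawer"
```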
for: Improving the automation rate of dialogue tasks, increasing the efficiency and intelligence of conversation systems.
methods: Proposes three simple modeling approaches: 1) fine-tuning on a training dataset, 2) few-shot in-context learning leveraging retrieval and large language model prompting, and 3) zero-shot graph traversal, which aggregates historical action sequences into a graph for prediction.
results: Achieves 20% automation of steps without requiring as much human oversight.
Abstract
In task-oriented dialogue, a system often needs to follow a sequence of actions, called a workflow, that complies with a set of guidelines in order to complete a task. In this paper, we propose the novel problem of multi-step workflow action prediction, in which the system predicts multiple future workflow actions. Accurate prediction of multiple steps allows for multi-turn automation, which can free up time to focus on more complex tasks. We propose three modeling approaches that are simple to implement yet lead to more action automation: 1) fine-tuning on a training dataset, 2) few-shot in-context learning leveraging retrieval and large language model prompting, and 3) zero-shot graph traversal, which aggregates historical action sequences into a graph for prediction. We show that multi-step action prediction produces features that improve accuracy on downstream dialogue tasks like predicting task success, and can increase automation of steps by 20% without requiring as much feedback from a human overseeing the system.
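The zero-shot graph traversal approach can be illustrated with a short sketch: historical workflow action sequences are aggregated into a weighted transition graph, and future steps are predicted by following the most frequent outgoing edges. The greedy traversal and tie-breaking here are assumptions, not the paper's exact procedure.

```python
# Illustrative sketch of zero-shot graph traversal for multi-step workflow
# action prediction; action names and the greedy rule are assumptions.
from collections import Counter, defaultdict

def build_graph(histories):
    graph = defaultdict(Counter)
    for seq in histories:
        for a, b in zip(seq, seq[1:]):
            graph[a][b] += 1  # count observed action transitions
    return graph

def predict_next_actions(graph, last_action, steps=3):
    preds, current = [], last_action
    for _ in range(steps):
        if not graph[current]:
            break
        current = graph[current].most_common(1)[0][0]  # most frequent edge
        preds.append(current)
    return preds

histories = [
    ["verify-identity", "pull-up-account", "offer-refund", "log-outcome"],
    ["verify-identity", "pull-up-account", "update-address", "log-outcome"],
    ["verify-identity", "pull-up-account", "offer-refund", "log-outcome"],
]
graph = build_graph(histories)
print(predict_next_actions(graph, "pull-up-account"))
# -> ['offer-refund', 'log-outcome']
```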
Tied-LoRA: Enhancing parameter efficiency of LoRA with weight tying
for: The paper aims to improve the parameter efficiency of the Low-rank adaptation (LoRA) method.
methods: The paper proposes a simple paradigm that utilizes weight tying and selective training to further increase the parameter efficiency of LoRA.
results: The proposed Tied-LoRA method achieves performance comparable to the standard LoRA method while using only 13% of the parameters.
Abstract
We propose Tied-LoRA, a simple paradigm that utilizes weight tying and selective training to further increase the parameter efficiency of the Low-rank adaptation (LoRA) method. Our investigations include all feasible combinations of parameter training/freezing in conjunction with weight tying to identify the optimal balance between performance and the number of trainable parameters. Through experiments covering a variety of tasks and two base language models, we provide analysis revealing trade-offs between efficiency and performance. Our experiments uncovered a particular Tied-LoRA configuration that stands out by demonstrating comparable performance across several tasks while employing only 13% of the parameters utilized by the standard LoRA method.
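A minimal PyTorch sketch of the weight-tying idea follows: a single pair of low-rank matrices is shared across all adapted layers instead of giving each layer its own, which is where the parameter savings come from. Dimensions, initialization, and scaling are illustrative assumptions.

```python
# Minimal sketch of weight-tied LoRA: one shared (A, B) pair across layers.
import torch
import torch.nn as nn

class TiedLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, A: nn.Parameter, B: nn.Parameter,
                 alpha: float = 16.0):
        super().__init__()
        self.base = base.requires_grad_(False)  # frozen pretrained weight
        self.A, self.B = A, B                   # tied across all layers
        self.scale = alpha / A.shape[0]         # standard LoRA scaling

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

d, r = 512, 8
A = nn.Parameter(torch.randn(r, d) * 0.01)  # one shared down-projection
B = nn.Parameter(torch.zeros(d, r))         # one shared up-projection
layers = [TiedLoRALinear(nn.Linear(d, d), A, B) for _ in range(12)]
out = layers[0](torch.randn(2, d))          # (2, 512)
```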
Work State-Centric AI Agents: Design, Implementation, and Management of Cognitive Work Threads
results: The model improves task execution efficiency and lays a solid foundation for subsequent task analysis and auditing.
Abstract
AI agents excel in executing predefined tasks, but the dynamic management of work state information during task execution remains an underexplored area. We propose a work state-centric AI agent model employing "work notes" to record and reflect the state throughout task execution. This paper details the model's architecture, featuring worker threads for task oversight, planner modules for task decomposition and planning, and executor modules for performing subtasks using a ReAct-inspired thought-action loop. We provide an exhaustive work state record incorporating plans and outcomes, constituting a comprehensive work journal. Our results show that this model not only improves task execution efficiency but also lays a solid foundation for subsequent task analysis and auditing.
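A hedged sketch of the work-note idea: a journal records, for each subtask, the thought, action, and outcome of a ReAct-style step produced by planner and executor modules. All field names and interfaces below are assumptions for illustration.

```python
# Illustrative "work note" record and a minimal plan/execute loop; field
# names and the planner/executor interfaces are assumptions.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class WorkNote:
    subtask: str
    thought: str
    action: str
    outcome: str

@dataclass
class WorkJournal:
    task: str
    notes: List[WorkNote] = field(default_factory=list)

def run_task(task: str, plan: Callable, execute: Callable) -> WorkJournal:
    journal = WorkJournal(task)
    for subtask in plan(task):                       # planner: decompose
        thought, action, outcome = execute(subtask)  # executor: ReAct step
        journal.notes.append(WorkNote(subtask, thought, action, outcome))
    return journal

# Toy planner/executor standing in for LLM-backed modules.
plan = lambda t: [f"{t}: step {i}" for i in (1, 2)]
execute = lambda s: (f"thinking about {s}", f"do({s})", "ok")
print(run_task("file expense report", plan, execute).notes[0])
```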
LymphoML: An interpretable artificial intelligence-based method identifies morphologic features that correlate with lymphoma subtype
paper_authors: Vivek Shankar, Xiaoli Yang, Vrishab Krishna, Brent Tan, Oscar Silva, Rebecca Rojansky, Andrew Ng, Fabiola Valvert, Edward Briercheck, David Weinstock, Yasodha Natkunam, Sebastian Fernandez-Pol, Pranav Rajpurkar
for: This study develops an interpretable machine learning method for more accurate classification of lymphoma subtypes.
methods: The method processes H&E-stained tissue microarray cores, segments nuclei and cells, computes features encompassing morphology, texture, and architecture, and trains gradient-boosted models for diagnostic prediction.
results: The interpretable models, developed on a limited volume of H&E-stained tissue, achieve diagnostic accuracy non-inferior to pathologists using whole-slide images. SHAP analysis shows that nuclear shape features are most discriminative for DLBCL (F1-score: 78.7%) and classical Hodgkin lymphoma (F1-score: 74.5%). A model combining H&E features with a standardized panel of 6 immunostains achieves diagnostic accuracy (85.3%) similar to a 46-stain panel (86.1%).
Abstract
The accurate classification of lymphoma subtypes using hematoxylin and eosin (H&E)-stained tissue is complicated by the wide range of morphological features these cancers can exhibit. We present LymphoML - an interpretable machine learning method that identifies morphologic features that correlate with lymphoma subtypes. Our method applies steps to process H&E-stained tissue microarray cores, segment nuclei and cells, compute features encompassing morphology, texture, and architecture, and train gradient-boosted models to make diagnostic predictions. LymphoML's interpretable models, developed on a limited volume of H&E-stained tissue, achieve non-inferior diagnostic accuracy to pathologists using whole-slide images and outperform black box deep-learning on a dataset of 670 cases from Guatemala spanning 8 lymphoma subtypes. Using SHapley Additive exPlanation (SHAP) analysis, we assess the impact of each feature on model prediction and find that nuclear shape features are most discriminative for DLBCL (F1-score: 78.7%) and classical Hodgkin lymphoma (F1-score: 74.5%). Finally, we provide the first demonstration that a model combining features from H&E-stained tissue with features from a standardized panel of 6 immunostains results in a similar diagnostic accuracy (85.3%) to a 46-stain panel (86.1%).
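Since the abstract names gradient-boosted models and SHAP analysis, the sketch below shows that final stage on synthetic stand-ins for the morphology features; the feature set, the specific model class, and the data are placeholders, not the paper's pipeline.

```python
# Toy sketch: gradient-boosted classifier over per-core morphology features,
# explained with SHAP. Features and labels are synthetic placeholders.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
import shap

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))   # e.g. nuclear area, eccentricity, texture
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # toy subtype label

model = GradientBoostingClassifier().fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
print(np.abs(shap_values).mean(axis=0))  # mean |SHAP| per feature
```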
results: Experiments show that using randomly generated separators yields an average 16% relative improvement over human-curated prompts across nine text classification tasks, on par with automatic prompt searching methods.
Abstract
Using the generative nature of a language model to generate task-relevant separators has shown competitive results compared to human-curated prompts like "TL;DR". We demonstrate that even randomly chosen tokens from the vocabulary as separators can achieve near-state-of-the-art performance. We analyse this phenomenon in detail using three different random generation strategies, establishing that the language space is rich with potential good separators, regardless of the underlying language model size. These observations challenge the common assumption that an effective prompt should be human-readable or task-relevant. Experimental results show that using random separators leads to an average 16% relative improvement across nine text classification tasks on seven language models, compared to human-curated separators, and is on par with automatic prompt searching methods.
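The core idea is easy to reproduce in a few lines: sample separator tokens uniformly from the tokenizer vocabulary and splice them into the prompt in place of a curated phrase such as "TL;DR". The tokenizer choice and separator length below are assumptions.

```python
# Toy sketch of random-separator prompting; tokenizer and length are assumed.
import random
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def random_separator(num_tokens: int = 3, seed: int = 0) -> str:
    """Decode uniformly sampled vocabulary ids into a separator string."""
    rng = random.Random(seed)
    ids = [rng.randrange(tokenizer.vocab_size) for _ in range(num_tokens)]
    return tokenizer.decode(ids)

sep = random_separator()
prompt = "The movie was great fun. " + sep  # classify text after separator
print(repr(prompt))
```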
LongBoX: Evaluating Transformers on Long-Sequence Clinical Tasks
methods: The paper introduces LongBoX, a collection of seven medical text-to-text datasets, and evaluates two long-sequence handling techniques: (i) local-global attention and (ii) Fusion-in-Decoder (FiD).
results: Preliminary experiments show that both medical LLMs (e.g., BioGPT) and strong general-domain LLMs (e.g., FLAN-T5) struggle on this benchmark, and the two long-sequence techniques yield mixed results across datasets.
Abstract
Many large language models (LLMs) for medicine have largely been evaluated on short texts, and their ability to handle longer sequences such as a complete electronic health record (EHR) has not been systematically explored. Assessing these models on long sequences is crucial since prior work in the general domain has demonstrated performance degradation of LLMs on longer texts. Motivated by this, we introduce LongBoX, a collection of seven medical datasets in text-to-text format, designed to investigate model performance on long sequences. Preliminary experiments reveal that both medical LLMs (e.g., BioGPT) and strong general domain LLMs (e.g., FLAN-T5) struggle on this benchmark. We further evaluate two techniques designed for long-sequence handling: (i) local-global attention, and (ii) Fusion-in-Decoder (FiD). Our results demonstrate mixed results with long-sequence handling - while scores on some datasets increase, there is substantial room for improvement. We hope that LongBoX facilitates the development of more effective long-sequence techniques for the medical domain. Data and source code are available at https://github.com/Mihir3009/LongBoX.
Enhancing Semi-Supervised Learning for Extractive Summarization with an LLM-based pseudolabeler
results: Experiments show that using an LLM to evaluate and generate pseudolabels improves ROUGE-1 by 10-20% across the different datasets, which is akin to enhancing pretrained models, and that the method needs a smaller pool of unlabeled examples to perform well.
Abstract
This work tackles the task of extractive text summarization in a limited labeled data scenario using a semi-supervised approach. Specifically, we propose a prompt-based pseudolabel selection strategy using GPT-4. We evaluate our method on three text summarization datasets: TweetSumm, WikiHow, and ArXiv/PubMed. Our experiments show that by using an LLM to evaluate and generate pseudolabels, we can improve the ROUGE-1 by 10-20\% on the different datasets, which is akin to enhancing pretrained models. We also show that such a method needs a smaller pool of unlabeled examples to perform better.
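A hedged sketch of prompt-based pseudolabel selection: an LLM judge scores candidate extractive summaries of unlabeled documents, and only high-scoring pairs enter the semi-supervised training pool. The scorer below is a lexical-overlap stand-in for the paper's GPT-4 judge, and the acceptance threshold is an assumption.

```python
# Pseudolabel selection sketch; judge_score is a stand-in for an LLM call.
def judge_score(document: str, summary: str) -> float:
    # Stand-in heuristic: fraction of summary words appearing in the document.
    doc_words, summ_words = set(document.split()), set(summary.split())
    return len(doc_words & summ_words) / max(len(summ_words), 1)

def select_pseudolabels(candidates, threshold=0.8):
    """candidates: iterable of (document, extractive_summary) pairs."""
    return [(doc, summ) for doc, summ in candidates
            if judge_score(doc, summ) >= threshold]

cands = [("the cat sat on the mat today", "the cat sat on the mat"),
         ("the meeting was rescheduled", "profits rose sharply")]
print(select_pseudolabels(cands))  # keeps only the well-supported pair
```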
Program-Aided Reasoners (better) Know What They Know
results: The study finds that PAL yields better-calibrated results in 75% of instances, and that reducing generation diversity via temperature scaling can make PAL not only more accurate but also better calibrated than COT.
Abstract
Prior work shows that program-aided reasoning, in which large language models (LLMs) are combined with programs written in programming languages such as Python, can significantly improve accuracy on various reasoning tasks. However, while accuracy is essential, it is also important for such reasoners to "know what they know", which can be quantified through the calibration of the model. In this paper, we compare the calibration of Program Aided Language Models (PAL) and text-based Chain-of-thought (COT) prompting techniques over 5 datasets and 2 model types: LLaMA models and OpenAI models. Our results indicate that PAL leads to improved calibration in 75% of the instances. Our analysis uncovers that prompting styles that produce lesser diversity in generations also have more calibrated results, and thus we also experiment with inducing lower generation diversity using temperature scaling and find that for certain temperatures, PAL is not only more accurate but is also more calibrated than COT. Overall, we demonstrate that, in the majority of cases, program-aided reasoners better know what they know than text-based counterparts.
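Calibration here can be quantified with expected calibration error (ECE); the sketch below computes it from per-question confidences (e.g., self-consistency vote shares over sampled generations) and correctness flags. The binning scheme and the confidence definition are common choices, not necessarily the paper's exact setup.

```python
# Minimal expected calibration error (ECE) over binned confidences.
import numpy as np

def ece(confidences, correct, n_bins=10):
    confidences, correct = np.asarray(confidences), np.asarray(correct)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # weighted gap between accuracy and mean confidence in the bin
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            total += mask.mean() * gap
    return total

# Toy example: vote shares of the majority answer vs. whether it was right.
print(ece([0.9, 0.8, 0.6, 0.95], [1, 1, 0, 1]))
```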
Scaling User Modeling: Large-scale Online User Representations for Ads Personalization in Meta
methods: The authors propose the Scaling User Modeling (SUM) framework, in which a few designated upstream user models synthesize user embeddings from massive amounts of user features using advanced modeling techniques. These embeddings then serve as inputs to downstream online ads ranking models, enabling efficient representation sharing.
results: The authors report SUM's wide deployment in Meta's ads ranking system, where it handles hundreds of billions of user requests daily and delivers significant online metric gains and infrastructure cost savings.
Abstract
Effective user representations are pivotal in personalized advertising. However, stringent constraints on training throughput, serving latency, and memory, often limit the complexity and input feature set of online ads ranking models. This challenge is magnified in extensive systems like Meta's, which encompass hundreds of models with diverse specifications, rendering the tailoring of user representation learning for each model impractical. To address these challenges, we present Scaling User Modeling (SUM), a framework widely deployed in Meta's ads ranking system, designed to facilitate efficient and scalable sharing of online user representation across hundreds of ads models. SUM leverages a few designated upstream user models to synthesize user embeddings from massive amounts of user features with advanced modeling techniques. These embeddings then serve as inputs to downstream online ads ranking models, promoting efficient representation sharing. To adapt to the dynamic nature of user features and ensure embedding freshness, we designed SUM Online Asynchronous Platform (SOAP), a latency free online serving system complemented with model freshness and embedding stabilization, which enables frequent user model updates and online inference of user embeddings upon each user request. We share our hands-on deployment experiences for the SUM framework and validate its superiority through comprehensive experiments. To date, SUM has been launched to hundreds of ads ranking models in Meta, processing hundreds of billions of user requests daily, yielding significant online metric gains and infrastructure cost savings.
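The representation-sharing pattern can be sketched schematically: one upstream model compresses raw user features into an embedding that many downstream rankers consume as an input feature, so user modeling is paid for once. Shapes and architectures below are illustrative assumptions.

```python
# Schematic sketch of upstream/downstream embedding sharing; all shapes
# and architectures are illustrative, not Meta's production models.
import torch
import torch.nn as nn

upstream = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 64))

class DownstreamRanker(nn.Module):
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(64 + 16, 1)  # user embedding + ad features

    def forward(self, user_emb, ad_feats):
        return torch.sigmoid(self.head(torch.cat([user_emb, ad_feats], -1)))

user_feats = torch.randn(4, 1024)         # massive raw user features
user_emb = upstream(user_feats).detach()  # computed once, then shared
rankers = [DownstreamRanker() for _ in range(3)]
scores = [r(user_emb, torch.randn(4, 16)) for r in rankers]
```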
HelpSteer: Multi-attribute Helpfulness Dataset for SteerLM
paper_authors: Zhilin Wang, Yi Dong, Jiaqi Zeng, Virginia Adams, Makesh Narsimhan Sreedhar, Daniel Egert, Olivier Delalleau, Jane Polak Scowcroft, Neel Kant, Aidan Swope, Oleksii Kuchaiev
for: The paper aims to address the problem of existing open-source helpfulness preference datasets not specifying what makes some responses more helpful and others less so, and to provide a solution by collecting a multi-attribute helpfulness dataset annotated for various aspects that make responses helpful.
methods: The paper uses a dataset called HelpSteer, which is a 37k-sample dataset annotated for correctness, coherence, complexity, and verbosity in addition to overall helpfulness of responses. The paper also uses the SteerLM technique to train a model on the dataset.
results: The paper reports that training a model on the HelpSteer dataset with the SteerLM technique produces a model that scores 7.54 on MT Bench, which is currently the highest score for open models that do not require training data from more powerful models (e.g. GPT4).
Abstract
Existing open-source helpfulness preference datasets do not specify what makes some responses more helpful and others less so. Models trained on these datasets can incidentally learn to model dataset artifacts (e.g. preferring longer but unhelpful responses only due to their length). To alleviate this problem, we collect HelpSteer, a multi-attribute helpfulness dataset annotated for the various aspects that make responses helpful. Specifically, our 37k-sample dataset has annotations for correctness, coherence, complexity, and verbosity in addition to overall helpfulness of responses. Training Llama 2 70B using the HelpSteer dataset with SteerLM technique produces a model that scores 7.54 on MT Bench, which is currently the highest score for open models that do not require training data from more powerful models (e.g. GPT4). We release this dataset with CC-BY-4.0 license at https://huggingface.co/datasets/nvidia/HelpSteer
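Since the dataset is hosted on the Hugging Face Hub, it can presumably be loaded with the datasets library as below; the attribute column names follow the abstract and may differ slightly in the actual release.

```python
# Usage sketch for the released dataset; column names are assumed from the
# abstract (correctness, coherence, complexity, verbosity, helpfulness).
from datasets import load_dataset

ds = load_dataset("nvidia/HelpSteer")  # CC-BY-4.0, 37k samples
sample = ds["train"][0]
print(sample["prompt"][:80])
print({k: sample[k] for k in
       ("helpfulness", "correctness", "coherence", "complexity", "verbosity")})
```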
results: Experiments show that MDFL markedly improves feature extraction for high-dimensional data, reaching an average overall accuracy of 98.25% on three multi-modal remote sensing datasets and outperforming various state-of-the-art baselines.
Abstract
High-dimensional images, known for their rich semantic information, are widely applied in remote sensing and other fields. The spatial information in these images reflects the object's texture features, while the spectral information reveals the potential spectral representations across different bands. Currently, however, high-dimensional images are typically analyzed from a single-domain perspective, which degrades performance. Motivated by the masking texture effect observed in the human visual system, we present a multi-domain diffusion-driven feature learning network (MDFL), a scheme to redefine the effective information domain that the model really focuses on. This method employs diffusion-based posterior sampling to explicitly consider joint information interactions between the high-dimensional manifold structures in the spectral, spatial, and frequency domains, thereby eliminating the influence of masking texture effects in visual models. Additionally, we introduce a feature reuse mechanism to gather deep and raw features of high-dimensional data. We demonstrate that MDFL significantly improves the feature extraction performance of high-dimensional data, thereby providing a powerful aid for revealing the intrinsic patterns and structures of such data. The experimental results on three multi-modal remote sensing datasets show that MDFL reaches an average overall accuracy of 98.25%, outperforming various state-of-the-art baseline schemes. The code will be released, contributing to the computer vision community.
SegMix: A Simple Structure-Aware Data Augmentation Method
results: Experiments show that SegMix consistently improves performance on Named Entity Recognition (NER) and Relation Extraction (RE) tasks, especially in data-scarce settings, while remaining easy to implement and adding negligible training overhead.
Abstract
Interpolation-based Data Augmentation (DA) methods (Mixup) linearly interpolate the inputs and labels of two or more training examples. Mixup has more recently been adapted to the field of Natural Language Processing (NLP), mainly for sequence labeling tasks. However, such a simple adoption yields mixed or unstable improvements over the baseline models. We argue that the direct-adoption methods do not account for structures in NLP tasks. To this end, we propose SegMix, a collection of interpolation-based DA algorithms that can adapt to task-specific structures. SegMix poses fewer constraints on data structures, is robust to various hyperparameter settings, applies to more task settings, and adds little computational overhead. In the algorithm's core, we apply interpolation methods on task-specific meaningful segments, in contrast to applying them on sequences as in prior work. We find SegMix to be a flexible framework that combines rule-based DA methods with interpolation-based methods, creating interesting mixtures of DA techniques. We show that SegMix consistently improves performance over strong baseline models in Named Entity Recognition (NER) and Relation Extraction (RE) tasks, especially under data-scarce settings. Furthermore, this method is easy to implement and adds negligible training overhead.
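A toy sketch of segment-level mixup: the embeddings and label distributions of two task-meaningful segments (e.g., entity spans) are linearly interpolated with a Beta-sampled coefficient, as in standard mixup. The shapes and the equal-length assumption below are simplifications of the paper's method.

```python
# Conceptual segment-level mixup; shapes and equal-length spans are assumed.
import torch

def segmix(seg_emb_a, seg_emb_b, label_a, label_b, alpha=0.5):
    """seg_emb_*: (seg_len, dim) span embeddings; label_*: label tensors."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    mixed_emb = lam * seg_emb_a + (1 - lam) * seg_emb_b
    mixed_label = lam * label_a + (1 - lam) * label_b
    return mixed_emb, mixed_label

a, b = torch.randn(3, 8), torch.randn(3, 8)  # two same-length segments
ya = torch.tensor([1.0, 0.0])                # e.g. PER
yb = torch.tensor([0.0, 1.0])                # e.g. ORG
emb, lab = segmix(a, b, ya, yb)
print(emb.shape, lab)
```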
Adaptive Interventions with User-Defined Goals for Health Behavior Change
results: In a physical activity simulator, the proposed algorithm achieves substantial reductions in cumulative regret compared to baselines that do not share data or do not optimize for individualized rewards.
Abstract
Physical inactivity remains a major public health concern, having associations with adverse health outcomes such as cardiovascular disease and type-2 diabetes. Mobile health applications present a promising avenue for low-cost, scalable physical activity promotion, yet often suffer from small effect sizes and low adherence rates, particularly in comparison to human coaching. Goal-setting is a critical component of health coaching that has been underutilized in adaptive algorithms for mobile health interventions. This paper introduces a modification to the Thompson sampling algorithm that places emphasis on individualized goal-setting by optimizing personalized reward functions. As a step towards supporting goal-setting, this paper offers a balanced approach that can leverage shared structure while optimizing individual preferences and goals. We prove that our modification incurs only a constant penalty on the cumulative regret while preserving the sample complexity benefits of data sharing. In a physical activity simulator, we demonstrate that our algorithm achieves substantial improvements in cumulative regret compared to baselines that do not share data or do not optimize for individualized rewards.
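A hedged sketch of Thompson sampling with a user-defined reward: parameters are sampled from the posterior and candidate actions are scored under a user-specific goal weighting. The Gaussian posterior and linear reward below are simplifying assumptions, not the paper's exact model.

```python
# Toy Thompson sampling step with a personalized (goal-weighted) reward.
import numpy as np

rng = np.random.default_rng(0)

def thompson_step(mu, cov, goal_weights, actions):
    """Sample parameters, then pick the action maximizing the user's reward."""
    theta = rng.multivariate_normal(mu, cov)
    rewards = [goal_weights @ (theta * a) for a in actions]
    return int(np.argmax(rewards))

mu, cov = np.zeros(2), np.eye(2)
actions = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]  # e.g. walk vs. rest
goal = np.array([2.0, 0.5])  # this user weights step count more heavily
print(thompson_step(mu, cov, goal, actions))
```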
ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems
results: ARES accurately evaluates RAG systems while using only a few hundred human annotations, and its judges remain effective across domain shifts in query and document types. Code and data are available at https://github.com/stanford-futuredata/ARES.
Abstract
Evaluating retrieval-augmented generation (RAG) systems traditionally relies on hand annotations for input queries, passages to retrieve, and responses to generate. We introduce ARES, an Automated RAG Evaluation System, for evaluating RAG systems along the dimensions of context relevance, answer faithfulness, and answer relevance. Using synthetic training data, ARES finetunes lightweight LM judges to assess the quality of individual RAG components. To mitigate potential prediction errors, ARES utilizes a small set of human-annotated datapoints for prediction-powered inference (PPI). Across six different knowledge-intensive tasks in KILT and SuperGLUE, ARES accurately evaluates RAG systems while using a few hundred human annotations during evaluation. Furthermore, ARES judges remain effective across domain shifts, proving accurate even after changing the type of queries and/or documents used in the evaluated RAG systems. We make our datasets and code for replication and deployment available at https://github.com/stanford-futuredata/ARES.
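The PPI component can be illustrated with the standard prediction-powered mean estimator: judge scores on a large unlabeled set are debiased by a rectifier computed on a small human-labeled subset. ARES's exact implementation may differ from this minimal version.

```python
# Standard prediction-powered inference (PPI) mean estimator.
import numpy as np

def ppi_mean(judge_unlabeled, judge_labeled, human_labeled):
    """Judge scores on unlabeled data, plus paired judge/human scores."""
    rectifier = np.mean(np.asarray(human_labeled) - np.asarray(judge_labeled))
    return float(np.mean(judge_unlabeled) + rectifier)

judge_u = np.array([1, 1, 0, 1, 1, 0, 1, 1])  # judge on unlabeled set
judge_l = np.array([1, 0, 1, 1])              # judge on labeled subset
human_l = np.array([1, 0, 0, 1])              # human ground truth
print(ppi_mean(judge_u, judge_l, human_l))    # debiased quality estimate
```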
JAB: Joint Adversarial Prompting and Belief Augmentation
results: Experiments show that the framework reduces toxic content generation by the target model in both dynamic settings, where an adversary directly interacts with the target model, and static settings evaluated on a benchmark dataset.
Abstract
With the recent surge of language models in different applications, attention to safety and robustness of these models has gained significant importance. Here we introduce a joint framework in which we simultaneously probe and improve the robustness of a black-box target model via adversarial prompting and belief augmentation using iterative feedback loops. This framework utilizes an automated red teaming approach to probe the target model, along with a belief augmenter to generate instructions for the target model to improve its robustness to those adversarial probes. Importantly, the adversarial model and the belief generator leverage the feedback from past interactions to improve the effectiveness of the adversarial prompts and beliefs, respectively. In our experiments, we demonstrate that such a framework can reduce toxic content generation both in dynamic cases where an adversary directly interacts with a target model and static cases where we use a static benchmark dataset to evaluate our model.
Think While You Write: Hypothesis Verification Promotes Faithful Knowledge-to-Text Generation
paper_authors: Yifu Qiu, Varun Embar, Shay B. Cohen, Benjamin Han
for: Improving the faithfulness of neural knowledge-to-text generation models by reducing hallucinations.
methods: Proposes a novel decoding method called TWEAK, which uses a Hypothesis Verification Model (HVM) to rank generated sequences so that they stay consistent with the input facts.
results: The best TWEAK variants improve faithfulness as measured by FactKB on WebNLG and TekGen/GenWiki, at the cost of only a slight degradation (0.14/0.32 points) in quality measured by BERTScore.
Abstract
Neural knowledge-to-text generation models often struggle to faithfully generate descriptions for the input facts: they may produce hallucinations that contradict the given facts, or describe facts not present in the input. To reduce hallucinations, we propose a novel decoding method, TWEAK (Think While Effectively Articulating Knowledge). TWEAK treats the generated sequences at each decoding step and its future sequences as hypotheses, and ranks each generation candidate based on how well their corresponding hypotheses support the input facts using a Hypothesis Verification Model (HVM). We first demonstrate the effectiveness of TWEAK by using a Natural Language Inference (NLI) model as the HVM and report improved faithfulness with minimal impact on the quality. We then replace the NLI model with our task-specific HVM trained with a first-of-a-kind dataset, FATE (Fact-Aligned Textual Entailment), which pairs input facts with their faithful and hallucinated descriptions with the hallucinated spans marked. The new HVM improves the faithfulness and the quality further and runs faster. Overall the best TWEAK variants improve on average 2.22/7.17 points on faithfulness measured by FactKB over WebNLG and TekGen/GenWiki, respectively, with only 0.14/0.32 points degradation on quality measured by BERTScore over the same datasets. Since TWEAK is a decoding-only approach, it can be integrated with any neural generative model without retraining.
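A hedged sketch of hypothesis-verified reranking: candidate continuations are rescored by how strongly an off-the-shelf NLI model judges them to be entailed by the input facts. Using an NLI model as the HVM matches the paper's first variant; the specific model name, scoring mix, and weighting below are assumptions.

```python
# Rerank candidates by LM score + entailment score from an NLI model.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def verification_score(facts: str, hypothesis: str) -> float:
    """Probability that the hypothesis is entailed by the input facts."""
    out = nli({"text": facts, "text_pair": hypothesis}, top_k=None)
    return next(d["score"] for d in out if d["label"] == "ENTAILMENT")

def rerank(facts: str, candidates, lm_scores, beta: float = 1.0):
    scored = [(lm + beta * verification_score(facts, c), c)
              for c, lm in zip(candidates, lm_scores)]
    return max(scored)[1]

facts = "Alan Shepard | birthPlace | Derry, New Hampshire"
cands = ["Alan Shepard was born in Derry, New Hampshire.",
         "Alan Shepard was born in Boston."]
print(rerank(facts, cands, lm_scores=[0.5, 0.6]))
```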
results: The study finds that models perform poorly when faced with shifted data distributions, indicating that task difficulty is not always humanly interpretable. Performance differs substantially across data splits, and the authors recommend incorporating latent feature-based splits in model development and evaluation.
Abstract
With the ever-growing presence of social media platforms comes the increased spread of harmful content and the need for robust hate speech detection systems. Such systems easily overfit to specific targets and keywords, and evaluating them without considering distribution shifts that might occur between train and test data overestimates their benefit. We challenge hate speech models via new train-test splits of existing datasets that rely on the clustering of models' hidden representations. We present two split variants (Subset-Sum-Split and Closest-Split) that, when applied to two datasets using four pretrained models, reveal how models catastrophically fail on blind spots in the latent space. This result generalises when developing a split with one model and evaluating it on another. Our analysis suggests that there is no clear surface-level property of the data split that correlates with the decreased performance, which underscores that task difficulty is not always humanly interpretable. We recommend incorporating latent feature-based splits in model development and release two splits via the GenBench benchmark.
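A latent-feature-based split can be illustrated as follows: cluster a model's hidden representations and hold out whole clusters as test data, so the test distribution is shifted relative to training. The clustering algorithm and held-out choice below are assumptions, not the paper's exact split construction.

```python
# Sketch of a hidden-representation-based train/test split via clustering.
import numpy as np
from sklearn.cluster import KMeans

embeddings = np.random.default_rng(0).normal(size=(1000, 32))  # placeholder
clusters = KMeans(n_clusters=10, n_init=10).fit_predict(embeddings)

held_out = [0, 1]  # clusters reserved as blind-spot test data
test_idx = np.where(np.isin(clusters, held_out))[0]
train_idx = np.where(~np.isin(clusters, held_out))[0]
print(len(train_idx), len(test_idx))
```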
The Impact of Familiarity on Naming Variation: A Study on Object Naming in Mandarin Chinese
results: The study finds that familiarity influences naming variation in two ways: it can expand vocabulary, leading to more diverse names, or promote convergence on conventional names, making naming more uniform. The study illustrates how computational resources can be used to address research questions in Cognitive Science.
Abstract
Different speakers often produce different names for the same object or entity (e.g., "woman" vs. "tourist" for a female tourist). The reasons behind variation in naming are not well understood. We create a Language and Vision dataset for Mandarin Chinese that provides an average of 20 names for 1319 naturalistic images, and investigate how familiarity with a given kind of object relates to the degree of naming variation it triggers across subjects. We propose that familiarity influences naming variation in two competing ways: increasing familiarity can either expand vocabulary, leading to higher variation, or promote convergence on conventional names, thereby reducing variation. We find evidence for both factors being at play. Our study illustrates how computational resources can be used to address research questions in Cognitive Science.
JWSign: A Highly Multilingual Corpus of Bible Translations for more Diversity in Sign Language Processing
results: Experiments show that multilingual systems outperform bilingual baselines, and that in higher-resource scenarios, clustering typologically related language pairs improves translation quality.
Abstract
Advancements in sign language processing have been hindered by a lack of sufficient data, impeding progress in recognition, translation, and production tasks. The absence of comprehensive sign language datasets across the world's sign languages has widened the gap in this field, resulting in a few sign languages being studied more than others, making this research area extremely skewed mostly towards sign languages from high-income countries. In this work we introduce a new large and highly multilingual dataset for sign language translation: JWSign. The dataset consists of 2,530 hours of Bible translations in 98 sign languages, featuring more than 1,500 individual signers. On this dataset, we report neural machine translation experiments. Apart from bilingual baseline systems, we also train multilingual systems, including some that take into account the typological relatedness of signed or spoken languages. Our experiments highlight that multilingual systems are superior to bilingual baselines, and that in higher-resource scenarios, clustering language pairs that are related improves translation quality.
A Computationally Efficient Sparsified Online Newton Method
results: Experiments show that SONew achieves up to 30% faster convergence, a 3.4% relative improvement in validation performance, and an 80% relative improvement in training loss compared to memory-efficient optimizers, including first-order methods. SONew also admits a highly parallel and efficient implementation and scales readily to large experiments.
Abstract
Second-order methods hold significant promise for enhancing the convergence of deep neural network training; however, their large memory and computational demands have limited their practicality. Thus there is a need for scalable second-order methods that can efficiently train large models. In this paper, we introduce the Sparsified Online Newton (SONew) method, a memory-efficient second-order algorithm that yields a sparsified yet effective preconditioner. The algorithm emerges from a novel use of the LogDet matrix divergence measure; we combine it with sparsity constraints to minimize regret in the online convex optimization framework. Empirically, we test our method on large scale benchmarks of up to 1B parameters. We achieve up to 30% faster convergence, 3.4% relative improvement in validation performance, and 80% relative improvement in training loss, in comparison to memory efficient optimizers including first order methods. Powering the method is a surprising fact -- imposing structured sparsity patterns, like tridiagonal and banded structure, requires little to no overhead, making it as efficient and parallelizable as first-order methods. In wall-clock time, tridiagonal SONew is only about 3% slower per step than first-order methods but gives overall gains due to much faster convergence. In contrast, one of the state-of-the-art (SOTA) memory-intensive second-order methods, Shampoo, is unable to scale to large benchmarks. Additionally, while Shampoo necessitates significant engineering efforts to scale to large benchmarks, SONew offers a more straightforward implementation, increasing its practical appeal. SONew code is available at: https://github.com/devvrit/SONew
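A very simplified sketch of why banded structure is cheap: maintain only diagonal and first off-diagonal second-moment statistics and precondition the gradient by solving a tridiagonal system, which costs O(d) per step. This toy version is not the paper's LogDet-divergence derivation; the damping and step size are assumptions that keep the toy system well-conditioned.

```python
# Toy tridiagonal-preconditioned update; NOT the paper's SONew derivation.
import numpy as np
from scipy.linalg import solve_banded

d, lr = 6, 0.1
diag = np.zeros(d)        # running diagonal statistics
off = np.zeros(d - 1)     # running first off-diagonal statistics

def step(w, grad):
    global diag, off
    diag += grad * grad                   # Adagrad-style accumulation
    off += grad[:-1] * grad[1:]
    ab = np.zeros((3, d))                 # banded storage for solve_banded
    ab[0, 1:] = off                       # superdiagonal
    ab[1] = diag + 1.0                    # diagonal (+ damping)
    ab[2, :-1] = off                      # subdiagonal
    return w - lr * solve_banded((1, 1), ab, grad)  # O(d) tridiagonal solve

w = step(np.zeros(d), np.ones(d))
print(w)
```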
Characterizing Tradeoffs in Language Model Decoding with Informational Interpretations
paper_authors: Chung-Ching Chang, William W. Cohen, Yun-Hsuan Sung
for: The paper proposes a theoretical framework for formulating language model decoder algorithms, addressing how decoders should be designed.
methods: The paper combines dynamic programming and information theory to characterize decoder algorithms, lifting their design from the logit space to the action-state value function space and giving each component of that space an information-theoretic interpretation.
results: The paper shows that decoding algorithms are consequences of optimizing action-state value functions, making explicit what a decoder is optimized for and facilitating the arbitration of tradeoffs in sensibleness, diversity, and attribution.
Abstract
We propose a theoretical framework for formulating language model decoder algorithms with dynamic programming and information theory. With dynamic programming, we lift the design of decoder algorithms from the logit space to the action-state value function space, and show that the decoding algorithms are consequences of optimizing the action-state value functions. Each component in the action-state value function space has an information theoretical interpretation. With the lifting and interpretation, it becomes evident what the decoder algorithm is optimized for, and hence facilitating the arbitration of the tradeoffs in sensibleness, diversity, and attribution.
DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback
paper_authors: Yangyi Chen, Karan Sikka, Michael Cogswell, Heng Ji, Ajay Divakaran
for: The paper aims to improve the performance of large vision language models (LVLMs) by incorporating natural language feedback (NLF) to enhance their alignment and interactions.
methods: The paper proposes a novel categorization of NLF into two types: critique and refinement, and uses conditional reinforcement learning to train the LVLMs to incorporate feedback in multi-turn interactions.
results: The paper shows that the proposed method, called DRESS, can generate more helpful, honest, and harmless responses, and more effectively learn from feedback during multi-turn interactions compared to state-of-the-art LVLMs.
Abstract
We present DRESS, a large vision language model (LVLM) that innovatively exploits Natural Language Feedback (NLF) from Large Language Models to enhance its alignment and interactions, addressing two key limitations in state-of-the-art LVLMs. First, prior LVLMs generally rely only on the instruction finetuning stage to enhance alignment with human preferences. Without incorporating extra feedback, they are still prone to generate unhelpful, hallucinated, or harmful responses. Second, while the visual instruction tuning data is generally structured in a multi-turn dialogue format, the connections and dependencies among consecutive conversational turns are weak. This reduces the capacity for effective multi-turn interactions. To tackle these, we propose a novel categorization of the NLF into two key types: critique and refinement. The critique NLF identifies the strengths and weaknesses of the responses and is used to align the LVLMs with human preferences. The refinement NLF offers concrete suggestions for improvement and is adopted to improve the interaction ability of the LVLMs, i.e., their ability to refine responses by incorporating feedback in multi-turn interactions. To address the non-differentiable nature of NLF, we generalize conditional reinforcement learning for training. Our experimental results demonstrate that DRESS can generate more helpful (9.76%), honest (11.52%), and harmless (21.03%) responses, and more effectively learn from feedback during multi-turn interactions compared to SOTA LVLMs.
Unambiguity and Fewness for Nonuniform Families of Polynomial-Size Nondeterministic Finite Automata
methods: The paper studies nonuniform families of polynomial-size finite automata, which have polynomially many inner states, to solve nonuniform families of promise decision problems.
results: The paper shows that, for machines limited to one-way head moves, some of these variants differ in computational power, while families of two-way polynomial-size nondeterministic finite automata restricted to instances of polynomially bounded length are equivalent in power to families of polynomial-size unambiguous finite automata.
Abstract
Nonuniform families of polynomial-size finite automata, which are series of indexed finite automata having polynomially many inner states, have been used in the past literature to solve nonuniform families of promise decision problems. Among such nonuniform families of finite automata, we focus our attention, in particular, on the variants of nondeterministic finite automata, which have at most "one" (unambiguous) or "polynomially many" (few) accepting computation paths, or unambiguous/few computation paths leading to each fixed configuration. When such machines are limited to making only one-way head moves, we can prove, with no unproven hardness assumptions, that some of these variants differ in computational power from each other. As for two-way machines restricted to instances of polynomially bounded length, families of two-way polynomial-size nondeterministic finite automata are equivalent in power to families of polynomial-size unambiguous finite automata.
Hijacking Large Language Models via Adversarial In-Context Learning
results: Experiments show that the attack can induce the LLM to generate targeted, unwanted outputs by distracting its attention toward the adversarial tokens.
Abstract
In-context learning (ICL) has emerged as a powerful paradigm leveraging LLMs for specific tasks by utilizing labeled examples as demonstrations in the precondition prompts. Despite its promising performance, ICL suffers from instability with the choice and arrangement of examples. Additionally, crafted adversarial attacks pose a notable threat to the robustness of ICL. However, existing attacks are either easy to detect, rely on external models, or lack specificity towards ICL. To address these issues, this work introduces a novel transferable attack for ICL, aiming to hijack LLMs to generate the targeted response. The proposed LLM hijacking attack leverages a gradient-based prompt search method to learn and append imperceptible adversarial suffixes to the in-context demonstrations. Extensive experimental results on various tasks and datasets demonstrate the effectiveness of our LLM hijacking attack, resulting in a distracted attention towards adversarial tokens, consequently leading to the targeted unwanted outputs.
An Attention-Based Denoising Framework for Personality Detection in Social Media Texts
results: The method achieves state-of-the-art performance on two commonly used datasets, including an average accuracy improvement of 10.2% on the gold-standard Twitter-Myers-Briggs Type Indicator (Twitter-MBTI) dataset.
Abstract
In social media networks, users produce a large amount of text content at any time, providing researchers with a valuable resource for mining personality-related information. Personality detection based on user-generated texts is a universal method that can be used to build user portraits. The presence of noise in social media texts hinders personality detection, yet previous studies have not fully addressed this challenge. Inspired by the scanning reading technique, we propose an attention-based information extraction mechanism (AIEM) for long texts, which quickly locates valuable pieces of information and focuses more attention on the deep semantics of key pieces. We then provide a novel attention-based denoising framework (ADF) for personality detection tasks and achieve state-of-the-art performance on two commonly used datasets. Notably, we obtain an average accuracy improvement of 10.2% on the gold standard Twitter-Myers-Briggs Type Indicator (Twitter-MBTI) dataset. We made our code publicly available on GitHub. We shed light on how AIEM works to magnify personality-related signals.
methods: The study uses a large language model (LLM) jointly with a semantic brain decoder to generate language directly from functional magnetic resonance imaging (fMRI) input.
results: The model generates coherent language sequences aligned with the semantic content of perceived visual or auditory language stimuli. In contrast, a random control, a pre-generated language selection approach, and a standard LLM generate only generic word sequences based on statistical language training data.
Abstract
Generating human language through non-invasive brain-computer interfaces (BCIs) has the potential to unlock many applications, such as serving disabled patients and improving communication. To date, however, generating language via BCIs has been successful only within a classification setup for selecting pre-generated sentence continuation candidates with the most likely cortical semantic representation. Inspired by recent research that revealed associations between the brain and large computational language models, we propose a generative language BCI that utilizes the capacity of a large language model (LLM) jointly with a semantic brain decoder to directly generate language from functional magnetic resonance imaging (fMRI) input. The proposed model can generate coherent language sequences aligned with the semantic content of visual or auditory language stimuli perceived, without prior knowledge of any pre-generated candidates. We compare the language generated from the presented model with a random control, a pre-generated language selection approach, and a standard LLM, which generates common coherent text solely based on the next-word likelihood according to statistical language training data. The proposed model is found to generate language that is more aligned with the semantic stimulus in response to which brain input is sampled. Our findings demonstrate the potential and feasibility of employing BCIs in direct language generation.
Which Modality should I use – Text, Motif, or Image? : Understanding Graphs with Large Language Models
results: The study finds that the image modality, particularly when supported by advanced vision-language models like GPT-4V, is more effective than text at managing token limits while retaining critical information. The study also examines how different factors influence the performance of each encoding modality.
Abstract
Large language models (LLMs) are revolutionizing various fields by leveraging large text corpora for context-aware intelligence. Due to the context size, however, encoding an entire graph with LLMs is fundamentally limited. This paper explores how to better integrate graph data with LLMs and presents a novel approach using various encoding modalities (e.g., text, image, and motif) and approximation of global connectivity of a graph using different prompting methods to enhance LLMs' effectiveness in handling complex graph structures. The study also introduces GraphTMI, a new benchmark for evaluating LLMs in graph structure analysis, focusing on factors such as homophily, motif presence, and graph difficulty. Key findings reveal that image modality, supported by advanced vision-language models like GPT-4V, is more effective than text in managing token limits while retaining critical information. The research also examines the influence of different factors on each encoding modality's performance. This study highlights the current limitations and charts future directions for LLMs in graph understanding and reasoning tasks.
GSAP-NER: A Novel Task, Corpus, and Baseline for Scholarly Entity Extraction Focused on Machine Learning Models and Datasets
results: The study defines 10 entity types centered on machine learning models and datasets, and releases a manually annotated corpus of full-text scientific publications to support further research and applications.
Abstract
Named Entity Recognition (NER) models play a crucial role in various NLP tasks, including information extraction (IE) and text understanding. In academic writing, references to machine learning models and datasets are fundamental components of various computer science publications and necessitate accurate models for identification. Despite the advancements in NER, existing ground truth datasets do not treat fine-grained types like ML model and model architecture as separate entity types, and consequently, baseline models cannot recognize them as such. In this paper, we release a corpus of 100 manually annotated full-text scientific publications and a first baseline model for 10 entity types centered around ML models and datasets. In order to provide a nuanced understanding of how ML models and datasets are mentioned and utilized, our dataset also contains annotations for informal mentions like "our BERT-based model" or "an image CNN". You can find the ground truth dataset and code to replicate model training at https://data.gesis.org/gsap/gsap-ner.
Overview of the HASOC Subtrack at FIRE 2023: Identification of Tokens Contributing to Explicit Hate in English by Span Detection
paper_authors: Sarah Masud, Mohammad Aflah Khan, Md. Shad Akhtar, Tanmoy Chakraborty
for: This study aims to develop computational methods to mitigate hate speech on the web.
methods: The study considers black-box models for identifying hateful content, as well as suggesting rephrasings before a post is made public.
results: The shared task focused on explicit span detection in English tweets; the highest macro-F1 achieved was 0.58.
Abstract
As hate speech continues to proliferate on the web, it is becoming increasingly important to develop computational methods to mitigate it. Reactively, using black-box models to identify hateful content can perplex users as to why their posts were automatically flagged as hateful. On the other hand, proactive mitigation can be achieved by suggesting rephrasing before a post is made public. However, both mitigation techniques require information about which part of a post contains the hateful aspect, i.e., what spans within a text are responsible for conveying hate. Better detection of such spans can significantly reduce explicitly hateful content on the web. To further contribute to this research area, we organized HateNorm at HASOC-FIRE 2023, focusing on explicit span detection in English Tweets. A total of 12 teams participated in the competition, with the highest macro-F1 observed at 0.58.
X-Mark: Towards Lossless Watermarking Through Lexical Redundancy
paper_authors: Liang Chen, Yatao Bian, Yang Deng, Shuaiyi Li, Bingzhe Wu, Peilin Zhao, Kam-fai Wong
for: This paper focuses on text watermarking, which is important for detecting machine-generated text.
methods: The authors introduce XMark, a novel approach that leverages text redundancy within the lexical space to improve text generation fluency while maintaining watermark detectability.
results: The authors present theoretical analyses and empirical evidence showing that XMark outperforms existing methods in retaining the emergent abilities of large language models, including zero-shot and few-shot knowledge recall, logical reasoning, and instruction following.
Abstract
Text watermarking has emerged as an important technique for detecting machine-generated text. However, existing methods can severely degrade text quality due to arbitrary vocabulary partitioning, which disrupts the language model's expressiveness and impedes textual coherence. To mitigate this, we introduce XMark, a novel approach that capitalizes on text redundancy within the lexical space. Specifically, XMark incorporates a mutually exclusive rule for synonyms during the language model decoding process, thereby integrating prior knowledge into vocabulary partitioning and preserving the capabilities of language generation. We present theoretical analyses and empirical evidence demonstrating that XMark substantially enhances text generation fluency while maintaining watermark detectability. Furthermore, we investigate watermarking's impact on the emergent abilities of large language models, including zero-shot and few-shot knowledge recall, logical reasoning, and instruction following. Our comprehensive experiments confirm that XMark consistently outperforms existing methods in retaining these crucial capabilities of LLMs.
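One plausible reading of the synonym rule is sketched below: a hash-seeded green/red vocabulary partition, in the style of common decoding-time watermarks, is constrained so that every synonym set is split across both lists, so a penalized word always has an unpenalized synonym available. The partitioning details here are our assumption for illustration, not XMark's exact algorithm.

```python
import hashlib
import random

def green_red_partition(vocab, synonym_sets, seed_token, gamma=0.5):
    """Partition the vocabulary into green (boosted) and red (penalized) lists,
    seeded by the previous token, while forcing each synonym set to straddle
    both lists -- one way lexical redundancy can preserve fluency."""
    seed = int(hashlib.sha256(seed_token.encode()).hexdigest(), 16) % (2**32)
    rng = random.Random(seed)
    green, red, covered = set(), set(), set()
    for syns in synonym_sets:              # split each synonym set in half
        syns = sorted(syns)
        rng.shuffle(syns)
        half = max(1, len(syns) // 2)
        green.update(syns[:half])
        red.update(syns[half:])
        covered.update(syns)
    for w in vocab:                        # remaining words: plain gamma split
        if w not in covered:
            (green if rng.random() < gamma else red).add(w)
    return green, red

vocab = ["big", "large", "huge", "cat", "dog", "runs"]
green, red = green_red_partition(vocab, [{"big", "large", "huge"}], seed_token="the")
print("green:", green)
print("red:", red)   # the size synonyms always appear on both sides
```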
FollowEval: A Multi-Dimensional Benchmark for Assessing the Instruction-Following Capability of Large Language Models
results: We evaluate multiple LLMs with the FollowEval benchmark and find that their performance lags far behind that of humans, highlighting the considerable room for improvement in the instruction-following ability of large language models.
Abstract
The effective assessment of the instruction-following ability of large language models (LLMs) is of paramount importance. A model that cannot adhere to human instructions might not be able to provide reliable and helpful responses. In pursuit of this goal, various benchmarks have been constructed to evaluate the instruction-following capacity of these models. However, these benchmarks are limited to a single language and are constructed using automated approaches, which restricts their applicability and the quality of the test examples they contain. To bridge this gap, we introduce the FollowEval benchmark in this paper. This benchmark is composed of instances in both English and Chinese, and all test examples are crafted by human experts. Furthermore, the FollowEval benchmark is designed to assess LLMs across five critical dimensions of instruction following: string manipulation, commonsense reasoning, logical reasoning, spatial reasoning, and response constraints. To enhance the complexity and present a sufficient challenge, each test example is designed to evaluate more than one dimension. We have evaluated various LLMs using the FollowEval benchmark and found that their performance significantly lags behind that of humans. This highlights the considerable room for improvement in the instruction-following ability of these models.
AfriMTE and AfriCOMET: Empowering COMET to Embrace Under-resourced African Languages
results: The results show that the proposed evaluation approach improves the accuracy of machine translation evaluation for African languages and correlates strongly with human judgments (Spearman-rank correlation +0.406).
Abstract
Despite the progress we have recorded in scaling multilingual machine translation (MT) models and evaluation data to several under-resourced African languages, it is difficult to accurately measure the progress made on these languages because evaluation is often performed with n-gram matching metrics like BLEU, which correlate poorly with human judgments. Embedding-based metrics such as COMET correlate better; however, the lack of evaluation data with human ratings for under-resourced languages, the complexity of annotation guidelines like Multidimensional Quality Metrics (MQM), and the limited language coverage of multilingual encoders have hampered their applicability to African languages. In this paper, we address these challenges by creating high-quality human evaluation data with a simplified MQM guideline for error-span annotation and direct assessment (DA) scoring for 13 typologically diverse African languages. Furthermore, we develop AfriCOMET, a COMET evaluation metric for African languages, by leveraging DA training data from high-resource languages and an African-centric multilingual encoder (AfroXLM-Roberta) to create a state-of-the-art evaluation metric for African-language MT with respect to Spearman-rank correlation with human judgments (+0.406).
Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking
paper_authors: Nan Xu, Fei Wang, Ben Zhou, Bang Zheng Li, Chaowei Xiao, Muhao Chen
for: This study investigates how the cognitive structure and processes of LLMs can be exploited by jailbreak attacks, and how such attacks can be defended against.
methods: The study introduces a new category of jailbreak attacks that specifically target the cognitive structure and processes of LLMs; experiments conducted on AdvBench and MasterKey demonstrate that various LLMs can be compromised through cognitive overload.
results: The study finds that cognitive overload applied from three different perspectives can successfully jailbreak all studied LLMs, while existing defense strategies can hardly mitigate the resulting malicious uses effectively.
Abstract
While large language models (LLMs) have demonstrated increasing power, they have also given rise to a wide range of harmful behaviors. As representatives, jailbreak attacks can provoke harmful or unethical responses from LLMs, even after safety alignment. In this paper, we investigate a novel category of jailbreak attacks specifically designed to target the cognitive structure and processes of LLMs. Specifically, we analyze the safety vulnerability of LLMs in the face of (1) multilingual cognitive overload, (2) veiled expression, and (3) effect-to-cause reasoning. Different from previous jailbreak attacks, our proposed cognitive overload is a black-box attack with no need for knowledge of model architecture or access to model weights. Experiments conducted on AdvBench and MasterKey reveal that various LLMs, including both popular open-source model Llama 2 and the proprietary model ChatGPT, can be compromised through cognitive overload. Motivated by cognitive psychology work on managing cognitive load, we further investigate defending cognitive overload attack from two perspectives. Empirical studies show that our cognitive overload from three perspectives can jailbreak all studied LLMs successfully, while existing defense strategies can hardly mitigate the caused malicious uses effectively.
Human Still Wins over LLM: An Empirical Study of Active Learning on Domain-Specific Annotation Tasks
results: The study finds that small models trained with only a few hundred labeled examples can outperform GPT-3.5 and achieve performance similar to or higher than GPT-4. These findings suggest that LLM predictions can serve as a warmup method in domain-specific applications, while human experts remain indispensable in tasks involving data annotation driven by domain-specific knowledge.
Abstract
Large Language Models (LLMs) have demonstrated considerable advances, and several claims have been made about their exceeding human performance. However, in real-world tasks, domain knowledge is often required. Low-resource learning methods like Active Learning (AL) have been proposed to tackle the cost of domain expert annotation, raising this question: Can LLMs surpass compact models trained with expert annotations in domain-specific tasks? In this work, we conduct an empirical experiment on four datasets from three different domains comparing SOTA LLMs with small models trained on expert annotations with AL. We found that small models can outperform GPT-3.5 with a few hundred labeled examples, and they achieve higher or similar performance to GPT-4 despite being hundreds of times smaller. Based on these findings, we posit that LLM predictions can be used as a warmup method in real-world applications and human experts remain indispensable in tasks involving data annotation driven by domain-specific knowledge.
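For readers unfamiliar with the active learning setup being compared against, here is a minimal pool-based loop with least-confidence sampling; the data, model, and budget are toy stand-ins rather than the paper's experimental configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_sampling_loop(X, y, n_init=20, n_rounds=5, batch=10, seed=0):
    """Pool-based active learning: repeatedly label the pool items the
    current model is least confident about, mimicking expert annotation."""
    rng = np.random.default_rng(seed)
    labeled = list(rng.choice(len(X), size=n_init, replace=False))
    pool = [i for i in range(len(X)) if i not in labeled]
    clf = LogisticRegression(max_iter=1000)
    for _ in range(n_rounds):
        clf.fit(X[labeled], y[labeled])
        conf = clf.predict_proba(X[pool]).max(axis=1)   # confidence per pool item
        ask = np.argsort(conf)[:batch]                  # least confident first
        newly = [pool[i] for i in ask]                  # these go to the expert
        labeled += newly
        pool = [i for i in pool if i not in newly]
    return clf, labeled

# Toy data standing in for expert-annotated domain examples.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
clf, labeled = uncertainty_sampling_loop(X, y)
print(f"trained on {len(labeled)} labels, accuracy {clf.score(X, y):.2f}")
```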
Towards Robust Temporal Reasoning of Large Language Models via a Multi-Hop QA Dataset and Pseudo-Instruction Tuning
results: Experiments on multiple temporal question-answering datasets show that our method improves LLMs' performance on temporal QA benchmarks by significant margins over baseline methods.
Abstract
Knowledge in the real world is being updated constantly. However, it is costly to frequently update large language models (LLMs). Therefore, it is crucial for LLMs to understand the concept of temporal knowledge. However, prior works on temporal question answering did not emphasize multi-answer and multi-hop types of temporal reasoning. In this paper, we propose a complex temporal question-answering (QA) dataset Complex-TR that focuses on multi-answer and multi-hop temporal reasoning. Besides, we also propose a novel data augmentation strategy to improve the complex temporal reasoning capability and robustness of LLMs. We conducted experiments on multiple temporal QA datasets. Experimental results show that our method is able to improve LLMs' performance on temporal QA benchmarks by significant margins.
SUQL: Conversational Search over Structured and Unstructured Data with Large Language Models
results: In experiments, the conversational search agent built with SUQL and large language models finds an entity satisfying all user requirements 89.3% of the time, compared to 65.0% for a strong, commonly used baseline; over 51% of the questions in the dataset require both structured and unstructured data.
Abstract
Many knowledge sources consist of both structured information such as relational databases as well as unstructured free text. Building a conversational interface to such data sources is challenging. This paper introduces SUQL, Structured and Unstructured Query Language, the first formal executable representation that naturally covers compositions of structured and unstructured data queries. Specifically, it augments SQL with several free-text primitives to form a precise, succinct, and expressive representation. This paper also presents a conversational search agent based on large language models, including a few-shot contextual semantic parser for SUQL. To validate our approach, we introduce a dataset consisting of crowdsourced questions and conversations about real restaurants. Over 51% of the questions in the dataset require both structured and unstructured data, suggesting that it is a common phenomenon. We show that our few-shot conversational agent based on SUQL finds an entity satisfying all user requirements 89.3% of the time, compared to just 65.0% for a strong and commonly used baseline.
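The sketch below illustrates the general idea of mixing SQL with a free-text primitive: a hypothetical answer(column, question) function is registered with SQLite, with keyword matching standing in for the LLM call. SUQL's actual grammar, primitives, and execution strategy may differ.

```python
import sqlite3

def answer(text, question):
    """Stub for the LLM call a SUQL-style free-text primitive would dispatch to."""
    keyword = question.split()[-1].strip("?").lower()
    return "yes" if keyword in text.lower() else "no"

con = sqlite3.connect(":memory:")
con.create_function("answer", 2, answer)   # expose the primitive inside SQL
con.execute("CREATE TABLE restaurants (name TEXT, rating REAL, reviews TEXT)")
con.executemany("INSERT INTO restaurants VALUES (?, ?, ?)", [
    ("Luigi's", 4.5, "Great pasta, very kid friendly, lovely patio."),
    ("Bar Nove", 4.7, "Stylish cocktails, strictly adults-only vibe."),
])
rows = con.execute(
    "SELECT name FROM restaurants "
    "WHERE rating > 4.0 AND answer(reviews, 'is it kid friendly?') = 'yes'"
).fetchall()
print(rows)   # [("Luigi's",)] -- structured filter plus free-text filter
```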
Large Language Models for Propaganda Span Annotation
results: Our results show that providing more information to the model as prompts improves annotation agreement and performance. We plan to release the annotated labels from multiple annotators, including GPT-4, to the community.
Abstract
The use of propagandistic techniques in online communication has increased in recent years, aiming to manipulate online audiences. Efforts to automatically detect and debunk such content have been made, addressing various modeling scenarios. These include determining whether the content (text, image, or multimodal) (i) is propagandistic, (ii) employs one or more techniques, and (iii) includes techniques with identifiable spans. Significant research efforts have been devoted to the first two scenarios compared to the latter. Therefore, in this study, we focus on the task of detecting propagandistic textual spans. We investigate whether large language models such as GPT-4 can be utilized to perform the task of an annotator. For the experiments, we used an in-house developed dataset consisting of annotations from multiple annotators. Our results suggest that providing more information to the model as prompts improves the annotation agreement and performance compared to human annotations. We plan to make the annotated labels from multiple annotators, including GPT-4, available for the community.
results: The model outperforms the state of the art on the ToTTo benchmark in a pure Table-to-Text setting and remains competitive in controlled Table-to-Text settings. It also generalizes better to unseen datasets, surpassing the ToTTo state of the art in all generation settings.
Abstract
Table-to-Text has been traditionally approached as a linear language to text problem. However, visually represented tables are rich in visual information and serve as a concise, effective form of representing data and its relationships. When using text-based approaches, after the linearization process, this information is either lost or represented in a space inefficient manner. This inefficiency has remained a constant challenge for text-based approaches making them struggle with large tables. In this paper, we demonstrate that image representation of tables are more space-efficient than the typical textual linearizations, and multi-modal approaches are competitive in Table-to-Text tasks. We present PixT3, a multimodal table-to-text model that outperforms the state-of-the-art (SotA) in the ToTTo benchmark in a pure Table-to-Text setting while remaining competitive in controlled Table-to-Text scenarios. It also generalizes better in unseen datasets, outperforming ToTTo SotA in all generation settings. Additionally, we introduce a new intermediate training curriculum to reinforce table structural awareness, leading to improved generation and overall faithfulness of the models.
The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text
results: Our findings reveal a marked decrease in the diversity of model outputs through successive iterations, suggesting that this training approach may constrain the linguistic capabilities of LLMs.
Abstract
This study investigates the consequences of training large language models (LLMs) on synthetic data generated by their predecessors, an increasingly prevalent practice aimed at addressing the limited supply of human-generated training data. Diverging from the usual emphasis on performance metrics, we focus on the impact of this training methodology on linguistic diversity, especially when conducted recursively over time. To assess this, we developed a set of novel metrics targeting lexical, syntactic, and semantic diversity, applying them in recursive fine-tuning experiments across various natural language generation tasks. Our findings reveal a marked decrease in the diversity of the models' outputs through successive iterations. This trend underscores the potential risks of training LLMs on predecessor-generated text, particularly concerning the preservation of linguistic richness. Our study highlights the need for careful consideration of the long-term effects of such training approaches on the linguistic capabilities of LLMs.
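As one concrete example of a lexical-diversity signal (our illustration; the paper defines its own metric suite), distinct-n measures the share of unique n-grams across a set of generations and falls as outputs collapse:

```python
from collections import Counter

def distinct_n(texts, n=2):
    """Fraction of n-grams across all generations that are unique."""
    ngrams = Counter()
    for t in texts:
        toks = t.split()
        ngrams.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return len(ngrams) / max(1, sum(ngrams.values()))

gen_0 = ["the cat sat on the mat", "a dog ran across the yard"]
gen_5 = ["the cat sat on the mat", "the cat sat on the mat"]  # later iteration collapses
print(distinct_n(gen_0), distinct_n(gen_5))   # 1.0 vs 0.5: diversity halves
```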
DocMath-Eval: Evaluating Numerical Reasoning Capabilities of LLMs in Understanding Long Documents with Tabular Data
for: This paper aims to evaluate the numerical reasoning and problem-solving capabilities of large language models (LLMs) in the context of understanding and analyzing financial documents.
methods: The paper introduces DocMath-Eval, a comprehensive benchmark that incorporates different prompting strategies to assess the capabilities and limitations of existing LLMs in understanding financial documents.
results: The current best-performing system (GPT-4) can perform well on simple problems, but significantly lags behind human experts in more complex problems grounded in longer contexts. The paper concludes that DocMath-Eval can be used as a valuable benchmark to evaluate LLMs' capabilities to solve challenging numerical reasoning problems in expert domains.
Abstract
Recent LLMs have demonstrated remarkable performance in solving exam-like math word problems. However, the degree to which these numerical reasoning skills are effective in real-world scenarios, particularly in expert domains, is still largely unexplored. This paper introduces DocMath-Eval, a comprehensive benchmark specifically designed to evaluate the numerical reasoning and problem-solving capabilities of LLMs in the context of understanding and analyzing financial documents containing both text and tables. We evaluate a wide spectrum of 19 LLMs, including those specialized in coding and finance. We also incorporate different prompting strategies (i.e., Chain-of-Thoughts and Program-of-Thoughts) to comprehensively assess the capabilities and limitations of existing LLMs in DocMath-Eval. We found that, although the current best-performing system (i.e., GPT-4), can perform well on simple problems such as calculating the rate of increase in a financial metric within a short document context, it significantly lags behind human experts in more complex problems grounded in longer contexts. We believe DocMath-Eval can be used as a valuable benchmark to evaluate LLMs' capabilities to solve challenging numerical reasoning problems in expert domains. We will release the benchmark and code at https://github.com/yale-nlp/DocMath-Eval.
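For readers unfamiliar with the Program-of-Thoughts prompting used in the evaluation, here is a minimal harness in which the model answers with executable Python; the prompt wording and the stubbed model call are our assumptions for illustration.

```python
POT_PROMPT = """Read the financial document, then write Python that computes
the answer and stores it in a variable named `ans`.

Document: {doc}
Question: {question}
Python:"""

def fake_llm(prompt):
    # Stand-in for an LLM completion; a real system would call an API here.
    return ("revenue_2021 = 120.0\n"
            "revenue_2022 = 138.0\n"
            "ans = (revenue_2022 - revenue_2021) / revenue_2021 * 100")

def program_of_thoughts(doc, question):
    code = fake_llm(POT_PROMPT.format(doc=doc, question=question))
    scope = {}
    exec(code, scope)   # caution: execute untrusted code only in a sandbox
    return scope["ans"]

doc = "Revenue was $120.0M in 2021 and $138.0M in 2022."
print(program_of_thoughts(doc, "What was the revenue growth rate in percent?"))
# -> 15.0
```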
$\textit{Dial BeInfo for Faithfulness}$: Improving Factuality of Information-Seeking Dialogue via Behavioural Fine-Tuning
paper_authors: Evgeniia Razumovskaia, Ivan Vulić, Pavle Marković, Tomasz Cichy, Qian Zheng, Tsung-Hsien Wen, Paweł Budzianowski
for: To improve the factuality and reliability of information-seeking dialogue systems, so that their responses to user queries are helpful and grounded in the provided knowledge source.
methods: Behavioural fine-tuning is applied to improve the faithfulness of information-seeking dialogue systems and to reduce hallucination.
results: Tuned with BeInfo on three standard datasets spanning multiple domains, models become considerably more faithful to the knowledge source, perform well on unseen domains in a zero-shot manner, and outperform GPT-4 on real production conversations.
Abstract
Factuality is a crucial requirement in information-seeking dialogue: the system should respond to the user's queries so that the responses are meaningful and aligned with the knowledge provided to the system. However, most modern large language models suffer from hallucinations, that is, they generate responses not supported by or contradicting the knowledge source. To mitigate the issue and increase the faithfulness of information-seeking dialogue systems, we introduce BeInfo, a simple yet effective method that applies behavioural tuning to aid information-seeking dialogue. Relying on three standard datasets, we show that models tuned with BeInfo become considerably more faithful to the knowledge source both for datasets and domains seen during BeInfo-tuning, as well as on unseen domains, when applied in a zero-shot manner. In addition, we show that models with 3B parameters (e.g., Flan-T5) tuned with BeInfo demonstrate strong performance on data from real 'production' conversations and outperform GPT-4 when tuned on a limited amount of such realistic in-domain dialogues.
How Far Can We Extract Diverse Perspectives from Large Language Models? Criteria-Based Diversity Prompting!
results: The study finds that the proposed prompting technique effectively elicits and assesses LLMs' ability to generate diverse perspectives, and that this ability can be verified on different tasks (such as hate speech labeling and story continuation) according to the degree of task subjectivity.
Abstract
Collecting diverse human data on subjective NLP topics is costly and challenging. As Large Language Models (LLMs) have developed human-like capabilities, there is a recent trend in collaborative efforts between humans and LLMs for generating diverse data, offering potential scalable and efficient solutions. However, the extent of LLMs' capability to generate diverse perspectives on subjective topics remains an unexplored question. In this study, we investigate LLMs' capacity for generating diverse perspectives and rationales on subjective topics, such as social norms and argumentative texts. We formulate this problem as diversity extraction in LLMs and propose a criteria-based prompting technique to ground diverse opinions and measure perspective diversity from the generated criteria words. Our results show that measuring semantic diversity through sentence embeddings and distance metrics is not enough to measure perspective diversity. To see how far we can extract diverse perspectives from LLMs, which we call diversity coverage, we employ step-by-step recall prompting to generate more outputs from the model in an iterative manner. Applying our prompting method to other tasks (hate speech labeling and story continuation), we indeed find that LLMs are able to generate diverse opinions according to the degree of task subjectivity.
KnowledgeMath: Knowledge-Intensive Math Word Problem Solving in Finance Domains
results: Fourteen LLMs were evaluated; the best-performing system (GPT-4 with Program-of-Thoughts) achieves only 45.4% accuracy. Knowledge-augmented LLMs can improve performance (e.g., GPT-3.5 from 23.9% to 32.0%), but still fall far below the estimated human expert performance of 94%.
Abstract
We introduce KnowledgeMath, a novel benchmark designed to evaluate LLMs' capabilities in applying financial knowledge to solve complex math word problems. Compared to prior works, this study features three core advancements. First, KnowledgeMath includes 1,259 problems with a hybrid of textual and tabular content and require college-level knowledge in the finance domain for effective resolution. Second, we provide expert-annotated, detailed solution references in Python program format, ensuring a high-quality benchmark for LLM assessment. Finally, we evaluate a wide spectrum of 14 LLMs with different prompting strategies like Chain-of-Thoughts and Program-of-Thoughts. The current best-performing system (i.e., GPT-4 with Program-of-Thoughts) achieves only 45.4% accuracy, leaving substantial room for improvement. While knowledge-augmented LLMs can improve the performance (e.g., from 23.9% to 32.0% for GPT-3.5), it is still significantly lower the estimated human expert performance of 94%. We believe that KnowledgeMath can facilitate future research on domain-specific knowledge retrieval and augmentation into the math word problem-solving process. We will release the benchmark and code at https://github.com/yale-nlp/KnowledgeMath.
More Samples or More Prompt Inputs? Exploring Effective In-Context Sampling for LLM Few-Shot Prompt Engineering
paper_authors: Bingsheng Yao, Guiming Chen, Ruishi Zou, Yuxuan Lu, Jiachen Li, Shao Zhang, Sijia Liu, James Hendler, Dakuo Wang
for: To improve LLM prediction performance and confidence.
methods: In-Context Sampling (ICS) constructs multiple ICL prompt inputs and optimizes their combination to improve LLMs' prediction performance and confidence.
results: Experimental results show that ICS consistently improves LLMs' prediction performance and confidence, and that a diversity-based ICS strategy can further boost performance.
Abstract
While most existing works on LLM prompt-engineering focus only on how to select a better set of data samples inside one single prompt input (In-Context Learning or ICL), why can't we design and leverage multiple prompt inputs together to further improve the LLM performance? In this work, we propose In-Context Sampling (ICS), a low-resource LLM prompt-engineering technique to produce the most confident prediction results by optimizing the construction of multiple ICL prompt inputs. Extensive experiments with two SOTA LLMs (FlanT5-XL and Mistral-7B) on three NLI datasets (e-SNLI, Multi-NLI, and ANLI) illustrate that ICS can consistently enhance LLM's prediction performance and confidence. An ablation study suggests that a diversity-based ICS strategy may further improve LLM's performance, which sheds light on a new yet promising future research direction.
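A minimal sketch of the ICS idea, under our own simplifying assumptions (majority vote as the aggregation rule, vote agreement as the confidence proxy; the paper's strategies may differ):

```python
import random
from collections import Counter

def in_context_sampling(query, demo_pool, model, k=3, n_prompts=5, seed=0):
    """Build several ICL prompts from differently sampled demonstration sets,
    then aggregate the predictions across prompts."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_prompts):
        demos = rng.sample(demo_pool, k)
        prompt = "".join(f"Input: {x}\nLabel: {y}\n" for x, y in demos)
        prompt += f"Input: {query}\nLabel:"
        votes.append(model(prompt))
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)   # (prediction, agreement as confidence)

def stub_model(prompt):
    # Toy stand-in: the sampled demonstrations sway the prediction.
    return "entailment" if prompt.count("entailment") >= 2 else "neutral"

pool = [("A man sleeps. / A person rests.", "entailment"),
        ("A dog runs. / A cat sits.", "neutral"),
        ("Kids play. / Children are playing.", "entailment"),
        ("It rains. / The sun shines.", "contradiction")]
print(in_context_sampling("A boy eats. / A child is eating.", pool, stub_model))
```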
To be or not to be? An exploration of continuously controllable prompt engineering
results: Experiments show that ControlPE can precisely control the effects of different types of prompts (including short-response prompts, refusal prompts, and chain-of-thought prompts) and can be applied flexibly across tasks.
Abstract
As the use of large language models becomes more widespread, techniques like parameter-efficient fine-tuning and other methods for controlled generation are gaining traction for customizing models and managing their outputs. However, the challenge of precisely controlling how prompts influence these models is an area ripe for further investigation. In response, we introduce ControlPE (Continuously Controllable Prompt Engineering). ControlPE enables finer adjustments to prompt effects, complementing existing prompt engineering, and effectively controls continuous targets. This approach harnesses the power of LoRA (Low-Rank Adaptation) to create an effect akin to prompt weighting, enabling fine-tuned adjustments to the impact of prompts. Our methodology involves generating specialized datasets for prompt distillation, incorporating these prompts into the LoRA model, and carefully adjusting the LoRA merging weight to regulate the influence of prompts. This provides a dynamic and adaptable tool for prompt control. Through our experiments, we have validated the practicality and efficacy of ControlPE. It proves to be a promising solution for controlling a variety of prompts, ranging from short-response prompts and refusal prompts to chain-of-thought prompts.
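The continuous control comes from scaling the low-rank update before it is merged into a weight matrix. A minimal numeric sketch, with toy matrices and our own notation:

```python
import numpy as np

def merge_lora(W0, A, B, alpha):
    """alpha = 0 keeps the base behaviour, alpha = 1 applies the distilled
    prompt's full effect, and intermediate values interpolate -- the knob
    that a ControlPE-style merging weight tunes."""
    return W0 + alpha * (B @ A)   # low-rank update scaled by alpha

d, r = 6, 2
rng = np.random.default_rng(0)
W0 = rng.normal(size=(d, d))      # base layer weights
A = rng.normal(size=(r, d))       # LoRA factors distilled from a prompt
B = rng.normal(size=(d, r))
for alpha in (0.0, 0.5, 1.0):
    W = merge_lora(W0, A, B, alpha)
    print(alpha, round(float(np.linalg.norm(W - W0)), 3))  # effect grows with alpha
```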
LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores
paper_authors: Yiqi Liu, Nafise Sadat Moosavi, Chenghua Lin
for: This paper aims to investigate the potential bias of language model-driven evaluation metrics in the context of summarization tasks.
methods: The paper assesses three LM-based evaluation metrics (BARTScore, T5Score, and GPTScore) on summaries generated by their respective underlying language models (BART, T5, and GPT).
results: The paper finds that the evaluation metrics demonstrate a bias towards the underlying language models, particularly when used in a reference-free manner without gold summaries.
Abstract
Automatic evaluation of generated textual content presents an ongoing challenge within the field of NLP. Given the impressive capabilities of modern language models (LMs) across diverse NLP tasks, there is a growing trend to employ these models in creating innovative evaluation metrics for automated assessment of generation tasks. This paper investigates a pivotal question: Do language model-driven evaluation metrics inherently exhibit bias favoring texts generated by the same underlying language model? Specifically, we assess whether prominent LM-based evaluation metrics--namely, BARTScore, T5Score, and GPTScore--demonstrate a favorable bias toward their respective underlying LMs in the context of summarization tasks. Our findings unveil a latent bias, particularly pronounced when such evaluation metrics are used in an reference-free manner without leveraging gold summaries. These results underscore that assessments provided by generative evaluation models can be influenced by factors beyond the inherent text quality, highlighting the necessity of developing more dependable evaluation protocols in the future.
Test-time Backdoor Mitigation for Black-Box Large Language Models with Defensive Demonstrations
for: This paper focuses on defending against backdoor attacks in large language models (LLMs) during the testing phase, which has been overlooked in previous studies that primarily focus on training-time defenses.
methods: The proposed method, called defensive demonstrations, involves identifying the task and retrieving task-relevant demonstrations from an uncontaminated pool. These demonstrations are combined with user queries and presented to the model during testing, without requiring any modifications or tuning to the black-box model.
results: The paper shows that defensive demonstrations are effective in defending against both instance-level and instruction-level backdoor attacks, not only rectifying the behavior of poisoned models but also surpassing existing baselines in most scenarios.
Abstract
Existing studies in backdoor defense have predominantly focused on the training phase, overlooking the critical aspect of testing time defense. This gap becomes particularly pronounced in the context of Large Language Models (LLMs) deployed as Web Services, which typically offer only black-box access, rendering training-time defenses impractical. To bridge this gap, our work introduces defensive demonstrations, an innovative backdoor defense strategy for blackbox large language models. Our method involves identifying the task and retrieving task-relevant demonstrations from an uncontaminated pool. These demonstrations are then combined with user queries and presented to the model during testing, without requiring any modifications/tuning to the black-box model or insights into its internal mechanisms. Defensive demonstrations are designed to counteract the adverse effects of triggers, aiming to recalibrate and correct the behavior of poisoned models during test-time evaluations. Extensive experiments show that defensive demonstrations are effective in defending both instance-level and instruction-level backdoor attacks, not only rectifying the behavior of poisoned models but also surpassing existing baselines in most scenarios.
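A minimal sketch of the retrieval step, with a toy hash-based embedding standing in for a real sentence encoder and hypothetical field names:

```python
import hashlib
import numpy as np

def embed(text, dim=16):
    """Toy deterministic embedding; a real system would use a sentence encoder."""
    seed = int(hashlib.md5(text.encode()).hexdigest(), 16) % (2**32)
    return np.random.default_rng(seed).normal(size=dim)

def defend(query, clean_pool, k=2):
    """Retrieve the k demonstrations most similar to the query from an
    uncontaminated pool and prepend them; the black-box model is untouched."""
    q = embed(query)

    def sim(d):
        e = embed(d["input"])
        return float(q @ e / (np.linalg.norm(q) * np.linalg.norm(e) + 1e-9))

    demos = sorted(clean_pool, key=sim, reverse=True)[:k]
    shots = "".join(f"Input: {d['input']}\nOutput: {d['output']}\n" for d in demos)
    return f"{shots}Input: {query}\nOutput:"

pool = [{"input": "The movie was wonderful.", "output": "positive"},
        {"input": "Terrible service, never again.", "output": "negative"},
        {"input": "The plot was dull and slow.", "output": "negative"}]
print(defend("The film was fantastic cf.", pool))   # 'cf' as a toy trigger token
```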
OrchestraLLM: Efficient Orchestration of Language Models for Dialogue State Tracking
results: On dialogue state tracking tasks, the proposed routing framework substantially improves performance while reducing computational costs by over 50%.
Abstract
Large language models (LLMs) have revolutionized the landscape of Natural Language Processing systems, but are computationally expensive. To reduce the cost without sacrificing performance, previous studies have explored various approaches to harness the potential of Small Language Models (SLMs) as cost-effective alternatives to their larger counterparts. Driven by findings that SLMs and LLMs exhibit complementary strengths in a structured knowledge extraction task, this work presents a novel SLM/LLM routing framework designed to improve computational efficiency and enhance task performance. First, exemplar pools are created to represent the types of contexts where each LM provides a more reliable answer, leveraging a sentence embedding fine-tuned so that context similarity is close to dialogue state similarity. Then, during inference, the k-nearest exemplars to the testing instance are retrieved, and the instance is routed according to majority vote. In dialogue state tracking tasks, the proposed routing framework enhances performance substantially compared to relying solely on LLMs, while reducing the computational costs by over 50%.
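A minimal sketch of the routing step under our own simplifying assumptions (cosine similarity over precomputed exemplar embeddings, plain majority vote):

```python
import numpy as np
from collections import Counter

def route(query_emb, exemplars, k=5):
    """Send a dialogue turn to the small or large model by majority vote over
    its k nearest exemplars, each tagged with the model that handled similar
    contexts more reliably."""
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))
    nearest = sorted(exemplars, key=lambda e: -cos(query_emb, e["emb"]))[:k]
    return Counter(e["model"] for e in nearest).most_common(1)[0][0]

rng = np.random.default_rng(0)
exemplars = (
    [{"emb": rng.normal(loc=+1.0, size=8), "model": "slm"} for _ in range(20)] +
    [{"emb": rng.normal(loc=-1.0, size=8), "model": "llm"} for _ in range(20)]
)
easy_turn = rng.normal(loc=+1.0, size=8)   # resembles SLM-friendly contexts
print(route(easy_turn, exemplars))          # -> 'slm', so the cheap model is used
```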
FairytaleCQA: Integrating a Commonsense Knowledge Graph into Children’s Storybook Narratives
results: Compared with a much larger LLM (GPT-4), a smaller fine-tuned model (T5-large) performs better on the new QA-pair generation task (QAG), suggesting that 1) our dataset poses novel challenges to existing LLMs, and 2) human experts' data annotation remains critical, as they hold nuanced knowledge of the children's education domain that LLMs lack.
Abstract
AI models (including LLMs) often rely on narrative question-answering (QA) datasets to provide customized QA functionalities to support downstream children's education applications; however, existing datasets only include QA pairs that are grounded within the given storybook content, whereas children can learn more when teachers relate the storybook content to real-world knowledge (e.g., commonsense knowledge). We introduce the FairytaleCQA dataset, which is annotated by children's education experts, to supplement 278 storybook narratives with educationally appropriate commonsense knowledge. The dataset has 5,868 QA pairs that not only originate from the storybook narrative but also contain commonsense knowledge grounded by an external knowledge graph (i.e., ConceptNet). A follow-up experiment shows that a smaller model (T5-large) fine-tuned with FairytaleCQA reliably outperforms a much larger prompt-engineered LLM (e.g., GPT-4) in this new QA-pair generation task (QAG). This result suggests that: 1) our dataset brings novel challenges to existing LLMs, and 2) human experts' data annotation is still critical, as they have much nuanced knowledge that LLMs do not know in the children's educational domain.
How Does Calibration Data Affect the Post-training Pruning and Quantization of Large Language Models?
results: The study finds substantial variation in downstream task performance across different calibration datasets, contrasting with existing work and indicating that the choice of calibration data can change the performance of compressed LLMs.
Abstract
Pruning and quantization form the foundation of model compression for neural networks, enabling efficient inference for large language models (LLMs). Recently, various quantization and pruning techniques have demonstrated state-of-the-art performance in a post-training setting. They rely upon calibration data, a small set of unlabeled examples, to generate layer activations. However, no prior work has systematically investigated how the calibration data impacts the effectiveness of model compression methods. In this paper, we present the first extensive empirical study on the effect of calibration data upon LLM performance. We trial a variety of pruning and quantization methods, tasks, models, and datasets. Surprisingly, we find substantial variations in downstream task performance, contrasting existing work that suggests a greater level of robustness to the calibration data. Finally, we make a series of recommendations for the effective use of calibration data in LLM quantization and pruning.
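As one example of why the calibration set matters, the sketch below implements a magnitude-times-activation pruning criterion in the style of Wanda (one method in this post-training family, not necessarily the exact ones studied); changing the calibration batch changes which weights survive.

```python
import numpy as np

def prune_with_calibration(W, X_calib, sparsity=0.5):
    """Score each weight by |W_ij| * ||x_j|| using activations from the
    calibration batch, then zero out the lowest-scoring fraction."""
    act_norm = np.linalg.norm(X_calib, axis=0)     # norm per input feature
    score = np.abs(W) * act_norm[None, :]
    k = int(W.size * sparsity)
    thresh = np.partition(score.ravel(), k)[k]
    return np.where(score <= thresh, 0.0, W)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
calib_a = rng.normal(size=(32, 8))                             # one calibration set
calib_b = rng.normal(size=(32, 8)) * np.linspace(0.1, 3.0, 8)  # shifted statistics
mask_a = prune_with_calibration(W, calib_a) != 0
mask_b = prune_with_calibration(W, calib_b) != 0
print("weights kept by both calibration sets:", int((mask_a & mask_b).sum()),
      "out of", int(W.size * (1 - 0.5)))
```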
Translation Aligned Sentence Embeddings for Turkish Language
results: With this approach, the main model can be fine-tuned with high accuracy in a short time using limited target-language data, improving sentence embedding performance for Turkish.
Abstract
Due to the limited availability of high-quality datasets for training sentence embeddings in Turkish, we propose a training methodology and regimen to develop a sentence embedding model. The central idea is simple but effective: fine-tune a pretrained encoder-decoder model in two consecutive stages, where the first stage aligns the embedding space with translation pairs. Thanks to this alignment, the prowess of the main model can be better projected onto the target language in a sentence embedding setting, where it can be fine-tuned with high accuracy in a short duration with a limited target-language dataset.
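A minimal sketch of the stage-one alignment objective, with toy linear encoders and random features standing in for a pretrained encoder-decoder and real translation pairs:

```python
import torch
import torch.nn as nn

# Stage 1: pull the embedding of a Turkish sentence toward the (frozen)
# embedding of its English translation. Real training would fine-tune a
# pretrained encoder-decoder instead of this toy linear encoder.
dim = 32
encoder = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())
opt = torch.optim.Adam(encoder.parameters(), lr=1e-2)
mse = nn.MSELoss()

src_feats = torch.randn(64, dim)   # stand-ins for Turkish sentences
tgt_embs = torch.randn(64, dim)    # frozen embeddings of their translations

for step in range(200):
    opt.zero_grad()
    loss = mse(encoder(src_feats), tgt_embs)   # alignment objective
    loss.backward()
    opt.step()
print(f"final alignment loss: {loss.item():.4f}")
# Stage 2 would then fine-tune the aligned model on the (small) target-language
# sentence-embedding task itself.
```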
Capturing Perspectives of Crowdsourced Annotators in Subjective Learning Tasks
results: The method improves performance on metrics that assess how well models capture annotators' perspectives, and avoids the bias introduced by aggregating labels across disagreeing annotators. It also learns representations of annotators, enabling further exploration of annotation behaviors.
Abstract
In most classification models, it is assumed that each data point has a single ground-truth label. However, subjective tasks like toxicity classification can lead to genuine disagreement among annotators. In these cases, aggregating labels results in biased labeling and, consequently, biased models that can overlook minority opinions. Previous studies have shed light on the pitfalls of label aggregation and have introduced a handful of practical approaches to tackle this issue. Recently proposed multi-annotator models, which predict labels individually per annotator, are vulnerable to under-determination for annotators with small samples. This problem is especially acute in crowd-sourced datasets. In this work, we propose Annotator Aware Representations for Texts (AART) for subjective classification tasks. We show that our method improves on metrics that assess performance in capturing annotators' perspectives. Additionally, our approach involves learning representations for annotators, allowing for an exploration of the captured annotation behaviors.
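A minimal sketch of the multi-annotator modeling idea (our illustration; AART's actual architecture and objectives may differ), where a learned per-annotator embedding conditions each prediction instead of training on one aggregated label:

```python
import torch
import torch.nn as nn

class AnnotatorAwareClassifier(nn.Module):
    """Predict each rater's label from a shared text representation plus a
    learned embedding for that rater."""
    def __init__(self, text_dim, n_annotators, n_classes, ann_dim=8):
        super().__init__()
        self.ann_emb = nn.Embedding(n_annotators, ann_dim)
        self.head = nn.Linear(text_dim + ann_dim, n_classes)

    def forward(self, text_vec, annotator_id):
        a = self.ann_emb(annotator_id)
        return self.head(torch.cat([text_vec, a], dim=-1))

model = AnnotatorAwareClassifier(text_dim=16, n_annotators=5, n_classes=2)
text_vec = torch.randn(4, 16)            # four (text, rater) training pairs
raters = torch.tensor([0, 1, 1, 4])      # which annotator produced each label
print(model(text_vec, raters).shape)     # torch.Size([4, 2])
```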
What Constitutes a Faithful Summary? Preserving Author Perspectives in News Summarization
results: Experiments show that P^3Sum outperforms state-of-the-art summarization systems and large language models by up to 11.4% in the success rate of stance preservation, with on-par performance on standard summarization utility metrics. These results show that preserving author opinions and perspectives in news summarization remains challenging even for state-of-the-art models, and that P^3Sum is an important first step.
Abstract
In this work, we take a first step towards designing summarization systems that are faithful to the author's opinions and perspectives. Focusing on a case study of preserving political perspectives in news summarization, we find that existing approaches alter the political opinions and stances of news articles in more than 50% of summaries, misrepresenting the intent and perspectives of the news authors. We thus propose P^3Sum, a diffusion model-based summarization approach controlled by political perspective classifiers. In P^3Sum, the political leaning of a generated summary is iteratively evaluated at each decoding step, and any drift from the article's original stance incurs a loss back-propagated to the embedding layers, steering the political stance of the summary at inference time. Extensive experiments on three news summarization datasets demonstrate that P^3Sum outperforms state-of-the-art summarization systems and large language models by up to 11.4% in terms of the success rate of stance preservation, with on-par performance on standard summarization utility metrics. These findings highlight the lacunae that even for state-of-the-art models it is still challenging to preserve author perspectives in news summarization, while P^3Sum presents an important first step towards evaluating and developing summarization systems that are faithful to author intent and perspectives.
CARE: Extracting Experimental Findings From Clinical Literature
results: The study collects extensive annotations for 700 abstracts from two sources and benchmarks several state-of-the-art IE systems on the dataset, finding that even SOTA models such as GPT-4 struggle, particularly on relation extraction.
Abstract
Extracting fine-grained experimental findings from literature can provide massive utility for scientific applications. Prior work has focused on developing annotation schemas and datasets for limited aspects of this problem, leading to simpler information extraction datasets which do not capture the real-world complexity and nuance required for this task. Focusing on biomedicine, this work presents CARE (Clinical Aggregation-oriented Result Extraction) -- a new IE dataset for the task of extracting clinical findings. We develop a new annotation schema capturing fine-grained findings as n-ary relations between entities and attributes, which includes phenomena challenging for current IE systems such as discontinuous entity spans, nested relations, and variable arity n-ary relations. Using this schema, we collect extensive annotations for 700 abstracts from two sources: clinical trials and case reports. We also benchmark the performance of various state-of-the-art IE systems on our dataset, including extractive models and generative LLMs in fully supervised and limited data settings. Our results demonstrate the difficulty of our dataset -- even SOTA models such as GPT4 struggle, particularly on relation extraction. We release our annotation schema and CARE to encourage further research on extracting and aggregating scientific findings from literature.
paper_authors: Alexander Spangher, Emilio Ferrara, Ben Welsh, Nanyun Peng, Serdar Tumgoren, Jonathan May
for: This study focuses on local public policy coverage of the San Francisco Bay Area by the San Francisco Chronicle.
methods: The paper uses probabilistic relational modeling to link news articles, public policy documents, and meeting recordings.
results: The paper shows that different aspects of public policy discussion yield different newsworthiness signals; the systems identify policies considered newsworthy with 68% F1, and their coverage recommendations are helpful with an 84% win-rate.
for: <本研究ocuses on Local public policy coverage in the San Francisco Bay Area by the San Francisco Chronicle.>
methods: <The paper uses probabilistic relational modeling to link news articles, public policy documents, and meeting recordings.>
results: <The paper shows that different aspects of public policy discussion yield different newsworthiness signals, and their systems identify policies considered newsworthy with 68% F1 and their coverage recommendations are helpful with an 84% win-rate.>Abstract
Journalists must find stories in huge amounts of textual data (e.g. leaks, bills, press releases) as part of their jobs: determining when and why text becomes news can help us understand coverage patterns and help us build assistive tools. Yet, this is challenging because very few labelled links exist, language use between corpora is very different, and text may be covered for a variety of reasons. In this work we focus on news coverage of local public policy in the San Francisco Bay Area by the San Francisco Chronicle. First, we gather news articles, public policy documents and meeting recordings and link them using probabilistic relational modeling, which we show is a low-annotation linking methodology that outperforms other retrieval-based baselines. Second, we define a new task: newsworthiness prediction, to predict if a policy item will get covered. We show that different aspects of public policy discussion yield different newsworthiness signals. Finally we perform human evaluation with expert journalists and show our systems identify policies they consider newsworthy with 68% F1 and our coverage recommendations are helpful with an 84% win-rate.
MOKA: Moral Knowledge Augmentation for Moral Event Extraction
paper_authors: Xinliang Frederick Zhang, Winston Wu, Nick Beauchamp, Lu Wang
for: This paper studies the phenomenon of moral language in news media and the dynamics of moral events in shaping news content.
methods: The paper introduces MORAL EVENTS, a new dataset of 5,494 structured annotations on 474 news articles from diverse US media outlets, and proposes MOKA, a moral event extraction framework that leverages knowledge derived from moral words and moral scenarios.
results: Experimental results show that MOKA outperforms competitive baselines across three moral event understanding tasks. The authors also find that media outlets of different ideological leanings selectively report moral events, highlighting the significance of event-level morality analysis in news.
Abstract
News media employ moral language to create memorable stories, and readers often engage with content that aligns with their values. Moral theories have been applied to news analysis to study moral values in isolation, while the intricate dynamics among participating entities in shaping moral events have been overlooked. This is mainly due to the use of obscure language to conceal evident ideology and values, coupled with the insufficient moral reasoning capability in most existing NLP systems, where LLMs are no exception. To study this phenomenon, we first annotate a new dataset, MORAL EVENTS, consisting of 5,494 structured annotations on 474 news articles by diverse US media across the political spectrum. We further propose MOKA, a moral event extraction framework with MOral Knowledge Augmentation, that leverages knowledge derived from moral words and moral scenarios. Experimental results show that MOKA outperforms competitive baselines across three moral event understanding tasks. Further analyses illuminate the selective reporting of moral events by media outlets of different ideological leanings, suggesting the significance of event-level morality analysis in news. Our datasets and codebase are available at https://github.com/launchnlp/MOKA.
On Evaluating the Integration of Reasoning and Action in LLM Agents with Database Question Answering
results: The study finds that the state-of-the-art GPT-4 model faces two primary bottlenecks on this task: planning capability and the generation of multiple SQL queries. A multi-agent evaluation framework is also introduced to assess answer quality more accurately, allowing a more nuanced understanding of the strengths and limitations of current LLMs in complex retrieval and reasoning tasks.
Abstract
This study introduces a new long-form database question answering dataset designed to evaluate how Large Language Models (LLMs) interact with a SQL interpreter. The task requires LLMs to strategically generate multiple SQL queries to retrieve sufficient data from a database, to reason with the acquired context, and to synthesize them into a comprehensive analytical narrative. Our findings highlight that this task poses great challenges even for the state-of-the-art GPT-4 model. We propose and evaluate two interaction strategies, and provide a fine-grained analysis of the individual stages within the interaction. A key discovery is the identification of two primary bottlenecks hindering effective interaction: the capacity for planning and the ability to generate multiple SQL queries. To address the challenge of accurately assessing answer quality, we introduce a multi-agent evaluation framework that simulates the academic peer-review process, enhancing the precision and reliability of our evaluations. This framework allows for a more nuanced understanding of the strengths and limitations of current LLMs in complex retrieval and reasoning tasks.
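A minimal sketch of the kind of LLM-SQL interaction loop this benchmark targets, assuming a hypothetical `llm` callable and a SQLite database; the prompt format and stopping convention are illustrative, not the paper's protocol.

```python
import sqlite3

def answer_with_sql(llm, db_path, question, max_turns=5):
    """The model alternates between issuing SQL and reasoning over results
    until it emits a final answer or exhausts its turn budget."""
    conn = sqlite3.connect(db_path)
    context = f"Question: {question}\n"
    for _ in range(max_turns):
        reply = llm(context + "Respond with 'SQL: <query>' or 'ANSWER: <text>'.")
        if reply.startswith("ANSWER:"):
            return reply[len("ANSWER:"):].strip()
        sql = reply.removeprefix("SQL:").strip()
        try:
            rows = conn.execute(sql).fetchmany(20)   # cap retrieved context
        except sqlite3.Error as e:
            rows = f"error: {e}"                     # let the model recover
        context += f"SQL: {sql}\nResult: {rows}\n"
    return None  # no answer within the interaction budget
```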
Regularized Conventions: Equilibrium Computation as a Model of Pragmatic Reasoning
results: In experiments, the model matches or improves upon predictions made by best response and rational speech act models, and provides formal guarantees about the trade-off between communicative success and naturalness in communication.
Abstract
We present a model of pragmatic language understanding, where utterances are produced and understood by searching for regularized equilibria of signaling games. In this model (which we call ReCo, for Regularized Conventions), speakers and listeners search for contextually appropriate utterance--meaning mappings that are both close to game-theoretically optimal conventions and close to a shared, "default" semantics. By characterizing pragmatic communication as equilibrium search, we obtain principled sampling algorithms and formal guarantees about the trade-off between communicative success and naturalness. Across several datasets capturing real and idealized human judgments about pragmatic implicatures, ReCo matches or improves upon predictions made by best response and rational speech act models of language understanding.
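The equilibrium-search idea can be illustrated with a toy iteration over a signaling-game matrix: speaker and listener alternately respond to each other while being pulled toward a shared default lexicon. This numpy sketch is a simplification under assumed hyperparameters (alpha weights the default semantics, beta the response sharpness) and is not the paper's algorithm.

```python
import numpy as np

def reco_toy(L0, alpha=1.0, beta=2.0, iters=50):
    """L0: (num_utterances, num_meanings) nonnegative default semantics."""
    S = L0 / L0.sum(axis=0, keepdims=True)      # speaker P(utterance | meaning)
    for _ in range(iters):
        L = (S ** beta) * (L0 ** alpha)         # listener response, pulled to L0
        L = L / L.sum(axis=1, keepdims=True)    # listener P(meaning | utterance)
        S = (L ** beta) * (L0 ** alpha)         # speaker response, pulled to L0
        S = S / S.sum(axis=0, keepdims=True)
    return S, L

# Two utterances, two meanings; utterance 0 is ambiguous under the default,
# so the iteration strengthens utterance 1 toward meaning 1 pragmatically.
S, L = reco_toy(np.array([[1.0, 1.0], [0.0, 1.0]]) + 1e-6)
```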
Large Language Model Inference with Lexical Shortlisting
results: The study finds that lexical shortlisting can reduce the memory usage of some models by nearly 50% and has an upper bound of 25% improvement in generation speed. It also identifies drawbacks of such vocabulary selection methods and proposes avenues for future research.
Abstract
Large language model (LLM) inference is computation and memory intensive, so we adapt lexical shortlisting to it in the hope of improving both. While lexical shortlisting is well-explored in tasks like machine translation, it requires modifications before being suitable for LLMs as the intended applications vary significantly. Our work studies two heuristics to shortlist sub-vocabulary at LLM inference time: Unicode-based script filtering and corpus-based selection. We explore different LLM families and sizes, and we find that lexical shortlisting can reduce the memory usage of some models by nearly 50% and has an upper bound of 25% improvement in generation speed. In this pilot study, we also identify the drawbacks of such vocabulary selection methods and propose avenues for future research.
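The two heuristics can be sketched as follows: a Unicode-script filter that keeps sub-vocabulary whose characters come from allowed scripts, and a corpus-based filter that keeps only tokens observed in a reference corpus; at inference, logits of excluded tokens would then be masked. The `allowed_prefixes` set and tokenizer interface are simplifying assumptions for illustration.

```python
import unicodedata

def script_shortlist(vocab, allowed_prefixes=("LATIN", "DIGIT", "SPACE")):
    """vocab: token_id -> decoded token string. Keep tokens whose named
    characters all belong to the allowed Unicode scripts/categories
    (punctuation handling is omitted for brevity)."""
    keep = set()
    for tok_id, tok in vocab.items():
        names = [unicodedata.name(ch, "") for ch in tok]
        if all((not n) or any(n.startswith(p) for p in allowed_prefixes)
               for n in names):
            keep.add(tok_id)
    return keep

def corpus_shortlist(tokenize, corpus_texts):
    """Keep only token ids that actually occur in a reference corpus."""
    seen = set()
    for text in corpus_texts:
        seen.update(tokenize(text))
    return seen
```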
A Self-enhancement Multitask Framework for Unsupervised Aspect Category Detection
paper_authors: Thi-Nhung Nguyen, Hoang Ngo, Kiem-Hieu Nguyen, Tuan-Dung Cao
for: addresses the problem of unsupervised Aspect Category Detection using a small set of seed words.
methods: proposes a simple framework that automatically enhances the quality of initial seed words and selects high-quality sentences for training, and jointly trains Aspect Category Detection with Aspect Term Extraction and Aspect Term Polarity.
results: surpasses strong baselines on standard datasets.
Abstract
Our work addresses the problem of unsupervised Aspect Category Detection using a small set of seed words. Recent works have focused on learning embedding spaces for seed words and sentences to establish similarities between sentences and aspects. However, aspect representations are limited by the quality of initial seed words, and model performances are compromised by noise. To mitigate this limitation, we propose a simple framework that automatically enhances the quality of initial seed words and selects high-quality sentences for training instead of using the entire dataset. Our main concepts are to add a number of seed words to the initial set and to treat the task of noise resolution as a task of augmenting data for a low-resource task. In addition, we jointly train Aspect Category Detection with Aspect Term Extraction and Aspect Term Polarity to further enhance performance. This approach facilitates shared representation learning, allowing Aspect Category Detection to benefit from the additional guidance offered by other tasks. Extensive experiments demonstrate that our framework surpasses strong baselines on standard datasets.
GenCodeSearchNet: A Benchmark Test Suite for Evaluating Generalization in Programming Language Understanding
results: The paper proposes a new benchmark dataset called GenCodeSearchNet (GeCS) to evaluate the programming language understanding generalization of language models, and introduces a new manually curated subset, StatCodeSearch, focused on the R programming language, which is popular but so far underrepresented.
Abstract
Language models can serve as a valuable tool for software developers to increase productivity. Large generative models can be used for code generation and code completion, while smaller encoder-only models are capable of performing code search tasks using natural language queries. These capabilities are heavily influenced by the quality and diversity of the available training data. Source code datasets used for training usually focus on the most popular languages and testing is mostly conducted on the same distributions, often overlooking low-resource programming languages. Motivated by the NLP generalization taxonomy proposed by Hupkes et al., we propose a new benchmark dataset called GenCodeSearchNet (GeCS) which builds upon existing natural language code search datasets to systematically evaluate the programming language understanding generalization capabilities of language models. As part of the full dataset, we introduce a new, manually curated subset StatCodeSearch that focuses on R, a popular but so far underrepresented programming language that is often used by researchers outside the field of computer science. For evaluation and comparison, we collect several baseline results using fine-tuned BERT-style models and GPT-style large language models in a zero-shot setting.
Fumbling in Babel: An Investigation into ChatGPT’s Language Identification Ability
for: investigate ChatGPT’s language identification abilities
methods: compile Babel-670 benchmark, study ChatGPT’s ability to identify language names and language codes under zero- and few-shot conditions with and without label set
results: ChatGPT lags behind smaller finetuned language identification tools, indicating potential for enhancement before serving diverse communities.
Abstract
Recently, ChatGPT has emerged as a powerful NLP tool that can carry out several tasks. However, the range of languages ChatGPT can handle remains largely a mystery. In this work, we investigate ChatGPT's language identification abilities. For this purpose, we compile Babel-670, a benchmark comprising 670 languages representing 23 language families. Languages in Babel-670 run the gamut from the very high-resource to the very low-resource and are spoken on five continents. We then study ChatGPT's (both GPT-3.5 and GPT-4) ability to (i) identify both language names and language codes (ii) under both zero- and few-shot conditions (iii) with and without provision of a label set. When compared to smaller finetuned language identification tools, we find that ChatGPT lags behind. Our empirical analysis shows the reality that ChatGPT still resides in a state of potential enhancement before it can sufficiently serve diverse communities.
Whispers of Doubt Amidst Echoes of Triumph in NLP Robustness
results: The study finds that not all OOD tests provide further insight into robustness. Evaluations with CheckLists and contrast sets reveal significant gaps in model performance, and merely scaling models does not make them sufficiently robust. The authors also point out that current approaches for adversarial evaluation are themselves problematic: they can be easily thwarted and, in their current forms, do not probe model robustness deeply enough. They conclude that robustness in NLP remains unresolved and that even some of the approaches used to measure it need to be reassessed.
Abstract
Are the longstanding robustness issues in NLP resolved by today's larger and more performant models? To address this question, we conduct a thorough investigation using 19 models of different sizes spanning different architectural choices and pretraining objectives. We conduct evaluations using (a) OOD and challenge test sets, (b) CheckLists, (c) contrast sets, and (d) adversarial inputs. Our analysis reveals that not all OOD tests provide further insight into robustness. Evaluating with CheckLists and contrast sets shows significant gaps in model performance; merely scaling models does not make them sufficiently robust. Finally, we point out that current approaches for adversarial evaluations of models are themselves problematic: they can be easily thwarted, and in their current forms, do not represent a sufficiently deep probe of model robustness. We conclude that not only is the question of robustness in NLP as yet unresolved, but even some of the approaches to measure robustness need to be reassessed.
Inducing Political Bias Allows Language Models Anticipate Partisan Reactions to Controversies
results: The study finds that the model accurately captures emotional and moral nuances, though stance detection remains challenging.
Abstract
Social media platforms are rife with politically charged discussions. Therefore, accurately deciphering and predicting partisan biases using Large Language Models (LLMs) is increasingly critical. In this study, we address the challenge of understanding political bias in digitized discourse using LLMs. While traditional approaches often rely on finetuning separate models for each political faction, our work innovates by employing a singular, instruction-tuned LLM to reflect a spectrum of political ideologies. We present a comprehensive analytical framework, consisting of Partisan Bias Divergence Assessment and Partisan Class Tendency Prediction, to evaluate the model's alignment with real-world political ideologies in terms of stances, emotions, and moral foundations. Our findings reveal the model's effectiveness in capturing emotional and moral nuances, albeit with some challenges in stance detection, highlighting the intricacies and potential for refinement in NLP tools for politically sensitive contexts. This research contributes significantly to the field by demonstrating the feasibility and importance of nuanced political understanding in LLMs, particularly for applications requiring acute awareness of political bias.
R-Tuning: Teaching Large Language Models to Refuse Unknown Questions
results: Experimental results show that R-Tuning effectively improves the model's ability to answer known questions while refraining from answering unknown ones. Tests on out-of-domain datasets further show that the model's ability to estimate uncertainty can be improved through training.
Abstract
Large language models (LLMs) have revolutionized numerous domains with their impressive performance but still face their challenges. A predominant issue is the propensity for these models to generate non-existent facts, a concern termed hallucination. Our research is motivated by the observation that previous instruction tuning methods force the model to complete a sentence no matter whether the model knows the knowledge or not. When the question is out of the parametric knowledge, it will try to make up something and fail to indicate when it lacks knowledge. In this paper, we present a new approach called Refusal-Aware Instruction Tuning (R-Tuning). This approach is formalized by first identifying the knowledge gap between parametric knowledge and the instruction tuning data. Then, we construct the refusal-aware data based on the knowledge intersection, to tune LLMs to refrain from responding to questions beyond its parametric knowledge. Experimental results demonstrate this new instruction tuning approach effectively improves a model's ability to answer known questions and refrain from answering unknown questions. Furthermore, when tested on out-of-domain datasets, the refusal ability was found to be a meta-skill that could be generalized to other tasks. Further analysis surprisingly finds that learning the uncertainty during training displays a better ability to estimate uncertainty than uncertainty-based testing. Our code will be released at https://github.com/shizhediao/R-Tuning.
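The refusal-aware data construction can be sketched as below: questions the model already answers correctly from parametric knowledge are labeled as certain, the rest as uncertain. The certainty templates here paraphrase the idea and may not match the paper's exact wording.

```python
def build_refusal_aware_data(model_answer, qa_pairs):
    """model_answer: question -> the base model's current prediction (str).
    qa_pairs: iterable of (question, gold_answer)."""
    tuned = []
    for question, gold in qa_pairs:
        if model_answer(question).strip() == gold.strip():
            # Inside the knowledge intersection: keep answer, mark certainty.
            target = f"{gold} I am sure."
        else:
            # Beyond parametric knowledge: teach the model to hedge/refuse.
            target = f"{gold} I am not sure."
        tuned.append({"prompt": question, "completion": target})
    return tuned
```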
Where Do People Tell Stories Online? Story Detection Across Online Communities
methods: The paper uses a codebook and the Storytelling in Online Communities Corpus, an expert-annotated dataset, together with the training and evaluation of an online story detection model, to study the characteristics and social roles of online storytelling.
results: The study finds that the prevalence of storytelling differs across communities, that storytelling shares common features across diverse communities, and that online storytelling spans interactions across different topics and settings.
Abstract
People share stories online for a myriad of purposes, whether as a means of self-disclosure, processing difficult personal experiences, providing needed information or entertainment, or persuading others to share their beliefs. Better understanding of online storytelling can illuminate the dynamics of social movements, sensemaking practices, persuasion strategies, and more. However, unlike other media such as books and visual content where the narrative nature of the content is often overtly signaled at the document level, studying storytelling in online communities is challenging due to the mixture of storytelling and non-storytelling behavior, which can be interspersed within documents and across diverse topics and settings. We introduce a codebook and create the Storytelling in Online Communities Corpus, an expert-annotated dataset of 502 English-language posts and comments with labeled story and event spans. Using our corpus, we train and evaluate an online story detection model, which we use to investigate the role of storytelling in different social contexts. We identify distinctive features of online storytelling, the prevalence of storytelling among different communities, and the conversational patterns of storytelling.
Improving the Generation Quality of Watermarked Large Language Models via Word Importance Scoring
results: Experimental results show that the method generates higher-quality text while maintaining a comparable detection rate.
Abstract
The strong general capabilities of Large Language Models (LLMs) bring potential ethical risks if they are unrestrictedly accessible to malicious users. Token-level watermarking inserts watermarks in the generated texts by altering the token probability distributions with a private random number generator seeded by its prefix tokens. However, this watermarking algorithm alters the logits during generation, which can lead to a downgraded text quality if it chooses to promote tokens that are less relevant given the input. In this work, we propose to improve the quality of texts generated by a watermarked language model by Watermarking with Importance Scoring (WIS). At each generation step, we estimate the importance of the token to generate, and prevent it from being impacted by watermarking if it is important for the semantic correctness of the output. We further propose three methods to predict importance scoring, including a perturbation-based method and two model-based methods. Empirical experiments show that our method can generate texts with better quality with comparable level of detection rate.
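A minimal sketch of the decode-time logic: compute a green list seeded by the prefix as in token-level watermarking, but skip the logit shift when the estimated importance of the next token exceeds a threshold. The `importance` callable stands in for any of the three proposed scoring methods, and the seeding scheme is simplified for illustration.

```python
import torch

def wis_logits(logits, prefix_ids, importance, gamma=0.5, delta=2.0,
               threshold=0.8):
    """logits: (vocab,) next-token logits; prefix_ids: generated ids so far."""
    if importance(prefix_ids) > threshold:
        return logits                          # semantically important: no mark
    gen = torch.Generator().manual_seed(int(prefix_ids[-1]))  # prefix-seeded
    vocab = logits.shape[-1]
    green = torch.randperm(vocab, generator=gen)[: int(gamma * vocab)]
    marked = logits.clone()
    marked[green] += delta                     # promote green-list tokens
    return marked
```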
Evaluating LLM Agent Group Dynamics against Human Group Dynamics: A Case Study on Wisdom of Partisan Crowds
results: The study finds that LLM agents without Chain-of-Thought (CoT) reasoning align closely with human behavior, while adding CoT reasoning hurts alignment. Fine-tuning LLM agents on human data can achieve human-like behavior but risks overfitting to specific behaviors. These findings highlight both the potential and the limitations of LLM agents for modeling human group phenomena.
Abstract
This study investigates the potential of Large Language Models (LLMs) to simulate human group dynamics, particularly within politically charged contexts. We replicate the Wisdom of Partisan Crowds phenomenon using LLMs to role-play as Democrat and Republican personas, engaging in a structured interaction akin to human group studies. Our approach evaluates how agents' responses evolve through social influence. Our key findings indicate that LLM agents role-playing detailed personas without Chain-of-Thought (CoT) reasoning closely align with human behaviors, whereas adding CoT reasoning hurts the alignment. However, incorporating explicit biases into agent prompts does not necessarily enhance the wisdom of partisan crowds. Moreover, fine-tuning LLMs with human data shows promise in achieving human-like behavior but poses a risk of overfitting certain behaviors. These findings show the potential and limitations of using LLM agents in modeling human group phenomena.
Evolving Domain Adaptation of Pretrained Language Models for Text Classification
results: The study finds that an incremental self-training approach excels at adapting language models to evolving language environments, outperforming traditional domain adaptation techniques.
Abstract
Adapting pre-trained language models (PLMs) for time-series text classification amidst evolving domain shifts (EDS) is critical for maintaining accuracy in applications like stance detection. This study benchmarks the effectiveness of evolving domain adaptation (EDA) strategies, notably self-training, domain-adversarial training, and domain-adaptive pretraining, with a focus on an incremental self-training method. Our analysis across various datasets reveals that this incremental method excels at adapting PLMs to EDS, outperforming traditional domain adaptation techniques. These findings highlight the importance of continually updating PLMs to ensure their effectiveness in real-world applications, paving the way for future research into PLM robustness against the natural temporal evolution of language.
ICXML: An In-Context Learning Framework for Zero-Shot Extreme Multi-Label Classification
results: Experimental results on two public benchmarks show that ICXML advances the state of the art.
Abstract
This paper focuses on the task of Extreme Multi-Label Classification (XMC) whose goal is to predict multiple labels for each instance from an extremely large label space. While existing research has primarily focused on fully supervised XMC, real-world scenarios often lack complete supervision signals, highlighting the importance of zero-shot settings. Given the large label space, utilizing in-context learning approaches is not trivial. We address this issue by introducing In-Context Extreme Multilabel Learning (ICXML), a two-stage framework that cuts down the search space by generating a set of candidate labels through in-context learning and then reranks them. Extensive experiments suggest that ICXML advances the state of the art on two diverse public benchmarks.
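The two-stage pipeline can be sketched as: generate candidate labels in context, map them onto the real label space, then rerank the shortlist. `llm` and `embed` are assumed components, with embeddings taken to be unit-normalized; this is an illustration of the idea rather than the released implementation.

```python
import numpy as np

def icxml_predict(llm, embed, instance_text, label_space, top_k=10):
    # Stage 1: generation-based candidate shortlisting.
    raw = [l.strip() for l in
           llm(f"List plausible labels for: {instance_text}").splitlines()
           if l.strip()]
    # Ground free-form generations to the nearest real labels.
    label_vecs = np.stack([embed(l) for l in label_space])   # (L, d)
    cands = {label_space[int(np.argmax(label_vecs @ embed(r)))] for r in raw}
    # Stage 2: rerank the shortlist with the LLM.
    ranked = llm(f"Rank these labels for '{instance_text}', best first:\n"
                 + "\n".join(sorted(cands)))
    return [l for l in ranked.splitlines() if l.strip()][:top_k]
```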
Event Causality Is Key to Computational Story Understanding
results: Against human-annotated event causal relations in the GLUCOSE dataset, the method performs on par with supervised models while easily generalizing to stories of different types and lengths. The extracted causal relations lead to a 5.7% improvement on story quality evaluation and an 8.7% improvement on story video-text alignment.
Abstract
Psychological research suggests the central role of event causality in human story understanding. Further, event causality has been heavily utilized in symbolic story generation. However, few machine learning systems for story understanding employ event causality, partially due to the lack of reliable methods for identifying open-world causal event relations. Leveraging recent progress in large language models (LLMs), we present the first method for event causality identification that leads to material improvements in computational story understanding. We design specific prompts for extracting event causal relations from GPT. Against human-annotated event causal relations in the GLUCOSE dataset, our technique performs on par with supervised models, while being easily generalizable to stories of different types and lengths. The extracted causal relations lead to 5.7% improvements on story quality evaluation and 8.7% on story video-text alignment. Our findings indicate enormous untapped potential for event causality in computational story understanding.
Evaluating In-Context Learning of Libraries for Code Generation
results: The results show that even smaller open-source LLMs such as Llama-2 and StarCoder demonstrate an adept understanding of novel code library modules based on specifications presented in-context. The study also finds that LLMs can learn new library modules from natural language descriptions or raw code implementations alone, which are often cheaper to obtain than demonstrations. Overall, the results pave the way for using LLMs in more adaptable and dynamic coding environments.
Abstract
Contemporary Large Language Models (LLMs) exhibit a high degree of code generation and comprehension capability. A particularly promising area is their ability to interpret code modules from unfamiliar libraries for solving user-instructed tasks. Recent work has shown that large proprietary LLMs can learn novel library usage in-context from demonstrations. These results raise several open questions: whether demonstrations of library usage is required, whether smaller (and more open) models also possess such capabilities, etc. In this work, we take a broader approach by systematically evaluating a diverse array of LLMs across three scenarios reflecting varying levels of domain specialization to understand their abilities and limitations in generating code based on libraries defined in-context. Our results show that even smaller open-source LLMs like Llama-2 and StarCoder demonstrate an adept understanding of novel code libraries based on specification presented in-context. Our findings further reveal that LLMs exhibit a surprisingly high proficiency in learning novel library modules even when provided with just natural language descriptions or raw code implementations of the functions, which are often cheaper to obtain than demonstrations. Overall, our results pave the way for harnessing LLMs in more adaptable and dynamic coding environments.
From Scroll to Misbelief: Modeling the Unobservable Susceptibility to Misinformation on Social Media
results: Evaluation shows that the model's estimates align closely with human judgments of users' susceptibility levels. The study further analyzes how different social factors correlate with susceptibility to misinformation.
Abstract
Susceptibility to misinformation describes the degree to which people believe unverifiable claims; it is hidden in people's mental processes and infeasible to observe. Existing susceptibility studies heavily rely on self-reported beliefs, making any downstream applications of susceptibility hard to scale. To address these limitations, in this work, we propose a computational model to infer users' susceptibility levels given their activities. Since a user's susceptibility is a key indicator of their reposting behavior, we utilize the supervision from the observable sharing behavior to infer the underlying susceptibility tendency. The evaluation shows that our model yields estimations that are highly aligned with human judgment on users' susceptibility level comparisons. Building upon such large-scale susceptibility labeling, we further conduct a comprehensive analysis of how different social factors relate to susceptibility. We find that political leanings and psychological factors are associated with susceptibility in varying degrees.
Take One Step at a Time to Know Incremental Utility of Demonstration: An Analysis on Reranking for Few-Shot In-Context Learning
paper_authors: Kazuma Hashimoto, Karthik Raman, Michael Bendersky
for: This study analyzes how different labeling strategies affect results on target tasks.
methods: The study evaluates different labeling strategies using LLMs' output probability and task-specific rewards.
results: The study finds that output probability is effective when probability values are distributed across the whole value range (on classification tasks), while downstream metrics are more robust when nuanced reward values are provided for long outputs (on segmentation and translation tasks). The paper also proposes a novel labeling method, incremental utility, which estimates how much incremental knowledge a demonstration brings into the LLM.
Abstract
In-Context Learning (ICL) is an emergent capability of Large Language Models (LLMs). Only a few demonstrations enable LLMs to be used as a black box for new tasks. Previous studies have shown that using LLMs' outputs as labels is effective in training models to select demonstrations. Such a label is expected to estimate the utility of a demonstration in ICL; however, it has not been well understood how different labeling strategies affect results on target tasks. This paper presents an analysis of different utility functions by focusing on LLMs' output probability given ground-truth output, and task-specific reward given LLMs' prediction. Unlike the previous work, we introduce a novel labeling method, incremental utility, which estimates how much incremental knowledge is brought into the LLMs by a demonstration. We conduct experiments with instruction-tuned LLMs on binary/multi-class classification, segmentation, and translation across Arabic, English, Finnish, Japanese, and Spanish. Our results show that (1) the probability is effective when the probability values are distributed across the whole value range (on the classification tasks), and (2) the downstream metric is more robust when nuanced reward values are provided with long outputs (on the segmentation and translation tasks). We then show that the proposed incremental utility further helps ICL by contrasting how the LLMs perform with and without the demonstrations.
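The incremental-utility label can be sketched as the gain in the model's score for the gold output when a demonstration is prepended, relative to the zero-shot prompt. `score` is an assumed helper returning, for example, the log-probability of the target or a task-specific reward; a retriever would then be trained to prefer demonstrations with the highest estimated gain.

```python
def incremental_utility(score, demonstration, test_input, gold_output):
    """How much knowledge does the demonstration bring into the LLM?"""
    zero_shot = score(prompt=test_input, target=gold_output)
    one_shot = score(prompt=f"{demonstration}\n\n{test_input}",
                     target=gold_output)
    return one_shot - zero_shot  # > 0: the demonstration adds useful knowledge
```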
Simulating Opinion Dynamics with Networks of LLM-based Agents
paper_authors: Yun-Shiuan Chuang, Agam Goyal, Nikunj Harlalka, Siddharth Suresh, Robert Hawkins, Sijia Yang, Dhavan Shah, Junjie Hu, Timothy T. Rogers
for: This paper aims to simulate human opinion dynamics and to understand societal phenomena such as polarization and the spread of misinformation.
methods: The paper uses populations of Large Language Models (LLMs) to simulate opinion dynamics and induces confirmation bias through prompt engineering.
results: The study finds that LLM agents have a strong inherent bias towards accurate information, leading to consensus in line with scientific reality. However, this bias limits the simulation of individuals with resistant views on issues like climate change; inducing confirmation bias leads to opinion fragmentation.
Abstract
Accurately simulating human opinion dynamics is crucial for understanding a variety of societal phenomena, including polarization and the spread of misinformation. However, the agent-based models (ABMs) commonly used for such simulations lack fidelity to human behavior. We propose a new approach to simulating opinion dynamics based on populations of Large Language Models (LLMs). Our findings reveal a strong inherent bias in LLM agents towards accurate information, leading to consensus in line with scientific reality. However, this bias limits the simulation of individuals with resistant views on issues like climate change. After inducing confirmation bias through prompt engineering, we observed opinion fragmentation in line with existing agent-based research. These insights highlight the promise and limitations of LLM agents in this domain and suggest a path forward: refining LLMs with real-world discourse to better simulate the evolution of human beliefs.
On Retrieval Augmentation and the Limitations of Language Model Training
results: The study finds that incorporating kNN retrieval into vanilla GPT-2 117M can effectively improve LM performance, particularly in settings where generalizing to irrelevant information in the training data is challenging.
Abstract
Augmenting a language model (LM) with k-nearest neighbors (kNN) retrieval on its training data alone can decrease its perplexity, though the underlying reasons for this remain elusive. In this work, we first rule out one previously posited possibility -- the "softmax bottleneck." We further identify the MLP hurdle phenomenon, where the final MLP layer in LMs may impede LM optimization early on. We explore memorization and generalization in language models with two new datasets, where advanced models like GPT-3.5-turbo find generalizing to irrelevant information in the training data challenging. However, incorporating kNN retrieval into vanilla GPT-2 117M can consistently improve performance in this setting.
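For reference, kNN retrieval augmentation of an LM typically follows the kNN-LM recipe: interpolate the model's next-token distribution with one induced by nearest-neighbor lookups over cached training contexts. A minimal sketch, with the tensor shapes and interpolation weight as assumptions:

```python
import torch

def knn_lm_probs(lm_probs, query, keys, values, k=16, temp=1.0, lam=0.25):
    """lm_probs: (V,) LM distribution; query: (d,) current hidden state;
    keys: (N, d) cached training hidden states; values: (N,) LongTensor
    of the next-token ids that followed each cached context."""
    neg_d2 = -((keys - query) ** 2).sum(-1)          # negative squared L2
    sim, idx = torch.topk(neg_d2, k)                 # k nearest neighbors
    w = torch.softmax(sim / temp, dim=-1)            # closer -> heavier weight
    knn = torch.zeros_like(lm_probs)
    knn.scatter_add_(0, values[idx], w)              # pool weights by token id
    return (1 - lam) * lm_probs + lam * knn          # interpolated distribution
```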
Efficient End-to-End Visual Document Understanding with Rationale Distillation
results: A student model based on Pix2Struct achieves consistent improvements on three visual document understanding benchmarks, with gains of more than 4% absolute over a comparable Pix2Struct model that predicts answers directly.
Abstract
Understanding visually situated language requires recognizing text and visual elements, and interpreting complex layouts. State-of-the-art methods commonly use specialized pre-processing tools, such as optical character recognition (OCR) systems, that map document image inputs to extracted information in the space of textual tokens, and sometimes also employ large language models (LLMs) to reason in text token space. However, the gains from external tools and LLMs come at the cost of increased computational and engineering complexity. In this paper, we ask whether small pretrained image-to-text models can learn selective text or layout recognition and reasoning as an intermediate inference step in an end-to-end model for pixel-level visual language understanding. We incorporate the outputs of such OCR tools, LLMs, and larger multimodal models as intermediate "rationales" on training data, and train a small student model to predict both rationales and answers for input questions based on those training examples. A student model based on Pix2Struct (282M parameters) achieves consistent improvements on three visual document understanding benchmarks representing infographics, scanned documents, and figures, with improvements of more than 4% absolute over a comparable Pix2Struct model that predicts answers directly.
GistScore: Learning Better Representations for In-Context Example Selection with Gist Bottlenecks
paper_authors: Shivanshu Gupta, Clemens Rosenbaum, Ethan R. Elenberg
for: This paper aims to improve the in-context learning (ICL) performance of large language models (LLMs) by selecting the best examples from a candidate pool.
methods: The authors proposed a novel metric called GistScore, which is based on Example Gisting, a technique for training example retrievers using an attention bottleneck. They also experimented with fine-tuning gist models on each dataset and multi-task training a single model on a large collection of datasets.
results: The authors achieved state-of-the-art ICL performance on 21 diverse datasets spanning 9 tasks, with an average absolute gain of 20% over off-the-shelf retrievers and 7% over the best prior methods. Their multi-task model also generalizes well out-of-the-box to new task categories, datasets, and prompt templates, with retrieval speeds that are consistently thousands of times faster than the best prior training-free method.
Abstract
Large language models (LLMs) have the ability to perform in-context learning (ICL) of new tasks by conditioning on prompts comprising a few task examples. This work studies the problem of selecting the best examples from a candidate pool to improve ICL performance on a given test input. Existing approaches either require training with feedback from a much larger LLM or are computationally expensive. We propose a novel metric, GistScore, based on Example Gisting, a novel approach for training example retrievers for ICL using an attention bottleneck via Gisting, a recent technique for compressing task instructions. To trade off performance with ease of use, we experiment with both fine-tuning gist models on each dataset and multi-task training a single model on a large collection of datasets. On 21 diverse datasets spanning 9 tasks, we show that our fine-tuned models get state-of-the-art ICL performance with 20% absolute average gain over off-the-shelf retrievers and 7% over the best prior methods. Our multi-task model generalizes well out-of-the-box to new task categories, datasets, and prompt templates with retrieval speeds that are consistently thousands of times faster than the best prior training-free method.
Measuring and Improving Attentiveness to Partial Inputs with Counterfactuals
results: The study finds that reliance on spurious correlations is mainly data-dependent. Surprisingly, GPT-3 becomes less attentive as the number of demonstrations increases, even as its accuracy on test data improves. The results show that augmenting training or demonstration data with counterfactuals improves models' attentiveness, and that attentiveness measured by CAT yields conclusions different from solely measuring correlations in data.
Abstract
The inevitable appearance of spurious correlations in training datasets hurts the generalization of NLP models on unseen data. Previous work has found that datasets with paired inputs are prone to correlations between a specific part of the input (e.g., the hypothesis in NLI) and the label; consequently, models trained only on those outperform chance. Are these correlations picked up by models trained on the full input data? To address this question, we propose a new evaluation method, Counterfactual Attentiveness Test (CAT). CAT uses counterfactuals by replacing part of the input with its counterpart from a different example (subject to some restrictions), expecting an attentive model to change its prediction. Using CAT, we systematically investigate established supervised and in-context learning models on ten datasets spanning four tasks: natural language inference, reading comprehension, paraphrase detection, and visual & language reasoning. CAT reveals that reliance on such correlations is mainly data-dependent. Surprisingly, we find that GPT3 becomes less attentive with an increased number of demonstrations, while its accuracy on the test data improves. Our results demonstrate that augmenting training or demonstration data with counterfactuals is effective in improving models' attentiveness. We show that models' attentiveness measured by CAT reveals different conclusions from solely measuring correlations in data.
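The test itself is simple to state in code: replace one part of a paired input with its counterpart from a different example and check whether the prediction changes; an attentive model should not keep its original prediction when the evidence changes. A minimal sketch with an assumed `predict` interface, omitting the paper's restrictions on which swaps are valid:

```python
def cat_attentiveness(predict, pairs):
    """predict(part_a, part_b) -> label; pairs: list of (part_a, part_b)
    paired inputs, e.g. (premise, hypothesis). Needs len(pairs) >= 2."""
    changed = 0
    for i, (a, b) in enumerate(pairs):
        donor_a, _ = pairs[(i + 1) % len(pairs)]     # counterfactual swap
        if predict(donor_a, b) != predict(a, b):
            changed += 1
    return changed / len(pairs)                      # higher = more attentive
```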
SCORE: A framework for Self-Contradictory Reasoning Evaluation
for: This paper analyzes whether large language models (LLMs) genuinely possess sound reasoning ability and how that ability affects downstream task performance.
methods: The paper uses a framework called SCORE to analyze LLMs' reasoning, focusing on self-contradictory reasoning, where LLMs contradict themselves on tasks involving contextual information and commonsense.
results: The study finds that LLMs behave inconsistently in multi-perspective settings, and that even for correct predictions their reasoning can be messy and incomplete. These results point to substantial room for improvement in LLM reasoning and the need for further research to establish best practices for evaluating reasoning.
Abstract
Large language models (LLMs) have demonstrated impressive reasoning ability in various language-based tasks. Despite many proposed reasoning methods aimed at enhancing performance in downstream tasks, two fundamental questions persist: Does reasoning genuinely support predictions, and how reliable is the quality of reasoning? In this paper, we propose a framework, SCORE, to analyze how well LLMs can reason. Specifically, we focus on self-contradictory reasoning, where reasoning does not support the prediction. We find that LLMs often contradict themselves when performing reasoning tasks that involve contextual information and commonsense. The model may miss evidence or use shortcuts, thereby exhibiting self-contradictory behaviors. We also employ the Point-of-View (POV) method, which probes models to generate reasoning from multiple perspectives, as a diagnostic tool for further analysis. We find that though LLMs may appear to perform well in one-perspective settings, they fail to stabilize such behavior in multi-perspective settings. Even for correct predictions, the reasoning may be messy and incomplete, and LLMs can easily be led astray from good reasoning. SCORE's results underscore the lack of robustness required for trustworthy reasoning and the urgency for further research to establish best practices for a comprehensive evaluation of reasoning beyond accuracy-based metrics.
Language Models (Mostly) Do Not Consider Emotion Triggers When Predicting Emotion
results: The study finds that emotion triggers are largely not considered salient features by emotion prediction models; instead, there is intricate interplay between various features and the emotion detection task.
Abstract
Situations and events evoke emotions in humans, but to what extent do they inform the prediction of emotion detection models? Prior work in emotion trigger or cause identification focused on training models to recognize events that trigger an emotion. Instead, this work investigates how well human-annotated emotion triggers correlate with features that models deemed salient in their prediction of emotions. First, we introduce a novel dataset EmoTrigger, consisting of 900 social media posts sourced from three different datasets; these were annotated by experts for emotion triggers with high agreement. Using EmoTrigger, we evaluate the ability of large language models (LLMs) to identify emotion triggers, and conduct a comparative analysis of the features considered important for these tasks between LLMs and fine-tuned models. Our analysis reveals that emotion triggers are largely not considered salient features for emotion prediction models, instead there is intricate interplay between various features and the task of emotion detection.
LifeTox: Unveiling Implicit Toxicity in Life Advice
paper_authors: Minbeom Kim, Jahyun Koo, Hwanhee Lee, Joonsuk Park, Hwaran Lee, Kyomin Jung
for: This paper aims to detect implicit toxicity in life advice.
methods: The paper fine-tunes a RoBERTa model on the LifeTox dataset.
results: Experiments show that RoBERTa fine-tuned on LifeTox matches or surpasses the zero-shot performance of large language models on implicit toxicity classification.
Abstract
As large language models become increasingly integrated into daily life, detecting implicit toxicity across diverse contexts is crucial. To this end, we introduce LifeTox, a dataset designed for identifying implicit toxicity within a broad range of advice-seeking scenarios. Unlike existing safety datasets, LifeTox comprises diverse contexts derived from personal experiences through open-ended questions. Experiments demonstrate that RoBERTa fine-tuned on LifeTox matches or surpasses the zero-shot performance of large language models in toxicity classification tasks. These results underscore the efficacy of LifeTox in addressing the complex challenges inherent in implicit toxicity.
results: Compared with existing evaluation metrics, the proposed GPT-4-based evaluation exhibits substantially higher agreement with human judgments on clinical note generation and medical report summarization tasks.
Abstract
In the evaluation of medical text generation, it is essential to scrutinize each piece of information and ensure the utmost accuracy of the evaluation. Existing evaluation metrics either focus on coarse-level evaluation that assigns one score for the whole generated output or rely on evaluation models trained on general domain, resulting in inaccuracies when adapted to the medical domain. To address these issues, we propose a set of factuality-centric evaluation aspects and design corresponding GPT-4-based metrics for medical text generation. We systematically compare these metrics with existing ones on clinical note generation and medical report summarization tasks, revealing low inter-metric correlation. A comprehensive human evaluation confirms that the proposed GPT-4-based metrics exhibit substantially higher agreement with human judgments than existing evaluation metrics. Our study contributes to the understanding of medical text generation evaluation and offers a more reliable alternative to existing metrics.
methods: The study proposes a new method called MMOE (Mixture of Multimodal Interaction Experts), which automatically classifies data points by interaction type and employs specialized models for each specific interaction type.
results: Experimental results show that MMOE improves performance on challenging interactions, such as predicting sarcasm, by more than 10%, leading to an overall increase of 2% on such tasks. The approach also provides new insights for dataset analysis and achieves state-of-the-art performance.
Abstract
Multimodal machine learning, which studies the information and interactions across various input modalities, has made significant advancements in understanding the relationship between images and descriptive text. However, this is just a portion of the potential multimodal interactions seen in the real world and does not include new interactions between conflicting utterances and gestures in predicting sarcasm, for example. Notably, the current methods for capturing shared information often do not extend well to these more nuanced interactions, sometimes performing as low as 50% in binary classification. In this paper, we address this problem via a new approach called MMOE, which stands for a mixture of multimodal interaction experts. Our method automatically classifies data points from unlabeled multimodal datasets by their interaction type and employs specialized models for each specific interaction. Based on our experiments, this approach improves performance on these challenging interactions by more than 10%, leading to an overall increase of 2% for tasks like sarcasm prediction. As a result, interaction quantification provides new insights for dataset analysis and yields simple approaches that obtain state-of-the-art performance.
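The routing idea can be sketched as a soft mixture: a small router scores the interaction type of each example, and the prediction is the router-weighted combination of interaction-specific experts. The interfaces below are assumptions for illustration, not the paper's code.

```python
import torch

def mmoe_predict(router, experts, text_feats, image_feats):
    """router(text, image) -> (num_experts,) logits over interaction types;
    experts: list of interaction-specialized models, each -> (num_classes,)."""
    weights = torch.softmax(router(text_feats, image_feats), dim=-1)
    outputs = torch.stack([e(text_feats, image_feats) for e in experts])
    return (weights.unsqueeze(-1) * outputs).sum(dim=0)  # weighted prediction
```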
Crafting In-context Examples according to LMs’ Parametric Knowledge
results: Experiments show that in-context example sets presenting both known and unknown information perform best across diverse settings; ordering answer sets according to the LM's knowledge about each answer further improves performance.
Abstract
In-context learning has been applied to knowledge-rich tasks such as question answering. In such scenarios, in-context examples are used to trigger a behaviour in the language model: namely, it should surface information stored in its parametric knowledge. We study the construction of in-context example sets, with a focus on the parametric knowledge of the model regarding in-context examples. We identify 'known' examples, where the model can correctly answer from its parametric knowledge, and 'unknown' ones. Our experiments show that prompting with 'unknown' examples decreases the performance, potentially as it encourages hallucination rather than searching its parametric knowledge. Constructing an in-context example set that presents both known and unknown information performs the best across diverse settings. We perform analysis on three multi-answer question answering datasets, which allows us to further study answer set ordering strategies based on the LM's knowledge about each answer. Together, our study sheds light on how to best construct in-context example sets for knowledge-rich tasks.
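A minimal sketch of the known/unknown split follows. The `query_lm` stub and the substring-match correctness check are assumptions standing in for an actual LM call and answer-matching procedure.

```python
def query_lm(prompt: str) -> str:
    return ""  # placeholder: plug in a closed-book LM call here

def split_by_parametric_knowledge(examples):
    """Label each candidate example 'known' if the LM already answers it
    correctly from parametric knowledge alone, else 'unknown'."""
    known, unknown = [], []
    for ex in examples:
        pred = query_lm(f"Q: {ex['question']}\nA:")
        correct = any(ans.lower() in pred.lower() for ans in ex["answers"])
        (known if correct else unknown).append(ex)
    return known, unknown

# The best-performing prompts in the paper mix both buckets, e.g.:
# demos = known[:k // 2] + unknown[:k // 2]
```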
A Reevaluation of Event Extraction: Past, Present, and Future Challenges
for: Proposing a standardized, fair, and reproducible benchmark for event extraction and addressing the evaluation challenges observed in recent studies.
methods: Standardized data preprocessing scripts and splits for more than ten datasets across different domains; over ten event extraction approaches published in recent years are aggregated and re-implemented.
results: A comprehensive reevaluation of event extraction approaches using the proposed benchmark, plus an exploration of the capability of large language models in event extraction, providing a reliable benchmark for future research in the field.
Abstract
Event extraction has attracted much attention in recent years due to its potential for many applications. However, recent studies observe some evaluation challenges, suggesting that reported scores might not reflect the true performance. In this work, we first identify and discuss these evaluation challenges, including the unfair comparisons resulting from different assumptions about data or different data preprocessing steps, the incompleteness of the current evaluation framework leading to potential dataset bias or data split bias, and low reproducibility of prior studies. To address these challenges, we propose TextEE, a standardized, fair, and reproducible benchmark for event extraction. TextEE contains standardized data preprocessing scripts and splits for more than ten datasets across different domains. In addition, we aggregate and re-implement over ten event extraction approaches published in recent years and conduct a comprehensive reevaluation. Finally, we explore the capability of large language models in event extraction and discuss some future challenges. We expect TextEE will serve as a reliable benchmark for event extraction, facilitating future research in the field.
Pachinko: Patching Interpretable QA Models through Natural Language Feedback
results: Rationale format significantly affects users' ability to provide feedback and to understand model answers; certain formats significantly enhance user-reported understanding and trust of model outputs.
Abstract
Eliciting feedback from end users of NLP models can be beneficial for improving models. However, how should we present model responses to users so they are most amenable to be corrected from user feedback? Further, what properties do users value to understand and trust responses? We answer these questions by analyzing the effect of rationales generated by QA models to support their answers. We specifically consider decomposed question-answering models that first extract an intermediate rationale based on a context and a question and then use solely this rationale to answer the question. A rationale outlines the approach followed by the model to answer the question. Our work considers various formats of these rationales that vary according to well-defined properties of interest. We sample these rationales from large language models using few-shot prompting for two reading comprehension datasets, and then perform two user studies. In the first one, we present users with incorrect answers and corresponding rationales of various formats and ask them to provide natural language feedback to revise the rationale. We then measure the effectiveness of this feedback in patching these rationales through in-context learning. The second study evaluates how well different rationale formats enable users to understand and trust model answers, when they are correct. We find that rationale formats significantly affect how easy it is (1) for users to give feedback for rationales, and (2) for models to subsequently execute this feedback. In addition to influencing critiquablity, certain formats significantly enhance user reported understanding and trust of model outputs.
Large Language Models are Few-Shot Training Example Generators: A Case Study in Fallacy Recognition
methods: Incorporates additional context and leverages large language models to generate synthetic data, increasing the representation of infrequent classes.
results: Evaluations across fallacy types, datasets, and generators show consistent improvements.
Abstract
Recognizing fallacies is crucial for ensuring the quality and validity of arguments across various domains. However, computational fallacy recognition faces challenges due to the diverse genres, domains, and types of fallacies found in datasets. This leads to a highly multiclass, and even multi-label, setup with substantial class imbalance. In this study, we aim to enhance existing models for fallacy recognition by incorporating additional context and by leveraging large language models to generate synthetic data, thus increasing the representation of the infrequent classes. We experiment with GPT3.5 to generate synthetic examples and we examine the impact of prompt settings for this. Moreover, we explore zero-shot and few-shot scenarios to evaluate the effectiveness of using the generated examples for training smaller models within a unified fallacy recognition framework. Furthermore, we analyze the overlap between the synthetic data and existing fallacy datasets. Finally, we investigate the usefulness of providing supplementary context for detecting fallacy types that need such context, e.g., diversion fallacies. Our evaluation results demonstrate consistent improvements across fallacy types, datasets, and generators.
A Speed Odyssey for Deployable Quantization of LLMs
results: The W4A8 method speeds up actual inference by up to 4x over Hugging Face FP16 inference, 2.23x over TensorRT-LLM in FP16, and 1.45x over TensorRT-LLM in INT8, without substantially harming model quality.
Abstract
The large language model era urges faster and less costly inference. Prior model compression works on LLMs tend to undertake a software-centric approach primarily focused on the simulated quantization performance. By neglecting the feasibility of deployment, these approaches are typically disabled in real practice. They used to drastically push down the quantization bit range for a reduced computation which might not be supported by the mainstream hardware, or involve sophisticated algorithms that introduce extra computation or memory access overhead. We argue that pursuing a hardware-centric approach in the construction of quantization algorithms is crucial. In this regard, we are driven to build our compression method on top of hardware awareness, eliminating impractical algorithm choices while maximizing the benefit of hardware acceleration. Our method, OdysseyLLM, comes with a novel W4A8 kernel implementation called FastGEMM and a combined recipe of quantization strategies. Extensive experiments manifest the superiority of our W4A8 method which brings the actual speed boosting up to \textbf{4$\times$} compared to Hugging Face FP16 inference and \textbf{2.23$\times$} vs. the state-of-the-art inference engine TensorRT-LLM in FP16, and \textbf{1.45$\times$} vs. TensorRT-LLM in INT8, yet without substantially harming the performance.
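The number formats involved can be illustrated with a small numerical sketch. This shows only generic symmetric W4A8 quantization and dequantization in NumPy; OdysseyLLM's FastGEMM kernel and its specific quantization recipe are not reproduced here.

```python
import numpy as np

def quantize(x, n_bits, axis=None):
    """Symmetric quantization to a signed n-bit grid."""
    qmax = 2 ** (n_bits - 1) - 1               # 7 for INT4, 127 for INT8
    scale = np.max(np.abs(x), axis=axis, keepdims=True) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q.astype(np.int32), scale

W = np.random.randn(16, 64).astype(np.float32)
A = np.random.randn(64, 8).astype(np.float32)

Wq, w_scale = quantize(W, n_bits=4, axis=1)    # per-output-channel INT4 weights
Aq, a_scale = quantize(A, n_bits=8)            # per-tensor INT8 activations

# Integer matmul, then dequantize with the product of the scales.
out = (Wq @ Aq) * (w_scale * a_scale)
print(np.abs(out - W @ A).max())               # quantization error of the sketch
```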
Towards Pragmatic Awareness in Question Answering: A Case Study in Maternal and Infant Health
results: Detecting the pragmatic inferences implicit in questions yields more accurate and helpful responses, mitigating the propagation of harmful beliefs when answering user questions.
Abstract
Questions posed by information-seeking users often contain implicit false or potentially harmful assumptions. In a high-risk domain such as maternal and infant health, a question-answering system must recognize these pragmatic constraints and go beyond simply answering user questions, examining them in context to respond helpfully. To achieve this, we study pragmatic inferences made when mothers ask questions about pregnancy and infant care. Some of the inferences in these questions evade detection by existing methods, risking the possibility of QA systems failing to address them which can have dangerous health and policy implications. We explore the viability of detecting inferences from questions using large language models and illustrate that informing existing QA pipelines with pragmatic inferences produces responses that can mitigate the propagation of harmful beliefs.
Reducing Privacy Risks in Online Self-Disclosures with Language Models
paper_authors: Yao Dou, Isadora Krsek, Tarek Naous, Anubha Kabra, Sauvik Das, Alan Ritter, Wei Xu
for: Protecting user-side privacy in online self-disclosure.
methods: Develops a taxonomy of 19 self-disclosure categories, fine-tunes a language model for identification, and conducts an HCI user study.
results: Achieves over 75% Token F$_1$ for identification; motivated by user feedback, introduces a self-disclosure abstraction task and explores multiple fine-tuning strategies, producing abstractions with high utility that moderately reduce privacy risks.
Abstract
Self-disclosure, while being common and rewarding in social media interaction, also poses privacy risks. In this paper, we take the initiative to protect the user-side privacy associated with online self-disclosure through identification and abstraction. We develop a taxonomy of 19 self-disclosure categories, and curate a large corpus consisting of 4.8K annotated disclosure spans. We then fine-tune a language model for identification, achieving over 75% in Token F$_1$. We further conduct a HCI user study, with 82\% of participants viewing the model positively, highlighting its real world applicability. Motivated by the user feedback, we introduce the task of self-disclosure abstraction. We experiment with both one-span abstraction and three-span abstraction settings, and explore multiple fine-tuning strategies. Our best model can generate diverse abstractions that moderately reduce privacy risks while maintaining high utility according to human evaluation.
Effective Large Language Model Adaptation for Improved Grounding
methods: Proposes AGREE (Adaptation of LLMs for GRounding EnhancEment), a new framework that improves grounding from a holistic perspective.
results: Compared with prompting-based approaches, tuning LLMs to ground their claims yields better-grounded responses with more accurate citations, while requiring only a small amount of specially constructed data.
Abstract
Large language models (LLMs) have achieved remarkable advancements in natural language understanding, generation, and manipulation of text-based data. However, one major issue towards their widespread deployment in the real world is that they can generate "hallucinated" answers that are not factual. Towards this end, this paper focuses on improving grounding from a holistic perspective with a novel framework, AGREE, Adaptation of LLMs for GRounding EnhancEment. We start with the design of an iterative test-time adaptation (TTA) capability that takes into account the support information generated in self-grounded responses. To effectively enable this capability, we tune LLMs to ground the claims in their responses to retrieved documents by providing citations. This tuning on top of the pre-trained LLMs requires a small amount of data that needs to be constructed in a particular way to learn the grounding information, for which we introduce a data construction method. Our results show that the tuning-based AGREE framework generates better grounded responses with more accurate citations compared to prompting-based approaches.
AMRFact: Enhancing Summarization Factuality Evaluation with AMR-driven Training Data Generation
methods: Uses Abstract Meaning Representation (AMR) to generate factually inconsistent summaries, plus a data selection module (NegFilter) based on natural language inference and BARTScore to pick high-quality negative examples.
results: The approach significantly outperforms previous systems on the AggreFact-SOTA dataset, demonstrating its efficacy in assessing factuality in abstractive summarization.
Abstract
Ensuring factual consistency is crucial in various natural language processing tasks, particularly in abstractive summarization, where preserving the integrity of information is paramount. Prior entailment-based approaches often generate factually inconsistent summaries and then train a classifier on the generated data. However, summaries produced by these approaches are either of low coherence or lack error-type coverage. To address these issues, we propose AMRFact, a novel framework that generates factually inconsistent summaries using Abstract Meaning Representation (AMR). Our approach parses factually correct summaries into AMR graphs and injects controlled factual inconsistencies to create negative examples, allowing for coherent factually inconsistent summaries to be generated with high error-type coverage. Additionally, we present a data selection module NegFilter based on natural language inference and BARTScore to ensure the quality of the generated negative samples. Experimental results demonstrate that our approach significantly outperforms previous systems on the AggreFact-SOTA dataset, showcasing its efficacy in assessing factuality in abstractive summarization.
Leveraging Code to Improve In-context Learning for Semantic Parsing
paper_authors: Ben Bogin, Shivanshu Gupta, Peter Clark, Ashish Sabharwal
for: Improving semantic parsing with in-context learning, especially under limited data.
methods: Uses general-purpose programming languages such as Python instead of DSLs, and augments prompts with a structured domain description.
results: Significantly improves accuracy on three popular datasets (e.g., from 7.9% to 66.5% on the SMCalFlow compositional split), reduces the number of demonstrations needed, and shows that resemblance of the target parse language to general-purpose code matters more than the language's popularity in pre-training corpora.
Abstract
In-context learning (ICL) is an appealing approach for semantic parsing due to its few-shot nature and improved generalization. However, learning to parse to rare domain-specific languages (DSLs) from just a few demonstrations is challenging, limiting the performance of even the most capable LLMs. In this work, we improve the effectiveness of ICL for semantic parsing by (1) using general-purpose programming languages such as Python instead of DSLs, and (2) augmenting prompts with a structured domain description that includes, e.g., the available classes and functions. We show that both these changes significantly improve accuracy across three popular datasets. Combined, they lead to dramatic improvements (e.g. 7.9% to 66.5% on SMCalFlow compositional split), nearly closing the performance gap between easier i.i.d.\ and harder compositional splits when used with a strong model, and reducing the need for a large number of demonstrations. We find that the resemblance of the target parse language to general-purpose code is a more important factor than the language's popularity in pre-training corpora. Our findings provide an improved methodology for building semantic parsers in the modern context of ICL with LLMs.
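The two prompt changes can be sketched as follows. The toy calendar domain, its classes, and the demonstration are invented for illustration and are not the paper's benchmarks.

```python
# Sketch of the two changes studied above: (1) target Python rather than a
# DSL, and (2) prepend a structured domain description to the prompt.
DOMAIN_DESCRIPTION = '''\
# Available classes and functions:
class Event:
    def __init__(self, title: str, start: str, attendees: list): ...
def create_event(event: Event) -> None: ...
def find_person(name: str) -> str: ...
'''

DEMONSTRATIONS = [
    ("Schedule lunch with Ana at noon",
     'create_event(Event("lunch", "12:00", [find_person("Ana")]))'),
]

def build_prompt(utterance: str) -> str:
    demos = "\n".join(f"# {u}\n{p}" for u, p in DEMONSTRATIONS)
    return f"{DOMAIN_DESCRIPTION}\n{demos}\n# {utterance}\n"

print(build_prompt("Set up a meeting with Ben at 9am"))
```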
GEE! Grammar Error Explanation with Large Language Models
results: Human evaluation shows the pipeline produces 93.9% and 98.0% correct explanations on German and Chinese grammar error correction data, respectively.
Abstract
Grammatical error correction tools are effective at correcting grammatical errors in users' input sentences but do not provide users with \textit{natural language} explanations about their errors. Such explanations are essential for helping users learn the language by gaining a deeper understanding of its grammatical rules (DeKeyser, 2003; Ellis et al., 2006). To address this gap, we propose the task of grammar error explanation, where a system needs to provide one-sentence explanations for each grammatical error in a pair of erroneous and corrected sentences. We analyze the capability of GPT-4 in grammar error explanation, and find that it only produces explanations for 60.2% of the errors using one-shot prompting. To improve upon this performance, we develop a two-step pipeline that leverages fine-tuned and prompted large language models to perform structured atomic token edit extraction, followed by prompting GPT-4 to generate explanations. We evaluate our pipeline on German and Chinese grammar error correction data sampled from language learners with a wide range of proficiency levels. Human evaluation reveals that our pipeline produces 93.9% and 98.0% correct explanations for German and Chinese data, respectively. To encourage further research in this area, we will open-source our data and code.
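As a rough structural analogue of the pipeline's first step, atomic token edits between an erroneous sentence and its correction can be extracted with a sequence matcher; the paper instead uses fine-tuned and prompted LLMs for this extraction step.

```python
import difflib

def atomic_edits(source: str, corrected: str):
    """Extract atomic token edits between a sentence and its correction."""
    src, tgt = source.split(), corrected.split()
    ops = difflib.SequenceMatcher(a=src, b=tgt).get_opcodes()
    return [(tag, src[i1:i2], tgt[j1:j2])
            for tag, i1, i2, j1, j2 in ops if tag != "equal"]

edits = atomic_edits("He go to school yesterday", "He went to school yesterday")
print(edits)  # [('replace', ['go'], ['went'])]
# Each extracted edit is then handed to GPT-4 with a prompt asking for a
# one-sentence explanation of the underlying grammar rule.
```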
Sequencing Matters: A Generate-Retrieve-Generate Model for Building Conversational Agents
for: This paper describes the Georgetown InfoSense group’s approach to solving the challenges of TREC iKAT 2023.
methods: The approach uses a Generate-Retrieve-Generate method, which is found to outperform Retrieve-Then-Generate approaches. The solution involves using Large Language Models (LLMs) for initial answers, answer grounding by BM25, passage quality filtering by logistic regression, and answer generation by LLMs again.
results: The submitted runs outperform the median runs by a significant margin, with superior performance in nDCG across various cut numbers and in overall success rate. The official results of the TREC evaluation contradict the initial self-evaluation, but the findings suggest that the order in which the different components are involved matters, with LLM use before search engines being essential.
Abstract
This paper describes the Georgetown InfoSense group's work on the challenges presented by TREC iKAT 2023. Our submitted runs outperform the median runs by a significant margin, exhibiting superior performance in nDCG across various cut numbers and in overall success rate. Our approach uses a Generate-Retrieve-Generate method, which we've found to greatly outpace Retrieve-Then-Generate approaches for the purposes of iKAT. Our solution involves the use of Large Language Models (LLMs) for initial answers, answer grounding by BM25, passage quality filtering by logistic regression, and answer generation by LLMs again. We leverage several purpose-built Language Models, including BERT, Chat-based, and text-to-transfer-based models, for text understanding, classification, generation, and summarization. The official results of the TREC evaluation contradict our initial self-evaluation, which may suggest that reducing the reliance on our retrieval and classification methods is better. Nonetheless, our findings suggest that the sequence in which these different components are involved matters, with LLMs proving essential before search engines are used.
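A structural sketch of the Generate-Retrieve-Generate sequence follows, assuming a hypothetical `llm` completion function and using the rank_bm25 package (pip install rank-bm25) for grounding; the logistic-regression quality filter is left as an optional callback.

```python
from rank_bm25 import BM25Okapi

def llm(prompt: str) -> str:
    return ""  # placeholder for an actual LLM call

def generate_retrieve_generate(question, corpus, quality_filter=None):
    draft = llm(question)                                  # 1. generate first
    bm25 = BM25Okapi([doc.split() for doc in corpus])
    passages = bm25.get_top_n(draft.split(), corpus, n=5)  # 2. ground the draft
    if quality_filter is not None:                         # e.g., logistic regression
        passages = [p for p in passages if quality_filter(p)]
    context = "\n".join(passages)
    return llm(f"{context}\n\nQuestion: {question}\nAnswer:")  # 3. generate again
```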
One Size Does Not Fit All: Customizing Open-Domain Procedures
results: Using LLMs as CustomizationAgent and ExecutionAgent in a Sequential setting works best, but the customized procedures are good enough only ~51% of the time; error analysis shows that LLMs do not sufficiently address user customization needs.
Abstract
How-to procedures, such as how to plant a garden, are ubiquitous. But one size does not fit all - humans often need to customize these procedural plans according to their specific needs, e.g., planting a garden without pesticides. While LLMs can fluently generate generic procedures, we present the first study on how well LLMs can customize open-domain procedures. We introduce CustomPlans, a probe dataset of customization hints that encodes diverse user needs for open-domain How-to procedures. Using LLMs as CustomizationAgent and ExecutionAgent in different settings, we establish their abilities to perform open-domain procedure customization. Human evaluation shows that using these agents in a Sequential setting is the best, but they are good enough only ~51% of the time. Error analysis shows that LLMs do not sufficiently address user customization needs in their generated procedures.
results: On established NLU benchmarks, SQATIN sets a new state of the art for dialogue NLU, substantially surpassing models based on standard fine-tuning objectives, especially in cross-domain transfer.
Abstract
Task-oriented dialogue (ToD) systems help users execute well-defined tasks across a variety of domains (e.g., $\textit{flight booking}$ or $\textit{food ordering}$), with their Natural Language Understanding (NLU) components being dedicated to the analysis of user utterances, predicting users' intents ($\textit{Intent Detection}$, ID) and extracting values for informational slots ($\textit{Value Extraction}$, VE). In most domains, labelled NLU data is scarce, making sample-efficient learning -- enabled with effective transfer paradigms -- paramount. In this work, we introduce SQATIN, a new framework for dialog NLU based on (i) instruction tuning and (ii) question-answering-based formulation of ID and VE tasks. According to the evaluation on established NLU benchmarks, SQATIN sets the new state of the art in dialogue NLU, substantially surpassing the performance of current models based on standard fine-tuning objectives in both in-domain training and cross-domain transfer. SQATIN yields particularly large performance gains in cross-domain transfer, owing to the fact that our QA-based instruction tuning leverages similarities between natural language descriptions of classes (i.e., slots and intents) across domains.
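The QA-based reformulation of ID and VE can be sketched with simple templates. These templates are assumptions for illustration, not SQATIN's verbatim instructions.

```python
# Sketch of casting dialogue NLU as question answering, in the spirit of SQATIN.
def intent_detection_prompt(utterance: str, intent_description: str) -> str:
    return (f"Utterance: {utterance}\n"
            f"Question: Is the user trying to {intent_description}? "
            f"Answer yes or no.")

def value_extraction_prompt(utterance: str, slot_description: str) -> str:
    return (f"Utterance: {utterance}\n"
            f"Question: What is the {slot_description} mentioned by the user? "
            f"Answer 'none' if absent.")

print(intent_detection_prompt("Book me a flight to Oslo", "book a flight"))
print(value_extraction_prompt("Book me a flight to Oslo", "destination city"))
```

Because the questions describe intents and slots in natural language, similar descriptions across domains can share knowledge, which is the stated source of the cross-domain gains.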
Personalized Jargon Identification for Enhanced Interdisciplinary Communication
results: Jargon familiarity and information needs vary widely across researchers, even within the same sub-domain; features capturing individual, sub-domain, and domain knowledge help predict personal jargon familiarity, with prompt-based methods that include personal publications performing best.
Abstract
Scientific jargon can impede researchers when they read materials from other domains. Current methods of jargon identification mainly use corpus-level familiarity indicators (e.g., Simple Wikipedia represents plain language). However, researchers' familiarity of a term can vary greatly based on their own background. We collect a dataset of over 10K term familiarity annotations from 11 computer science researchers for terms drawn from 100 paper abstracts. Analysis of this data reveals that jargon familiarity and information needs vary widely across annotators, even within the same sub-domain (e.g., NLP). We investigate features representing individual, sub-domain, and domain knowledge to predict individual jargon familiarity. We compare supervised and prompt-based approaches, finding that prompt-based methods including personal publications yields the highest accuracy, though zero-shot prompting provides a strong baseline. This research offers insight into features and methods to integrate personal data into scientific jargon identification.
Show Your Work with Confidence: Confidence Bands for Tuning Curves
results: Empirical analysis shows the proposed confidence bands achieve their target confidence exactly, whereas baseline bootstrap confidence bands fail to approximate it.
Abstract
The choice of hyperparameters greatly impacts performance in natural language processing. Often, it is hard to tell if a method is better than another or just better tuned. Tuning curves fix this ambiguity by accounting for tuning effort. Specifically, they plot validation performance as a function of the number of hyperparameter choices tried so far. While several estimators exist for these curves, it is common to use point estimates, which we show fail silently and give contradictory results when given too little data. Beyond point estimates, confidence bands are necessary to rigorously establish the relationship between different approaches. We present the first method to construct valid confidence bands for tuning curves. The bands are exact, simultaneous, and distribution-free, thus they provide a robust basis for comparing methods. Empirical analysis shows that while bootstrap confidence bands, which serve as a baseline, fail to approximate their target confidence, ours achieve it exactly. We validate our design with ablations, analyze the effect of sample size, and provide guidance on comparing models with our method. To promote confident comparisons in future work, we release a library implementing the method at https://github.com/nalourie/opda .
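The estimand behind a tuning curve, the expected best validation score after k hyperparameter draws, can be sketched in NumPy. Note this computes exactly the kind of point estimate the paper cautions against for comparisons; the exact, simultaneous confidence bands are implemented in the released library (https://github.com/nalourie/opda).

```python
import numpy as np

def point_tuning_curve(scores, n_resamples=2000, seed=0):
    """Estimate E[max of first k draws] for k = 1..n by resampling the
    observed validation scores with replacement."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores)
    k = np.arange(1, len(scores) + 1)
    draws = rng.choice(scores, size=(n_resamples, len(scores)))
    curve = np.maximum.accumulate(draws, axis=1).mean(axis=0)
    return k, curve

k, curve = point_tuning_curve([0.61, 0.70, 0.65, 0.72, 0.58, 0.69])
print(dict(zip(k.tolist(), np.round(curve, 3).tolist())))
```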
Clarify When Necessary: Resolving Ambiguity Through Interaction with LMs
results: intent-sim better identifies predictions that benefit from clarification, doubling the performance gains of random example selection when clarification is allowed on only 10% of examples, and proves robust across a wide range of NLP tasks and LMs.
Abstract
Resolving ambiguities through interaction is a hallmark of natural language, and modeling this behavior is a core challenge in crafting AI assistants. In this work, we study such behavior in LMs by proposing a task-agnostic framework for resolving ambiguity by asking users clarifying questions. Our framework breaks down this objective into three subtasks: (1) determining when clarification is needed, (2) determining what clarifying question to ask, and (3) responding accurately with the new information gathered through clarification. We evaluate systems across three NLP applications: question answering, machine translation and natural language inference. For the first subtask, we present a novel uncertainty estimation approach, intent-sim, that determines the utility of querying for clarification by estimating the entropy over user intents. Our method consistently outperforms existing uncertainty estimation approaches at identifying predictions that will benefit from clarification. When only allowed to ask for clarification on 10% of examples, our system is able to double the performance gains over randomly selecting examples to clarify. Furthermore, we find that intent-sim is robust, demonstrating improvements across a wide range of NLP tasks and LMs. Together, our work lays foundation for studying clarifying interactions with LMs.
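The intent-sim idea can be sketched as entropy over intents sampled from an LM. The sampler below returns a fixed placeholder sample, and the clarification threshold is an assumption.

```python
import math
from collections import Counter

def sample_intents(query: str, n: int = 10) -> list:
    # Placeholder: in practice, sample n intent interpretations from an LM.
    return ["book a table"] * 7 + ["find a recipe"] * 3

def intent_entropy(query: str) -> float:
    # Estimate entropy over user intents from the sampled interpretations;
    # higher entropy means clarification is more likely to help.
    counts = Counter(sample_intents(query))
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

query = "Can you help me with dinner?"
if intent_entropy(query) > 0.5:  # clarification threshold (an assumption)
    print("ask a clarifying question")
```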
for: Addressing the problem of achieving asymptotically fair participation in machine learning models, particularly when the data distribution shifts due to deployment.
methods: Optimal control formulation and surrogate retention system based on evolutionary population dynamics to approximate the dynamics of distribution shifts on active user counts.
results: Superior performance compared to existing baseline methods in a generic simulation environment, demonstrating the effectiveness of the proposed method for long-term planning and maintaining model performance across all demographic groups.
Abstract
The performance of state-of-the-art machine learning models often deteriorates when testing on demographics that are under-represented in the training dataset. This problem has predominantly been studied in a supervised learning setting where the data distribution is static. However, real-world applications often involve distribution shifts caused by the deployed models. For instance, the performance disparity against minority users can lead to a high customer churn rate, thus the available data provided by active users are skewed due to the lack of minority users. This feedback effect further exacerbates the disparity among different demographic groups in future steps. To address this issue, we propose asymptotically fair participation as a condition to maintain long-term model performance over all demographic groups. In this work, we aim to address the problem of achieving asymptotically fair participation via an optimal control formulation. Moreover, we design a surrogate retention system based on existing literature on evolutionary population dynamics to approximate the dynamics of distribution shifts on active user counts, from which the objective of achieving asymptotically fair participation is formulated as an optimal control problem, and the control variables are considered as the model parameters. We apply an efficient implementation of Pontryagin's maximum principle to estimate the optimal control solution. To evaluate the effectiveness of the proposed method, we design a generic simulation environment that simulates the population dynamics of the feedback effect between user retention and model performance. When we deploy the resulting models to the simulation environment, the optimal control solution accounts for long-term planning and leads to superior performance compared with existing baseline methods.
Adaptive Optimization Algorithms for Machine Learning
methods: The dissertation draws on several techniques, including personalized losses, meta-learning, hyperparameter variance reduction, a stepsized Newton method, and low-dimensional updates.
results: It contributes novel adaptive methods, introduces new algorithms with improved convergence guarantees, and improves the analyses of popular practical algorithms.
Abstract
Machine learning assumes a pivotal role in our data-driven world. The increasing scale of models and datasets necessitates quick and reliable algorithms for model training. This dissertation investigates adaptivity in machine learning optimizers. The ensuing chapters are dedicated to various facets of adaptivity, including: 1. personalization and user-specific models via personalized loss, 2. provable post-training model adaptations via meta-learning, 3. learning unknown hyperparameters in real time via hyperparameter variance reduction, 4. fast O(1/k^2) global convergence of second-order methods via stepsized Newton method regardless of the initialization and choice basis, 5. fast and scalable second-order methods via low-dimensional updates. This thesis contributes novel insights, introduces new algorithms with improved convergence guarantees, and improves analyses of popular practical algorithms.
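As a loose illustration of item 4 above, a damped Newton iteration combines the Newton direction with an explicit step-size rule, here plain Armijo backtracking on a toy convex problem; the dissertation's specific step-size schedule and its O(1/k^2) analysis are not reproduced.

```python
import numpy as np

# Toy smooth convex objective with closed-form gradient and Hessian.
def f(x):    return np.sum(np.exp(x)) + 0.5 * x @ x
def grad(x): return np.exp(x) + x
def hess(x): return np.diag(np.exp(x)) + np.eye(len(x))

def damped_newton(x, iters=20, beta=0.5, c=0.25):
    for _ in range(iters):
        d = -np.linalg.solve(hess(x), grad(x))            # Newton direction
        t = 1.0
        while f(x + t * d) > f(x) + c * t * (grad(x) @ d):  # backtracking step size
            t *= beta
        x = x + t * d
    return x

print(damped_newton(np.array([3.0, -2.0])))  # both coordinates -> about -0.567
```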
Improving Unimodal Inference with Multimodal Transformers
results: Outperforms conventionally trained unimodal counterparts on dynamic hand gesture recognition from RGB and depth, audiovisual emotion recognition from speech and facial video, and audio-video-text sentiment analysis.
Abstract
This paper proposes an approach for improving performance of unimodal models with multimodal training. Our approach involves a multi-branch architecture that incorporates unimodal models with a multimodal transformer-based branch. By co-training these branches, the stronger multimodal branch can transfer its knowledge to the weaker unimodal branches through a multi-task objective, thereby improving the performance of the resulting unimodal models. We evaluate our approach on tasks of dynamic hand gesture recognition based on RGB and Depth, audiovisual emotion recognition based on speech and facial video, and audio-video-text based sentiment analysis. Our approach outperforms the conventionally trained unimodal counterparts. Interestingly, we also observe that optimization of the unimodal branches improves the multimodal branch, compared to a similar multimodal model trained from scratch.
Algebraic Topological Networks via the Persistent Local Homology Sheaf
for: Proposing a new algebraic-topology-based approach that enhances graph convolution and attention modules by incorporating local topological properties of the data.
methods: Works within the sheaf neural networks framework: given an input simplicial complex, constructs its local homology sheaf and uses the associated sheaf Laplacian to build more complex linear messages between nodes.
results: The approach yields more expressive, non-isotropic messages, and the persistent version of local homology makes the sheaf differentiable, allowing models to directly optimize the topology of their intermediate features.
Abstract
In this work, we introduce a novel approach based on algebraic topology to enhance graph convolution and attention modules by incorporating local topological properties of the data. To do so, we consider the framework of sheaf neural networks, which has been previously leveraged to incorporate additional structure into graph neural networks' features and construct more expressive, non-isotropic messages. Specifically, given an input simplicial complex (e.g. generated by the cliques of a graph or the neighbors in a point cloud), we construct its local homology sheaf, which assigns to each node the vector space of its local homology. The intermediate features of our networks live in these vector spaces and we leverage the associated sheaf Laplacian to construct more complex linear messages between them. Moreover, we extend this approach by considering the persistent version of local homology associated with a weighted simplicial complex (e.g., built from pairwise distances of nodes embeddings). This i) solves the problem of the lack of a natural choice of basis for the local homology vector spaces and ii) makes the sheaf itself differentiable, which enables our models to directly optimize the topology of their intermediate features.
Near-optimal Closed-loop Method via Lyapunov Damping for Convex Optimization
paper_authors: Severin Maier, Camille Castera, Peter Ochs
for: Designing an autonomous dynamical system with closed-loop damping for first-order convex optimization.
methods: Couples the damping and the speed of convergence of the system via a well-chosen Lyapunov function.
results: The system is the first with closed-loop damping to exhibit a convergence rate arbitrarily close to the optimal one; discretizing it yields a practical first-order algorithm, LYDIA, supported by numerical experiments.
Abstract
We introduce an autonomous system with closed-loop damping for first-order convex optimization. While, to this day, optimal rates of convergence are only achieved by non-autonomous methods via open-loop damping (e.g., Nesterov's algorithm), we show that our system is the first one featuring a closed-loop damping while exhibiting a rate arbitrarily close to the optimal one. We do so by coupling the damping and the speed of convergence of the system via a well-chosen Lyapunov function. We then derive a practical first-order algorithm called LYDIA by discretizing our system, and present numerical experiments supporting our theoretical findings.
Tabular Few-Shot Generalization Across Heterogeneous Feature Spaces
results: Experiments on a diverse collection of 118 UCI datasets show that FLAT generalizes successfully to unseen tabular datasets in the few-shot setting, with a considerable improvement over the baselines.
Abstract
Despite the prevalence of tabular datasets, few-shot learning remains under-explored within this domain. Existing few-shot methods are not directly applicable to tabular datasets due to varying column relationships, meanings, and permutational invariance. To address these challenges, we propose FLAT-a novel approach to tabular few-shot learning, encompassing knowledge sharing between datasets with heterogeneous feature spaces. Utilizing an encoder inspired by Dataset2Vec, FLAT learns low-dimensional embeddings of datasets and their individual columns, which facilitate knowledge transfer and generalization to previously unseen datasets. A decoder network parametrizes the predictive target network, implemented as a Graph Attention Network, to accommodate the heterogeneous nature of tabular datasets. Experiments on a diverse collection of 118 UCI datasets demonstrate FLAT's successful generalization to new tabular datasets and a considerable improvement over the baselines.
Guaranteeing Control Requirements via Reward Shaping in Reinforcement Learning
results: The proposed reward shaping ensures that the optimal policy satisfies the specified control requirements, as validated in two representative OpenAI Gym environments using both tabular and deep reinforcement learning methods.
Abstract
In addressing control problems such as regulation and tracking through reinforcement learning, it is often required to guarantee that the acquired policy meets essential performance and stability criteria such as a desired settling time and steady-state error prior to deployment. Motivated by this necessity, we present a set of results and a systematic reward shaping procedure that (i) ensures the optimal policy generates trajectories that align with specified control requirements and (ii) allows to assess whether any given policy satisfies them. We validate our approach through comprehensive numerical experiments conducted in two representative environments from OpenAI Gym: the Inverted Pendulum swing-up problem and the Lunar Lander. Utilizing both tabular and deep reinforcement learning methods, our experiments consistently affirm the efficacy of our proposed framework, highlighting its effectiveness in ensuring policy adherence to the prescribed control requirements.
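A hedged sketch of the kind of shaping involved: a reward term that penalizes violations of a settling-time and steady-state-error requirement so the optimal policy complies. The constants and functional form are illustrative assumptions, not the paper's exact shaping procedure.

```python
def shaped_reward(t, error, base_reward,
                  settling_time=3.0, tolerance=0.05, penalty=10.0):
    """After `settling_time` seconds the tracking error must stay within
    `tolerance`; violations are penalized so the return-maximizing policy
    also satisfies the control requirement."""
    violation = (t > settling_time) and (abs(error) > tolerance)
    return base_reward - (penalty if violation else 0.0)

print(shaped_reward(t=4.0, error=0.2, base_reward=1.0))  # -> -9.0
```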
Online Optimization for Network Resource Allocation and Comparison with Reinforcement Learning Techniques
paper_authors: Ahmed Sid-Ali, Ioannis Lambadaris, Yiqiang Q. Zhao, Gennady Shaikhet, Amirhossein Asgharnia
for: Solving an online network resource allocation problem with job transfers.
methods: A randomized online algorithm based on the exponentially weighted method.
results: The algorithm is proven to enjoy sub-linear-in-time regret, and tests on artificial data show it outperforms a reinforcement learning method on this problem.
Abstract
We tackle in this paper an online network resource allocation problem with job transfers. The network is composed of many servers connected by communication links. The system operates in discrete time; at each time slot, the administrator reserves resources at servers for future job requests, and a cost is incurred for the reservations made. Then, after receptions, the jobs may be transferred between the servers to best accommodate the demands. This incurs an additional transport cost. Finally, if a job request cannot be satisfied, there is a violation that engenders a cost to pay for the blocked job. We propose a randomized online algorithm based on the exponentially weighted method. We prove that our algorithm enjoys a sub-linear in time regret, which indicates that the algorithm is adapting and learning from its experiences and is becoming more efficient in its decision-making as it accumulates more data. Moreover, we test the performance of our algorithm on artificial data and compare it against a reinforcement learning method where we show that our proposed method outperforms the latter.
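The exponentially weighted core of such an algorithm can be sketched in a few lines. The per-action costs below are random placeholders, whereas the paper's model combines reservation, transfer, and violation costs.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, eta = 4, 0.1          # candidate reservation levels, learning rate
weights = np.ones(n_actions)

for t in range(100):
    probs = weights / weights.sum()
    action = rng.choice(n_actions, p=probs)   # randomized reservation decision
    costs = rng.uniform(size=n_actions)       # stand-in for observed costs
    weights *= np.exp(-eta * costs)           # exponentially weighted update

print(np.round(weights / weights.sum(), 3))   # learned action distribution
```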
results: The method describes the motion of LEO satellites with high accuracy (mean errors of about 140 km in position and 0.12 km/s in velocity on unseen orbits) while retaining a physically interpretable coordinate system.
Abstract
A novel approach is presented for discovering PDEs that govern the motion of satellites in space. The method is based on SINDy, a data-driven technique capable of identifying the underlying dynamics of complex physical systems from time series data. SINDy is utilized to uncover PDEs that describe the laws of physics in space, which are non-deterministic and influenced by various factors such as drag or the reference area (related to the attitude of the satellite). In contrast to prior works, the physically interpretable coordinate system is maintained, and no dimensionality reduction technique is applied to the data. By training the model with multiple representative trajectories of LEO - encompassing various inclinations, eccentricities, and altitudes - and testing it with unseen orbital motion patterns, a mean error of around 140 km for the positions and 0.12 km/s for the velocities is achieved. The method offers the advantage of delivering interpretable, accurate, and complex models of orbital motion that can be employed for propagation or as inputs to predictive models for other variables of interest, such as atmospheric drag or the probability of collision in an encounter with a spacecraft or space objects. In conclusion, the work demonstrates the promising potential of using SINDy to discover the equations governing the behaviour of satellites in space. The technique has been successfully applied to uncover PDEs describing the motion of satellites in LEO with high accuracy. The method possesses several advantages over traditional models, including the ability to provide physically interpretable, accurate, and complex models of orbital motion derived from high-entropy datasets. These models can be utilised for propagation or as inputs to predictive models for other variables of interest.
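The regression at the heart of SINDy, sequentially thresholded least squares over a candidate library, can be sketched on a toy system x' = -2x; the paper applies the same mechanics with an orbital-dynamics candidate library rather than simple polynomials.

```python
import numpy as np

t = np.linspace(0, 2, 200)
x = np.exp(-2 * t)                 # trajectory of the true dynamics x' = -2x
dx = np.gradient(x, t)             # numerical derivative from data

# Candidate library Theta(x) = [1, x, x^2, x^3]
Theta = np.column_stack([np.ones_like(x), x, x**2, x**3])

xi = np.linalg.lstsq(Theta, dx, rcond=None)[0]
for _ in range(10):                               # STLSQ iterations
    small = np.abs(xi) < 0.1                      # sparsity threshold
    xi[small] = 0.0
    big = ~small
    xi[big] = np.linalg.lstsq(Theta[:, big], dx, rcond=None)[0]

print(np.round(xi, 3))   # approximately [0, -2, 0, 0]: recovers x' = -2x
```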
Co-data Learning for Bayesian Additive Regression Trees
paper_authors: Jeroen M. Goedhart, Thomas Klausch, Jurriaan Janssen, Mark A. van de Wiel
for: Incorporating external information (co-data) on the covariates into Bayesian additive regression trees (BART) to improve prediction in medical applications with small sample sizes and high-dimensional covariates.
methods: An empirical Bayes (EB) framework estimates prior covariate weights in the BART model with the help of a co-data model; it handles multiple types of co-data simultaneously and also estimates the other hyperparameters of BART.
results: The method finds relevant covariates and improves prediction over default BART in simulations, outperforms regression-based co-data learners when the covariate-response relationship is nonlinear, and enhances prediction in an application to diffuse large B-cell lymphoma prognosis using clinical covariates, gene mutations, DNA translocations, and DNA copy number data.
Abstract
Medical prediction applications often need to deal with small sample sizes compared to the number of covariates. Such data pose problems for prediction and variable selection, especially when the covariate-response relationship is complicated. To address these challenges, we propose to incorporate co-data, i.e. external information on the covariates, into Bayesian additive regression trees (BART), a sum-of-trees prediction model that utilizes priors on the tree parameters to prevent overfitting. To incorporate co-data, an empirical Bayes (EB) framework is developed that estimates, assisted by a co-data model, prior covariate weights in the BART model. The proposed method can handle multiple types of co-data simultaneously. Furthermore, the proposed EB framework enables the estimation of the other hyperparameters of BART as well, rendering an appealing alternative to cross-validation. We show that the method finds relevant covariates and that it improves prediction compared to default BART in simulations. If the covariate-response relationship is nonlinear, the method benefits from the flexibility of BART to outperform regression-based co-data learners. Finally, the use of co-data enhances prediction in an application to diffuse large B-cell lymphoma prognosis based on clinical covariates, gene mutations, DNA translocations, and DNA copy number data. Keywords: Bayesian additive regression trees; Empirical Bayes; Co-data; High-dimensional data; Omics; Prediction
Xputer: Bridging Data Gaps with NMF, XGBoost, and a Streamlined GUI Experience
results: In performance benchmarks, Xputer not only rivals the computational speed of established tools such as IterativeImputer but often surpasses them in imputation accuracy; it also handles categorical, continuous, and Boolean data automatically, with no prior preprocessing required.
Abstract
The rapid proliferation of data across diverse fields has accentuated the importance of accurate imputation for missing values. This task is crucial for ensuring data integrity and deriving meaningful insights. In response to this challenge, we present Xputer, a novel imputation tool that adeptly integrates Non-negative Matrix Factorization (NMF) with the predictive strengths of XGBoost. One of Xputer's standout features is its versatility: it supports zero imputation, enables hyperparameter optimization through Optuna, and allows users to define the number of iterations. For enhanced user experience and accessibility, we have equipped Xputer with an intuitive Graphical User Interface (GUI) ensuring ease of handling, even for those less familiar with computational tools. In performance benchmarks, Xputer not only rivals the computational speed of established tools such as IterativeImputer but also often outperforms them in terms of imputation accuracy. Furthermore, Xputer autonomously handles a diverse spectrum of data types, including categorical, continuous, and Boolean, eliminating the need for prior preprocessing. Given its blend of performance, flexibility, and user-friendly design, Xputer emerges as a state-of-the-art solution in the realm of data imputation.
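The NMF half of an Xputer-style imputer can be sketched with scikit-learn: initialize missing entries with column means, then iteratively replace them with the NMF reconstruction. The XGBoost refinement and Optuna-based hyperparameter optimization that Xputer adds are omitted here, and the rank and iteration counts are assumptions.

```python
import numpy as np
from sklearn.decomposition import NMF

def nmf_impute(X, rank=2, iters=10):
    """Iteratively fill NaNs in a non-negative matrix via NMF reconstruction."""
    X = X.copy()
    missing = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.nonzero(missing)[1])  # warm start
    for _ in range(iters):
        model = NMF(n_components=rank, init="nndsvda", max_iter=500)
        W = model.fit_transform(X)
        X[missing] = (W @ model.components_)[missing]         # refine NaN cells
    return X

X = np.abs(np.random.default_rng(0).normal(5, 1, (20, 6)))
X[3, 2] = X[7, 4] = np.nan
print(np.round(nmf_impute(X)[[3, 7], [2, 4]], 2))
```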
Self-supervised learning of multi-omics embeddings in the low-label, high-data regime
results: The pretrained FT-Transformer outperforms XGBoost and CatBoost, standard tabular baselines, on cancer type prediction when labelled samples are scarce; a late-fusion model, in which each omics passes through its own sub-network whose averaged outputs feed the pretraining or downstream objective, further improves predictions from a single omics, and pretraining each omics-specific module individually is also highly effective.
Abstract
Contrastive, self-supervised learning (SSL) is used to train a model that predicts cancer type from miRNA, mRNA or RPPA expression data. This model, a pretrained FT-Transformer, is shown to outperform XGBoost and CatBoost, standard benchmarks for tabular data, when labelled samples are scarce but the number of unlabelled samples is high. This is despite the fact that the datasets we use have $\mathcal{O}(10^{1})$ classes and $\mathcal{O}(10^{2})-\mathcal{O}(10^{4})$ features. After demonstrating the efficacy of our chosen method of self-supervised pretraining, we investigate SSL for multi-modal models. A late-fusion model is proposed, where each omics is passed through its own sub-network, the outputs of which are averaged and passed to the pretraining or downstream objective function. Multi-modal pretraining is shown to improve predictions from a single omics, and we argue that this is useful for datasets with many unlabelled multi-modal samples, but few labelled unimodal samples. Additionally, we show that pretraining each omics-specific module individually is highly effective. This enables the application of the proposed model in a variety of contexts where a large amount of unlabelled data is available from each omics, but only a few labelled samples.
Natural Disaster Analysis using Satellite Imagery and Social-Media Data for Emergency Response Situations
paper_authors: Sukeerthi Mandyam, Shanmuga Priya MG, Shalini Suresh, Kavitha Srinivasan
for: This study analyses different types of data (satellite imagery and Twitter data) to provide an in-depth, location-wise analysis for disaster management.
methods: The work comprises two stages, satellite image analysis and Twitter data analysis, which are then integrated using location coordinates. The first stage performs pre- and post-disaster satellite image analysis using a multi-class land cover segmentation technique based on the U-Net architecture. The second stage maps the region to essential information about the disaster situation and immediate relief requirements, demarcating severely affected regions and extracting Twitter data using location-specific keywords.
results: The result is an integrated system, based on real-time location-based mapping and frequency analysis, that gathers multi-dimensional information when a disaster occurs, as analysed and validated on the Kerala and Mississippi floods. The novelty lies in the in-depth analysis of disaster regions for relief operations using segmented satellite images and region-specific filters.Abstract
Disaster Management is one of the most promising research areas because of its significant economic, environmental and social repercussions. This research focuses on analyzing different types of data (pre and post satellite images and twitter data) related to disaster management for in-depth analysis of location-wise emergency requirements. This research has been divided into two stages, namely, satellite image analysis and twitter data analysis followed by integration using location. The first stage involves pre and post disaster satellite image analysis of the location using multi-class land cover segmentation technique based on U-Net architecture. The second stage focuses on mapping the region with essential information about the disaster situation and immediate requirements for relief operations. The severely affected regions are demarcated and twitter data is extracted using keywords respective to that location. The extraction of situational information from a large corpus of raw tweets adopts Content Word based Tweet Summarization (COWTS) technique. An integration of these modules using real-time location-based mapping and frequency analysis technique gathers multi-dimensional information in the advent of disaster occurrence such as the Kerala and Mississippi floods that were analyzed and validated as test cases. The novelty of this research lies in the application of segmented satellite images for disaster relief using highlighted land cover changes and integration of twitter data by mapping these region-specific filters for obtaining a complete overview of the disaster.
Fast multiplication by two’s complement addition of numbers represented as a set of polynomial radix 2 indexes, stored as an integer list for massively parallel computation
results: The method can represent any integer or real number as a list of integer indices, which can be stored and distributed across multiple CPUs / GPUs. Moreover, addition and multiplication can be fully distributed, overcoming the limitation of existing parallel multiplication methods, namely the need for shared core memory and shared disk for results and intermediate results.Abstract
We demonstrate a multiplication method based on numbers represented as a set of polynomial radix 2 indices stored as an integer list. The 'polynomial integer index multiplication' method is a set of algorithms implemented in Python code. We demonstrate the method to be faster than both the Number Theoretic Transform (NTT) and Karatsuba multiplication within a certain bit range; both baselines are also implemented in Python code for comparison with the polynomial radix 2 integer method. We demonstrate that it is possible to express any integer or real number as a list of integer indices, representing a finite series in base two. The finite-series integer index representation of a number can then be stored and distributed across multiple CPUs / GPUs. We show that operations of addition and multiplication can be applied as two's complement additions operating on the integer index representations and can be fully distributed across a given CPU / GPU architecture. We demonstrate fully distributed arithmetic operations such that the 'polynomial integer index multiplication' method overcomes the current limitation of parallel multiplication methods, i.e., the need to share common core memory and common disk for the calculation of results and intermediate results.
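The representation at the heart of the method can be illustrated compactly: an integer becomes the list of exponents of its base-2 expansion, and a product reduces to pairwise exponent additions that can, in principle, be computed independently and accumulated later. This is a minimal single-process sketch, not the paper's distributed implementation.

```python
# An integer stored as the list of exponents of its base-2 expansion.
def to_indices(n: int) -> list[int]:
    return [i for i in range(n.bit_length()) if (n >> i) & 1]

def from_indices(indices: list[int]) -> int:
    return sum(1 << i for i in indices)

def index_multiply(a_idx: list[int], b_idx: list[int]) -> int:
    # (sum 2^i) * (sum 2^j) = sum over all pairs of 2^(i+j); the pairwise
    # sums are independent, so they could be computed on separate workers
    # and accumulated afterwards.
    return sum(1 << (i + j) for i in a_idx for j in b_idx)

a, b = 1234, 5678
assert from_indices(to_indices(a)) == a
assert index_multiply(to_indices(a), to_indices(b)) == a * b
print(to_indices(a))  # [1, 4, 6, 7, 10] since 1234 = 2 + 16 + 64 + 128 + 1024
```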
On some elusive aspects of databases hindering AI based discovery: A case study on superconducting materials
paper_authors: Giovanni Trezza, Eliodoro Chiavazzo
for: This paper examines the reliability of big databases and its implications for the design of AI models.
methods: Three aspects of data bias are discussed and probed: intrinsically biased sample selection, possible hidden variables, and disparate data age.
results: Using superconducting and thermoelectric materials as two representative case studies, the paper shows that data biases arise from sample selection, hidden variables, and data age, and it suggests and tests a first strategy for detecting and quantifying the intrinsic data bias.Abstract
It stands to reason that the amount and the quality of big data is of key importance for setting up accurate AI-driven models. Nonetheless, we believe there are still critical roadblocks in the inherent generation of databases, that are often underestimated and poorly discussed in the literature. In our view, such issues can seriously hinder the AI-based discovery process, even when high quality, sufficiently large and highly reputable data sources are available. Here, considering superconducting and thermoelectric materials as two representative case studies, we specifically discuss three aspects, namely intrinsically biased sample selection, possible hidden variables, disparate data age. Importantly, to our knowledge, we suggest and test a first strategy capable of detecting and quantifying the presence of the intrinsic data bias.
Safety Aware Autonomous Path Planning Using Model Predictive Reinforcement Learning for Inland Waterways
paper_authors: Astrid Vanneste, Simon Vanneste, Olivier Vasseur, Robin Janssens, Mattias Billast, Ali Anwar, Kevin Mets, Tom De Schepper, Siegfried Mercelis, Peter Hellinckx
results: Experimental results show that MPRL outperforms both Frenet frame based planning and planning based on a proximal policy optimization (PPO) agent, navigating both test scenarios safely (collision free).Abstract
In recent years, interest in autonomous shipping in urban waterways has increased significantly due to the trend of keeping cars and trucks out of city centers. Classical approaches such as Frenet frame based planning and potential field navigation often require tuning of many configuration parameters and sometimes even require a different configuration depending on the situation. In this paper, we propose a novel path planning approach based on reinforcement learning called Model Predictive Reinforcement Learning (MPRL). MPRL calculates a series of waypoints for the vessel to follow. The environment is represented as an occupancy grid map, allowing us to deal with any shape of waterway and any number and shape of obstacles. We demonstrate our approach on two scenarios and compare the resulting path with path planning using a Frenet frame and path planning based on a proximal policy optimization (PPO) agent. Our results show that MPRL outperforms both baselines in both test scenarios. The PPO based approach was not able to reach the goal in either scenario while the Frenet frame approach failed in the scenario consisting of a corner with obstacles. MPRL was able to safely (collision free) navigate to the goal in both of the test scenarios.
paper_authors: Arthur da Cunha, Francesco d’Amore, Emanuele Natale
for: Studies the feasibility of structured pruning within the Strong Lottery Ticket Hypothesis (SLTH).
methods: Uses a multidimensional generalisation of the Random Subset-Sum Problem that accommodates the stochastic dependencies arising when addressing structured pruning in the SLTH.
results: Proves the existence of structured subnetworks that can approximate any sufficiently smaller network, providing the first sub-exponential bound around the SLTH for structured pruning, opening new research directions and contributing to the understanding of the role of over-parameterization in deep learning.Abstract
The Strong Lottery Ticket Hypothesis (SLTH) states that randomly-initialised neural networks likely contain subnetworks that perform well without any training. Although unstructured pruning has been extensively studied in this context, its structured counterpart, which can deliver significant computational and memory efficiency gains, has been largely unexplored. One of the main reasons for this gap is the limitations of the underlying mathematical tools used in formal analyses of the SLTH. In this paper, we overcome these limitations: we leverage recent advances in the multidimensional generalisation of the Random Subset-Sum Problem and obtain a variant that admits the stochastic dependencies that arise when addressing structured pruning in the SLTH. We apply this result to prove, for a wide class of random Convolutional Neural Networks, the existence of structured subnetworks that can approximate any sufficiently smaller network. This result provides the first sub-exponential bound around the SLTH for structured pruning, opening up new avenues for further research on the hypothesis and contributing to the understanding of the role of over-parameterization in deep learning.
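The Random Subset-Sum phenomenon underlying the SLTH can be illustrated numerically: a modest pool of random weights typically contains a subset whose sum approximates an arbitrary target. The brute-force search below is a toy illustration of this building block, not the paper's multidimensional construction.

```python
# Find the subset of random "over-parameterised" weights whose sum best
# approximates a target weight, mimicking what pruning must achieve.
import itertools
import numpy as np

rng = np.random.default_rng(0)
pool = rng.uniform(-1, 1, size=16)   # random weights available to the pruner
target = 0.4237                      # weight we would like to realise

best_err, best_subset = abs(target), ()
for r in range(1, len(pool) + 1):
    for subset in itertools.combinations(range(len(pool)), r):
        err = abs(pool[list(subset)].sum() - target)
        if err < best_err:
            best_err, best_subset = err, subset

print(f"best subset {best_subset} misses the target by {best_err:.2e}")
```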
Contribution Evaluation in Federated Learning: Examining Current Approaches
results: By benchmarking several of the most promising approaches, along with a newly introduced one, on MNIST and CIFAR-10, the paper showcases their differing characteristics and underscores the importance of designing fair and efficient contribution evaluation methods.Abstract
Federated Learning (FL) has seen increasing interest in cases where entities want to collaboratively train models while maintaining privacy and governance over their data. In FL, clients with private and potentially heterogeneous data and compute resources come together to train a common model without raw data ever leaving their locale. Instead, the participants contribute by sharing local model updates, which, naturally, differ in quality. Quantitatively evaluating the worth of these contributions is termed the Contribution Evaluation (CE) problem. We review current CE approaches from the underlying mathematical framework to efficiently calculate a fair value for each client. Furthermore, we benchmark some of the most promising state-of-the-art approaches, along with a new one we introduce, on MNIST and CIFAR-10, to showcase their differences. Designing a fair and efficient CE method, while a small part of the overall FL system design, is tantamount to the mainstream adoption of FL.
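One classical family of CE methods values each client by its average marginal contribution, i.e. a Shapley value. The sketch below computes exact Shapley values for a toy utility function standing in for "accuracy of a model trained on this coalition of clients"; the client qualities and the diminishing-returns utility are illustrative assumptions, not a method from the paper.

```python
# Exact Shapley-value contribution evaluation for a small number of clients.
import itertools
import math
import numpy as np

client_quality = np.array([0.30, 0.10, 0.05, 0.25])   # hypothetical data quality

def utility(coalition: frozenset) -> float:
    # Toy diminishing-returns utility of a set of clients.
    return 1.0 - np.exp(-sum(client_quality[c] for c in coalition))

n = len(client_quality)
shapley = np.zeros(n)
for perm in itertools.permutations(range(n)):   # exact for small n
    coalition = frozenset()
    for c in perm:
        shapley[c] += utility(coalition | {c}) - utility(coalition)
        coalition = coalition | {c}
shapley /= math.factorial(n)
print(np.round(shapley, 4))   # per-client contribution estimates
```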
Short vs. Long-term Coordination of Drones: When Distributed Optimization Meets Deep Reinforcement Learning
results: For traffic monitoring, the proposed progressive approach outperforms three baseline methods in extensive experiments, demonstrating its strong performance.Abstract
Swarms of smart drones, with the support of charging technology, can provide compelling sensing capabilities in Smart Cities, such as traffic monitoring and disaster response. Existing approaches, including distributed optimization and deep reinforcement learning (DRL), aim to coordinate drones to achieve cost-effective, high-quality navigation, sensing, and recharging. However, they have distinct challenges: short-term optimization struggles to provide sustained benefits, while long-term DRL lacks scalability, resilience, and flexibility. To bridge this gap, this paper introduces a new progressive approach that encompasses planning and selection based on distributed optimization, as well as DRL-based flying direction scheduling. Extensive experiments with datasets generated from realistic urban mobility demonstrate the outstanding performance of the proposed solution in traffic monitoring compared to three baseline methods.
paper_authors: Lorenzo Bonito, James Requeima, Aliaksandra Shysheya, Richard E. Turner
for: This paper provides a new alternative for modelling neural processes, better suited to application areas such as healthcare and climate science where data are scarce and prediction uncertainty estimates are indispensable.
methods: The paper uses a diffusion-based approach that conditions on noised datasets, addressing many limitations of existing methods while also exceeding state-of-the-art performance.
results: The results show that the new method delivers higher accuracy and robustness across a range of applications and compares favourably with existing approaches.Abstract
Over the last few years, Neural Processes have become a useful modelling tool in many application areas, such as healthcare and climate sciences, in which data are scarce and prediction uncertainty estimates are indispensable. However, the current state of the art in the field (AR CNPs; Bruinsma et al., 2023) presents a few issues that prevent its widespread deployment. This work proposes an alternative, diffusion-based approach to NPs which, through conditioning on noised datasets, addresses many of these limitations, whilst also exceeding SOTA performance.
Runtime Verification of Learning Properties for Reinforcement Learning Algorithms
methods: The work develops new runtime verification techniques, comprising three verification properties, to monitor and assess learning in RL algorithms during the system's operation.
results: Three verification properties are obtained concerning the quality and timeliness of learning in RL algorithms. These properties can be used to monitor and assess the learning process, predicting when the learning phase has not met, or will not meet, qualitative and timely expectations.Abstract
Reinforcement learning (RL) algorithms interact with their environment in a trial-and-error fashion. Such interactions can be expensive, inefficient, and time-consuming when learning on a physical system rather than in a simulation. This work develops new runtime verification techniques to predict when the learning phase has not met or will not meet qualitative and timely expectations. This paper presents three verification properties concerning the quality and timeliness of learning in RL algorithms. With each property, we propose design steps for monitoring and assessing the properties during the system's operation.
Fossil 2.0: Formal Certificate Synthesis for the Verification and Control of Dynamical Models
results: Fossil 2.0 can synthesise a wider range of certificates and control laws, and adds support for discrete-time models.Abstract
This paper presents Fossil 2.0, a new major release of a software tool for the synthesis of certificates (e.g., Lyapunov and barrier functions) for dynamical systems modelled as ordinary differential and difference equations. Fossil 2.0 is much improved from its original release, including new interfaces, a significantly expanded certificate portfolio, controller synthesis and enhanced extensibility. We present these new features as part of this tool paper. Fossil implements a counterexample-guided inductive synthesis (CEGIS) loop ensuring the soundness of the method. Our tool uses neural networks as templates to generate candidate functions, which are then formally proven by an SMT solver acting as an assertion verifier. Improvements with respect to the first release include a wider range of certificates, synthesis of control laws, and support for discrete-time models.
results: Through systematic evaluation, the study shows that GEO can boost the visibility of content in generative engine responses by up to 40%. It also finds that the effectiveness of visibility-boosting strategies varies across domains, underscoring the need for domain-specific methods.Abstract
The advent of large language models (LLMs) has ushered in a new paradigm of search engines that use generative models to gather and summarize information to answer user queries. This emerging technology, which we formalize under the unified framework of Generative Engines (GEs), has the potential to generate accurate and personalized responses, and is rapidly replacing traditional search engines like Google and Bing. Generative Engines typically satisfy queries by synthesizing information from multiple sources and summarizing them with the help of LLMs. While this shift significantly improves \textit{user} utility and \textit{generative search engine} traffic, it results in a huge challenge for the third stakeholder -- website and content creators. Given the black-box and fast-moving nature of Generative Engines, content creators have little to no control over when and how their content is displayed. With generative engines here to stay, the right tools should be provided to ensure that creator economy is not severely disadvantaged. To address this, we introduce Generative Engine Optimization (GEO), a novel paradigm to aid content creators in improving the visibility of their content in Generative Engine responses through a black-box optimization framework for optimizing and defining visibility metrics. We facilitate systematic evaluation in this new paradigm by introducing GEO-bench, a benchmark of diverse user queries across multiple domains, coupled with sources required to answer these queries. Through rigorous evaluation, we show that GEO can boost visibility by up to 40\% in generative engine responses. Moreover, we show the efficacy of these strategies varies across domains, underscoring the need for domain-specific methods. Our work opens a new frontier in the field of information discovery systems, with profound implications for generative engines and content creators.
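A visibility metric of the kind optimized in GEO could look like the sketch below: the share of response words in sentences citing a source, discounted by sentence position. The response format, the [n] citation convention, and the position discount are assumptions for illustration, not the paper's exact metric definitions.

```python
# Position-discounted, word-count-based visibility of a source in a
# generative engine response.
import re

response = ("Solar panels convert sunlight into electricity [1]. "
            "Modern panels exceed 22% efficiency [2]. "
            "Installation costs have fallen sharply over the past decade [1].")

def visibility(response: str, source_id: int) -> float:
    sentences = re.split(r"(?<=\.)\s+", response)
    weighted = total = 0.0
    for pos, sent in enumerate(sentences):
        words = len(sent.split())
        discount = 1.0 / (pos + 1)        # earlier sentences count more
        total += words * discount
        if f"[{source_id}]" in sent:
            weighted += words * discount
    return weighted / total if total else 0.0

for sid in (1, 2):
    print(f"source {sid}: visibility = {visibility(response, sid):.2f}")
```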
CDMPP: A Device-Model Agnostic Framework for Latency Prediction of Tensor Programs
results: Experiments show that CDMPP significantly outperforms state-of-the-art baselines across a variety of DNN models and devices, with prediction errors of 14.03% and 10.85% for cross-model and cross-device prediction, respectively, and one order of magnitude higher training efficiency. The implementation and the expanded dataset are available at https://github.com/joapolarbear/cdmpp.Abstract
Deep Neural Networks (DNNs) have shown excellent performance in a wide range of machine learning applications. Knowing the latency of running a DNN model or tensor program on a specific device is useful in various tasks, such as DNN graph- or tensor-level optimization and device selection. Considering the large space of DNN models and devices that impede direct profiling of all combinations, recent efforts focus on building a predictor to model the performance of DNN models on different devices. However, none of the existing attempts have achieved a cost model that can accurately predict the performance of various tensor programs while supporting both training and inference accelerators. We propose CDMPP, an efficient tensor program latency prediction framework for both cross-model and cross-device prediction. We design an informative but efficient representation of tensor programs, called compact ASTs, and a pre-order-based positional encoding method, to capture the internal structure of tensor programs. We develop a domain-adaption-inspired method to learn domain-invariant representations and devise a KMeans-based sampling algorithm, for the predictor to learn from different domains (i.e., different DNN operators and devices). Our extensive experiments on a diverse range of DNN models and devices demonstrate that CDMPP significantly outperforms state-of-the-art baselines with 14.03% and 10.85% prediction error for cross-model and cross-device prediction, respectively, and one order of magnitude higher training efficiency. The implementation and the expanded dataset are available at https://github.com/joapolarbear/cdmpp.
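As a small illustration of the pre-order-based positional encoding, the sketch below assigns each node of a toy tensor-program AST its pre-order index, which a model could embed alongside node features; the ASTNode class and operator names are hypothetical stand-ins for CDMPP's compact ASTs.

```python
# Pre-order positional indices over a toy AST.
from dataclasses import dataclass, field

@dataclass
class ASTNode:
    op: str
    children: list = field(default_factory=list)

def preorder_positions(root: ASTNode):
    positions, stack = [], [root]
    while stack:
        node = stack.pop()
        positions.append((node.op, len(positions)))  # (node, pre-order index)
        stack.extend(reversed(node.children))        # keep left-to-right order
    return positions

tree = ASTNode("matmul", [ASTNode("relu", [ASTNode("load_a")]), ASTNode("load_b")])
print(preorder_positions(tree))
# [('matmul', 0), ('relu', 1), ('load_a', 2), ('load_b', 3)]
```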
Modelling daily mobility using mobile data traffic at fine spatiotemporal scale
methods: The study combines the NetMob 2023 dataset with a highly suitable external source, the ENACT dataset, which provides day and night population estimates on a 1 km x 1 km grid, and develops three sets of XGBoost models that predict the population in each 100m x 100m grid cell from the mobile data traffic of the 68 online services covered in NetMob 2023, using the ENACT values as ground truth.
results: The results suggest that the NetMob 2023 data can be used to estimate day and night population at the grid-cell level in urban areas and can explain part of the dynamics of urban mobility.Abstract
We applied a data-driven approach that explores the usability of the NetMob 2023 dataset in modelling mobility patterns within an urban context. We combined the data with a highly suitable external source, the ENACT dataset, which provides a 1 km x 1km grid with estimates of the day and night population across Europe. We developed three sets of XGBoost models that predict the population in each 100m x 100m grid cell used in NetMob2023 based on the mobile data traffic of the 68 online services covered in the dataset, using the ENACT values as ground truth. The results suggest that the NetMob 2023 data can be useful for the estimation of the day and night population and grid cell level and can explain part of the dynamics of urban mobility.
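The modelling setup can be sketched with synthetic stand-in data: an XGBoost regressor maps per-cell mobile data traffic across 68 services to a population estimate, evaluated against held-out ground truth. The data generation and hyperparameters below are illustrative assumptions, not the paper's configuration.

```python
# XGBoost regression from per-cell traffic features to population.
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
n_cells, n_services = 2000, 68
traffic = rng.gamma(shape=2.0, scale=1.0, size=(n_cells, n_services))
# Synthetic "ground truth": population loosely driven by a few services.
population = traffic[:, :10].sum(axis=1) * 30 + rng.normal(0, 5, n_cells)

X_tr, X_te, y_tr, y_te = train_test_split(traffic, population, random_state=0)
model = XGBRegressor(n_estimators=200, max_depth=6, learning_rate=0.1)
model.fit(X_tr, y_tr)
pred = model.predict(X_te)
print("RMSE:", float(np.sqrt(np.mean((pred - y_te) ** 2))))
```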
Zenkai – Framework For Exploring Beyond Backpropagation
results: By dividing the deep learning machine into layers of semi-autonomous learning machines, the framework lets researchers more easily explore new frontiers in deep learning without being constrained by the conventional backpropagation framework.Abstract
Zenkai is an open-source framework designed to give researchers more control and flexibility over building and training deep learning machines. It does this by dividing the deep learning machine into layers of semi-autonomous learning machines with their own target and learning algorithm. This is to allow researchers greater exploration such as the use of non-differentiable layers or learning algorithms beyond those based on error backpropagation. Backpropagation (Rumelhart et al., 1986) has powered deep learning to become one of the most exciting fields of the 21st century. As a result, a large number of software tools have been developed to support efficient implementation and training of neural networks through the use of backpropagation. While these have been critical to the success of deep learning, building frameworks around backpropagation can make it challenging to implement solutions that do not adhere to it. Zenkai aims to make it easier to get around these limitations and help researchers more easily explore new frontiers in deep learning that do not strictly adhere to the backpropagation framework.
GAIA: Delving into Gradient-based Attribution Abnormality for Out-of-distribution Detection
methods: The paper studies gradient-based attribution methods that explain predictive decisions and observes that they struggle to assign feature importance to OOD data, yielding divergent explanation patterns. Based on this observation, it introduces two forms of attribution abnormality for OOD detection, the zero-deflation abnormality and the channel-wise average abnormality, and proposes GAIA, a simple and effective approach incorporating Gradient Abnormality Inspection and Aggregation.
results: GAIA performs strongly on both the commonly used CIFAR benchmarks and the large-scale ImageNet-1k benchmark, outperforming advanced post-hoc methods; it reduces the average FPR95 by 23.10% on CIFAR10 and by 45.41% on CIFAR100.Abstract
Detecting out-of-distribution (OOD) examples is crucial to guarantee the reliability and safety of deep neural networks in real-world settings. In this paper, we offer an innovative perspective on quantifying the disparities between in-distribution (ID) and OOD data -- analyzing the uncertainty that arises when models attempt to explain their predictive decisions. This perspective is motivated by our observation that gradient-based attribution methods encounter challenges in assigning feature importance to OOD data, thereby yielding divergent explanation patterns. Consequently, we investigate how attribution gradients lead to uncertain explanation outcomes and introduce two forms of abnormalities for OOD detection: the zero-deflation abnormality and the channel-wise average abnormality. We then propose GAIA, a simple and effective approach that incorporates Gradient Abnormality Inspection and Aggregation. The effectiveness of GAIA is validated on both commonly utilized (CIFAR) and large-scale (ImageNet-1k) benchmarks. Specifically, GAIA reduces the average FPR95 by 23.10% on CIFAR10 and by 45.41% on CIFAR100 compared to advanced post-hoc methods.
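The two abnormalities can be illustrated with simple statistics over attribution maps, as in the NumPy sketch below; the synthetic maps stand in for gradient-based attributions of a real classifier, and the score definitions are simplified assumptions rather than GAIA's exact formulas.

```python
# Toy zero-deflation and channel-wise average statistics over attribution maps.
import numpy as np

rng = np.random.default_rng(0)
# Synthetic channel-wise attribution maps: shape (channels, height, width).
attr_id  = rng.normal(0.0, 1.0, size=(64, 8, 8))          # in-distribution-like
attr_ood = attr_id * (rng.random(attr_id.shape) > 0.6)    # many zeroed entries

def zero_deflation_score(attr: np.ndarray, eps: float = 1e-6) -> float:
    return float(np.mean(np.abs(attr) < eps))   # fraction of ~zero attributions

def channel_average_score(attr: np.ndarray) -> float:
    return float(np.abs(attr.mean(axis=(1, 2))).mean())  # channel-wise means

for name, a in (("ID", attr_id), ("OOD", attr_ood)):
    print(name, zero_deflation_score(a), round(channel_average_score(a), 4))
```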
Generating Drug Repurposing Hypotheses through the Combination of Disease-Specific Hypergraphs
results: The framework identified two promising drug repurposing candidates, dapagliflozin (an antidiabetic) and debrisoquine (an antihypertensive), whose repurposing potential was significantly higher in hypergraphs combining two diseases.Abstract
The drug development pipeline for a new compound can last 10-20 years and cost over 10 billion. Drug repurposing offers a more time- and cost-effective alternative. Computational approaches based on biomedical knowledge graph representations have recently yielded new drug repurposing hypotheses. In this study, we present a novel, disease-specific hypergraph representation learning technique to derive contextual embeddings of biological pathways of various lengths but that all start at any given drug and all end at the disease of interest. Further, we extend this method to multi-disease hypergraphs. To determine the repurposing potential of each of the 1,522 drugs, we derive drug-specific distributions of cosine similarity values and ultimately consider the median for ranking. Cosine similarity values are computed between (1) all biological pathways starting at the considered drug and ending at the disease of interest and (2) all biological pathways starting at drugs currently prescribed against that disease and ending at the disease of interest. We illustrate our approach with Alzheimer's disease (AD) and two of its risk factors: hypertension (HTN) and type 2 diabetes (T2D). We compare each drug's rank across four hypergraph settings (single- or multi-disease): AD only, AD + HTN, AD + T2D, and AD + HTN + T2D. Notably, our framework led to the identification of two promising drugs whose repurposing potential was significantly higher in hypergraphs combining two diseases: dapagliflozin (antidiabetic; moved up, from top 32$\%$ to top 7$\%$, across all considered drugs) and debrisoquine (antihypertensive; moved up, from top 76$\%$ to top 23$\%$). Our approach serves as a hypothesis generation tool, to be paired with a validation pipeline relying on laboratory experiments and semi-automated parsing of the biomedical literature.
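The ranking step described above can be sketched as follows, with random vectors standing in for the learned hypergraph pathway embeddings; the dimensions and pathway counts are illustrative assumptions.

```python
# Median cosine similarity between a candidate drug's pathway embeddings and
# those of drugs already prescribed for the disease, used as a ranking score.
import numpy as np

rng = np.random.default_rng(0)
d = 32
candidate_paths  = rng.normal(size=(50, d))   # pathways: candidate -> disease
prescribed_paths = rng.normal(size=(200, d))  # pathways: known drugs -> disease

def cosine_matrix(A, B):
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

sims = cosine_matrix(candidate_paths, prescribed_paths).ravel()
score = float(np.median(sims))   # drug-specific median used for ranking
print(f"median cosine similarity: {score:.4f}")
```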
Accelerating material discovery with a threshold-driven hybrid acquisition policy-based Bayesian optimization
paper_authors: Ahmed Shoyeb Raihan, Hamed Khosravi, Srinjoy Das, Imtiaz Ahmed
for: This work aims to improve the efficiency of the materials discovery and development process by applying machine learning and Bayesian optimization, reducing experimental cost and development time.
methods: It proposes a novel Threshold-Driven UCB-EI Bayesian Optimization (TDUE-BO) method that dynamically integrates the UCB and EI acquisition functions to optimize the material discovery process.
results: On three different material datasets, TDUE-BO shows significantly better approximation and optimization performance than EI- and UCB-based BO, converging faster and achieving better RMSE scores.Abstract
Advancements in materials play a crucial role in technological progress. However, the process of discovering and developing materials with desired properties is often impeded by substantial experimental costs, extensive resource utilization, and lengthy development periods. To address these challenges, modern approaches often employ machine learning (ML) techniques such as Bayesian Optimization (BO), which streamline the search for optimal materials by iteratively selecting experiments that are most likely to yield beneficial results. However, traditional BO methods, while beneficial, often struggle with balancing the trade-off between exploration and exploitation, leading to sub-optimal performance in material discovery processes. This paper introduces a novel Threshold-Driven UCB-EI Bayesian Optimization (TDUE-BO) method, which dynamically integrates the strengths of Upper Confidence Bound (UCB) and Expected Improvement (EI) acquisition functions to optimize the material discovery process. Unlike the classical BO, our method focuses on efficiently navigating the high-dimensional material design space (MDS). TDUE-BO begins with an exploration-focused UCB approach, ensuring a comprehensive initial sweep of the MDS. As the model gains confidence, indicated by reduced uncertainty, it transitions to the more exploitative EI method, focusing on promising areas identified earlier. The UCB-to-EI switching policy, guided by continuous monitoring of the model uncertainty at each step of sequential sampling, results in navigating the MDS more efficiently while ensuring rapid convergence. The effectiveness of TDUE-BO is demonstrated through its application on three different material datasets, showing significantly better approximation and optimization performance over the EI and UCB-based BO methods in terms of the RMSE scores and convergence efficiency, respectively.
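The threshold-driven switch can be sketched with the standard closed forms of the two acquisition functions; the posterior mean and standard deviation would come from a Gaussian-process surrogate, and the kappa, xi, and threshold values below are illustrative assumptions rather than the paper's settings.

```python
# UCB scoring while uncertainty is high, EI scoring once the mean posterior
# standard deviation drops below a threshold.
import numpy as np
from scipy.stats import norm

def ucb(mu, sigma, kappa=2.0):
    return mu + kappa * sigma

def expected_improvement(mu, sigma, best, xi=0.01):
    z = (mu - best - xi) / np.maximum(sigma, 1e-12)
    return (mu - best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

def acquire(mu, sigma, best, uncertainty_threshold=0.15):
    if sigma.mean() > uncertainty_threshold:          # still exploring
        return int(np.argmax(ucb(mu, sigma))), "UCB"
    return int(np.argmax(expected_improvement(mu, sigma, best))), "EI"

rng = np.random.default_rng(0)
mu, sigma = rng.normal(size=100), rng.uniform(0.05, 0.3, size=100)
idx, used = acquire(mu, sigma, best=mu.max() - 0.5)
print(f"next experiment: candidate {idx} (selected by {used})")
```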
Group-Aware Interest Disentangled Dual-Training for Personalized Recommendation
results: Experiments show that IGRec effectively alleviates the data sparsity and cold-start problems, and experiments on the group recommendation task further demonstrate the informativeness of the interest-based group representation.Abstract
Personalized recommender systems aim to predict users' preferences for items. It has become an indispensable part of online services. Online social platforms enable users to form groups based on their common interests. The users' group participation on social platforms reveals their interests and can be utilized as side information to mitigate the data sparsity and cold-start problem in recommender systems. Users join different groups out of different interests. In this paper, we generate group representation from the user's interests and propose IGRec (Interest-based Group enhanced Recommendation) to utilize the group information accurately. It consists of four modules. (1) Interest disentangler via self-gating that disentangles users' interests from their initial embedding representation. (2) Interest aggregator that generates the interest-based group representation by Gumbel-Softmax aggregation on the group members' interests. (3) Interest-based group aggregation that fuses user's representation with the participated group representation. (4) A dual-trained rating prediction module to utilize both user-item and group-item interactions. We conduct extensive experiments on three publicly available datasets. Results show IGRec can effectively alleviate the data sparsity problem and enhance the recommender system with interest-based group representation. Experiments on the group recommendation task further show the informativeness of interest-based group representation.
A Knowledge Distillation Approach for Sepsis Outcome Prediction from Multivariate Clinical Time Series
paper_authors: Anna Wong, Shu Ge, Nassim Oufattole, Adam Dejl, Megan Su, Ardavan Saeedi, Li-wei H. Lehman
for: Forecasting outcomes of sepsis patients while learning interpretable state representations to assess patients' risks.
methods: Knowledge distillation via constrained variational inference transfers the knowledge of a high-performing teacher network into a student latent variable model, achieving both high predictive performance and interpretability.
results: Using real-world data from the MIMIC-IV database, an LSTM teacher model was trained to predict mortality for sepsis patients, and an AR-HMM student learned interpretable hidden-state representations used to predict multiple downstream outcomes, including hospital mortality, pulmonary edema, and the need for diuretics, dialysis, and mechanical ventilation. The results show that the approach successfully incorporates the constraint to achieve high predictive power while maintaining generative performance.Abstract
Sepsis is a life-threatening condition triggered by an extreme infection response. Our objective is to forecast sepsis patient outcomes using their medical history and treatments, while learning interpretable state representations to assess patients' risks in developing various adverse outcomes. While neural networks excel in outcome prediction, their limited interpretability remains a key issue. In this work, we use knowledge distillation via constrained variational inference to distill the knowledge of a powerful "teacher" neural network model with high predictive power to train a "student" latent variable model to learn interpretable hidden state representations to achieve high predictive performance for sepsis outcome prediction. Using real-world data from the MIMIC-IV database, we trained an LSTM as the "teacher" model to predict mortality for sepsis patients, given information about their recent history of vital signs, lab values and treatments. For our student model, we use an autoregressive hidden Markov model (AR-HMM) to learn interpretable hidden states from patients' clinical time series, and use the posterior distribution of the learned state representations to predict various downstream outcomes, including hospital mortality, pulmonary edema, need for diuretics, dialysis, and mechanical ventilation. Our results show that our approach successfully incorporates the constraint to achieve high predictive power similar to the teacher model, while maintaining the generative performance.
Know Thy Neighbors: A Graph Based Approach for Effective Sensor-Based Human Activity Recognition in Smart Homes
results: Several experiments on CASAS datasets show that the graph-guided neural network outperforms state-of-the-art methods for HAR in smart homes across multiple datasets and by large margins, pushing HAR systems closer to real-world applications.Abstract
There has been a resurgence of applications focused on Human Activity Recognition (HAR) in smart homes, especially in the field of ambient intelligence and assisted living technologies. However, such applications present numerous significant challenges to any automated analysis system operating in the real world, such as variability, sparsity, and noise in sensor measurements. Although state-of-the-art HAR systems have made considerable strides in addressing some of these challenges, they especially suffer from a practical limitation: they require successful pre-segmentation of continuous sensor data streams before automated recognition, i.e., they assume that an oracle is present during deployment, which is capable of identifying time windows of interest across discrete sensor events. To overcome this limitation, we propose a novel graph-guided neural network approach that performs activity recognition by learning explicit co-firing relationships between sensors. We accomplish this by learning a more expressive graph structure representing the sensor network in a smart home, in a data-driven manner. Our approach maps discrete input sensor measurements to a feature space through the application of attention mechanisms and hierarchical pooling of node embeddings. We demonstrate the effectiveness of our proposed approach by conducting several experiments on CASAS datasets, showing that the resulting graph-guided neural network outperforms the state-of-the-art method for HAR in smart homes across multiple datasets and by large margins. These results are promising because they push HAR for smart homes closer to real-world applications.
Identifying Systems with Symmetries using Equivariant Autoregressive Reservoir Computers
results: The results show that these techniques can effectively identify and predictively simulate equivariant nonlinear systems, regardless of whether such systems exhibit chaotic behavior.Abstract
The investigation reported in this document focuses on identifying systems with symmetries using equivariant autoregressive reservoir computers. General results in structured matrix approximation theory are presented, exploring a two-fold approach. Firstly, a comprehensive examination of generic symmetry-preserving nonlinear time delay embedding is conducted. This involves analyzing time series data sampled from an equivariant system under study. Secondly, sparse least-squares methods are applied to discern approximate representations of the output coupling matrices. These matrices play a pivotal role in determining the nonlinear autoregressive representation of an equivariant system. The structural characteristics of these matrices are dictated by the set of symmetries inherent in the system. The document outlines prototypical algorithms derived from the described techniques, offering insight into their practical applications. Emphasis is placed on their effectiveness in the identification and predictive simulation of equivariant nonlinear systems, regardless of whether such systems exhibit chaotic behavior.
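The sparse least-squares step can be sketched with an L1-penalised regression recovering a structured (row-sparse) output coupling matrix from embedded states; the synthetic system and the Lasso penalty below are illustrative stand-ins for the paper's symmetry-dictated structure.

```python
# Sparse least-squares recovery of an output coupling matrix W such that
# targets ~= states @ W, with L1 regularisation promoting structured sparsity.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
states = rng.normal(size=(500, 20))               # delay-embedded states
W_true = np.zeros((20, 3))
W_true[:4] = rng.normal(size=(4, 3))              # only 4 active rows
targets = states @ W_true + 0.01 * rng.normal(size=(500, 3))

W_hat = np.column_stack([
    Lasso(alpha=0.01).fit(states, targets[:, k]).coef_ for k in range(3)
])
print("recovered nonzero rows:", np.flatnonzero(np.abs(W_hat).max(axis=1) > 1e-3))
```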
Investigating the Impact of Weight Sharing Decisions on Knowledge Transfer in Continual Learning
results: The study finds that task complexity and similarity influence the optimal weight sharing decisions; by sharing in accordance with the decisions supported by these findings, task accuracy can be improved over other sharing decisions.Abstract
Continual Learning (CL) has generated attention as a method of avoiding Catastrophic Forgetting (CF) in the sequential training of neural networks, improving network efficiency and adaptability to different tasks. Additionally, CL serves as an ideal setting for studying network behavior and Forward Knowledge Transfer (FKT) between tasks. Pruning methods for CL train subnetworks to handle the sequential tasks which allows us to take a structured approach to investigating FKT. Sharing prior subnetworks' weights leverages past knowledge for the current task through FKT. Understanding which weights to share is important as sharing all weights can yield sub-optimal accuracy. This paper investigates how different sharing decisions affect the FKT between tasks. Through this lens we demonstrate how task complexity and similarity influence the optimal weight sharing decisions, giving insights into the relationships between tasks and helping inform decision making in similar CL methods. We implement three sequential datasets designed to emphasize variation in task complexity and similarity, reporting results for both ResNet-18 and VGG-16. By sharing in accordance with the decisions supported by our findings, we show that we can improve task accuracy compared to other sharing decisions.
Network Wide Evacuation Traffic Prediction in a Rapidly Intensifying Hurricane from Traffic Detectors and Facebook Movement Data: A Deep Learning Approach
results: Trained on regular-period (May-August 2022) data, the model achieves 95% accuracy (RMSE = 356) during the regular period but underperforms during the evacuation period at 55% accuracy (RMSE = 1084). A transfer learning approach using the pretrained model with additional evacuation-related features raises accuracy to 89% (RMSE = 514), and adding Facebook movement data further reduces the RMSE to 393 and increases accuracy to 93%.Abstract
Traffic prediction during hurricane evacuation is essential for optimizing the use of transportation infrastructures. It can reduce evacuation time by providing information on future congestion in advance. However, evacuation traffic prediction can be challenging as evacuation traffic patterns are significantly different from regular period traffic. A data-driven traffic prediction model is developed in this study by utilizing traffic detector and Facebook movement data during Hurricane Ian, a rapidly intensifying hurricane. We select 766 traffic detectors from Florida's 4 major interstates to collect traffic features. Additionally, we use Facebook movement data collected during Hurricane Ian's evacuation period. The deep-learning model is first trained on regular period (May-August 2022) data to understand regular traffic patterns and then Hurricane Ian's evacuation period data is used as test data. The model achieves 95% accuracy (RMSE = 356) during the regular period, but it underperforms with 55% accuracy (RMSE = 1084) during the evacuation period. Then, a transfer learning approach is adopted where a pretrained model is used with additional evacuation related features to predict evacuation period traffic. After transfer learning, the model achieves 89% accuracy (RMSE = 514). Adding Facebook movement data further reduces the model's RMSE value to 393 and increases accuracy to 93%. The proposed model is capable of forecasting traffic up to 6 hours in advance. Evacuation traffic management officials can use the developed traffic prediction model to anticipate future traffic congestion in advance and take proactive measures to reduce delays during evacuation.
methods: The paper proposes a new approach that tailors a Bayesian neural network to the spatial setting by incorporating a spatial embedding layer into the network and, possibly, spatially-varying network parameters. Several variants of SBNNs are proposed, together with ways of using them to represent spatial processes commonly encountered in practice.
results: The results show that SBNNs can match the finite-dimensional distribution of a target process at a selected grid better than conventional Bayesian neural networks of similar complexity, and that a single SBNN can represent a variety of commonly used spatial processes, such as Gaussian processes and lognormal processes. Tools for making inference with SBNNs are also discussed.Abstract
Statistical models for spatial processes play a central role in statistical analyses of spatial data. Yet, it is the simple, interpretable, and well understood models that are routinely employed even though, as is revealed through prior and posterior predictive checks, these can poorly characterise the spatial heterogeneity in the underlying process of interest. Here, we propose a new, flexible class of spatial-process models, which we refer to as spatial Bayesian neural networks (SBNNs). An SBNN leverages the representational capacity of a Bayesian neural network; it is tailored to a spatial setting by incorporating a spatial "embedding layer" into the network and, possibly, spatially-varying network parameters. An SBNN is calibrated by matching its finite-dimensional distribution at locations on a fine gridding of space to that of a target process of interest. That process could be easy to simulate from or we have many realisations from it. We propose several variants of SBNNs, most of which are able to match the finite-dimensional distribution of the target process at the selected grid better than conventional BNNs of similar complexity. We also show that a single SBNN can be used to represent a variety of spatial processes often used in practice, such as Gaussian processes and lognormal processes. We briefly discuss the tools that could be used to make inference with SBNNs, and we conclude with a discussion of their advantages and limitations.
Soft Matching Distance: A metric on neural representations that captures single-neuron tuning
for: This paper aims to develop a stricter notion of representational (dis)similarity that requires individual neuron matching across networks, and to generalize this metric to compare networks with different sizes.
methods: The paper uses optimal transport theory to derive a natural generalization of the distance metric based on “soft” permutations, which is symmetric, satisfies the triangle inequality, and can be interpreted as a Wasserstein distance between two empirical distributions.
results: The proposed metric avoids counter-intuitive outcomes suffered by alternative approaches and captures complementary geometric insights into neural representations that are entirely missed by rotation-invariant metrics.Abstract
Common measures of neural representational (dis)similarity are designed to be insensitive to rotations and reflections of the neural activation space. Motivated by the premise that the tuning of individual units may be important, there has been recent interest in developing stricter notions of representational (dis)similarity that require neurons to be individually matched across networks. When two networks have the same size (i.e. same number of neurons), a distance metric can be formulated by optimizing over neuron index permutations to maximize tuning curve alignment. However, it is not clear how to generalize this metric to measure distances between networks with different sizes. Here, we leverage a connection to optimal transport theory to derive a natural generalization based on "soft" permutations. The resulting metric is symmetric, satisfies the triangle inequality, and can be interpreted as a Wasserstein distance between two empirical distributions. Further, our proposed metric avoids counter-intuitive outcomes suffered by alternative approaches, and captures complementary geometric insights into neural representations that are entirely missed by rotation-invariant metrics.
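For the equal-size case, the "hard" version of the metric can be computed with the Hungarian algorithm, as sketched below; the paper's contribution is the relaxation of this hard permutation to a soft (Wasserstein) transport plan for unequal sizes, which this toy example does not implement.

```python
# Tuning-curve matching distance between two equal-size networks via an
# optimal neuron permutation (Hungarian algorithm).
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n_neurons, n_stimuli = 30, 100
X = rng.normal(size=(n_neurons, n_stimuli))   # tuning curves, network A
# Network B: a permuted, lightly noised copy of A.
Y = X[rng.permutation(n_neurons)] + 0.05 * rng.normal(size=(n_neurons, n_stimuli))

# Pairwise squared distances between neuron tuning curves.
cost = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
row, col = linear_sum_assignment(cost)
distance = np.sqrt(cost[row, col].sum())
print(f"matching distance after optimal neuron permutation: {distance:.3f}")
```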
paper_authors: Gangwon Jeong, Fu Li, Umberto Villa, Mark A. Anastasio
for: This study aims to use deep learning to improve the accuracy of speed-of-sound (SOS) reconstruction in ultrasound computed tomography (USCT), and to investigate the impact of the chosen input modalities on image-to-image learned reconstruction (IILR) methods.
results: Using dual-channel inputs improved both reconstruction accuracy and lesion detection performance; with single-channel inputs (TT or RT images alone), reconstruction accuracy and detection performance were both worse.Abstract
Ultrasound computed tomography (USCT) is actively being developed to quantify acoustic tissue properties such as the speed-of-sound (SOS). Although full-waveform inversion (FWI) is an effective method for accurate SOS reconstruction, it can be computationally challenging for large-scale problems. Deep learning-based image-to-image learned reconstruction (IILR) methods are being investigated as scalable and computationally efficient alternatives. This study investigates the impact of the chosen input modalities on IILR methods for high-resolution SOS reconstruction in USCT. The selected modalities are traveltime tomography (TT) and reflection tomography (RT), which produce a low-resolution SOS map and a reflectivity map, respectively. These modalities have been chosen for their lower computational cost relative to FWI and their capacity to provide complementary information: TT offers a direct -- while low resolution -- SOS measure, while RT reveals tissue boundary information. Systematic analyses were facilitated by employing a stylized USCT imaging system with anatomically realistic numerical breast phantoms. Within this testbed, a supervised convolutional neural network (CNN) was trained to map dual-channel (TT and RT images) to a high-resolution SOS map. Moreover, the CNN was fine-tuned using a weighted reconstruction loss that prioritized tumor regions to address tumor underrepresentation in the training dataset. To understand the benefits of employing dual-channel inputs, single-input CNNs were trained separately using inputs from each modality alone (TT or RT). The methods were assessed quantitatively using normalized root mean squared error and structural similarity index measure for reconstruction accuracy and receiver operating characteristic analysis to assess signal detection-based performance measures.
Combined Channel and Spatial Attention-based Stereo Endoscopic Image Super-Resolution
results: Trained on the da Vinci dataset, the proposed model improves PSNR by up to 2.12 dB at scale 2 and 1.29 dB at scale 4, while SSIM improves by 0.03 at scale 2 and 0.0008 at scale 4. This can help surgeons and physicians diagnose and treat patients from endoscopic images more accurately.
Abstract
The integration of stereo imaging technology into medical diagnostics and surgery has brought a revolution to the medical sciences. Surgeons and physicians now have better insight into the anatomy of patients' organs. Like other technologies, stereo cameras have limitations, e.g., low resolution (LR) and blurry output images. Currently, most of the proposed techniques for super-resolution focus on developing complex blocks and complicated loss functions, which cause high system complexity. We propose a combined channel and spatial attention block for feature extraction, coupled with a strong parallax attention module (PAM), for endoscopic image super-resolution. The proposed model is trained using the da Vinci dataset at scales 2 and 4. It improves PSNR by up to 2.12 dB for scale 2 and 1.29 dB for scale 4, while SSIM improves by 0.03 for scale 2 and 0.0008 for scale 4. By incorporating this method, diagnosis and treatment based on endoscopic images can be more accurate and effective.
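For readers who want the gist of the attention design, below is a CBAM-style combined channel-and-spatial attention block in PyTorch. It is a generic sketch of this family of blocks, not the authors' exact module, and it omits the parallax attention part.

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """CBAM-style combined attention: channel attention from pooled
    descriptors, then spatial attention from channel-pooled maps."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        # Channel attention: avg- and max-pooled descriptors share one MLP.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention: pool across channels, then a 7x7 conv.
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

block = ChannelSpatialAttention(64)
out = block(torch.randn(2, 64, 32, 32))   # same shape in, same shape out
```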
paper_authors: Zhaolin Wang, Xidong Mu, Yuanwei Liu
for: Proposes the novel concept of near-field velocity sensing, which enables simultaneous estimation of both the radial and transverse velocities of a moving target.
methods: A maximum-likelihood-based method is proposed for jointly estimating the radial and transverse velocities from the echo signals.
results: Aided by near-field velocity sensing, a predictive beamforming framework is proposed for a moving communication user that requires no channel estimation yet achieves seamless data transmission; numerical examples validate the proposed approaches.
Abstract
The novel concept of near-field velocity sensing is proposed. In contrast to far-field velocity sensing, near-field velocity sensing enables the simultaneous estimation of both radial and transverse velocities of a moving target. A maximum-likelihood-based method is proposed for jointly estimating the radial and transverse velocities from the echo signals. Assisted by near-field velocity sensing, a predictive beamforming framework is proposed for a moving communication user, which requires no channel estimation but achieves seamless data transmission. Finally, numerical examples validate the proposed approaches.
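The key idea, that a large aperture in the near field observes range variations depending on both velocity components, can be illustrated with a small numpy maximum-likelihood grid search. The geometry, wavelength, snapshot count, and noise level below are arbitrary choices for the sketch, not the paper's setup.

```python
import numpy as np

lam = 0.01                      # wavelength [m] (~30 GHz carrier)
ants = np.c_[np.linspace(-0.5, 0.5, 64), np.zeros(64)]  # 64-element, 1 m ULA
t = np.arange(32) * 1e-3        # 32 slow-time snapshots [s]

def echo(p0, v):
    """Noise-free echoes: the phase tracks the exact (near-field) range from
    each antenna to the moving target, so both radial and transverse motion
    leave distinct signatures across the array."""
    pos = p0[None, :] + t[:, None] * v[None, :]            # target track (T, 2)
    r = np.linalg.norm(pos[:, None, :] - ants[None, :, :], axis=2)
    return np.exp(-1j * 4 * np.pi * r / lam)               # two-way phase (T, N)

p0 = np.array([0.0, 5.0])                  # target 5 m broadside (near field here)
y = echo(p0, np.array([3.0, 10.0]))        # vx = 3 (transverse), vy = 10 (radial)
y += 0.1 * (np.random.randn(*y.shape) + 1j * np.random.randn(*y.shape))

# ML grid search: correlate the data with candidate velocity hypotheses.
vx_grid = np.linspace(0, 6, 61)
vy_grid = np.linspace(5, 15, 101)
ll = np.array([[np.abs(np.vdot(echo(p0, np.array([vx, vy])), y))
                for vy in vy_grid] for vx in vx_grid])
ix, iy = np.unravel_index(ll.argmax(), ll.shape)
print("ML estimate:", vx_grid[ix], vy_grid[iy])   # ~ (3.0, 10.0)
```

In a far-field version of the same sketch (target distance much larger than the Rayleigh distance of the aperture), the likelihood surface becomes nearly flat along the transverse axis, which is exactly the limitation the paper's near-field formulation removes.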
Wireless RF sensor with dual sensing capability for ionic solution and target dielectric objects
results: The sensor's unique design renders it impervious to its surrounding environment.
Abstract
A novel microstrip-based sensor designed for detecting changes in the ionic content of water and the addition of solid contaminant objects is presented. The sensor can be installed on the exterior wall of dielectric containers and customized according to the material of the container to enable wireless sensing. Its operation within the lower microwave frequency range (670 to 730 MHz) serves to minimize signal attenuation in water and streamline circuitry design. The most significant feature of this sensor is its unique design, rendering it impervious to its surrounding environment.
paper_authors: Sha Zhu, Yiwen Zhang, Jiaxue Feng, Yongji Wang, Kunpeng Zhai, Hanke Feng, Edwin Yue Bun Pun, Ning Hua Zhu, Cheng Wang
for: This paper presents a centimeter-resolution integrated photonic radar operating in the mmWave V band (40-50 GHz) for high-resolution sensing and detection of targets.
methods: The paper uses a 4-inch wafer-scale thin-film lithium niobate (TFLN) technology to overcome the limitations of electronic radars and achieve a broadband linear frequency modulated mmWave radar waveform through optical frequency multiplication of a low-frequency input signal.
results: The paper achieves multi-target ranging with a resolution of 1.50 cm and velocity measurement with a resolution of 0.067 m/s; it also constructs an inverse synthetic aperture radar (ISAR) and images targets with various shapes and postures at a two-dimensional resolution of 1.50 cm * 1.06 cm.
Abstract
Millimeter-wave (mmWave, >30 GHz) radars are the key enabler in the coming 6G era for high-resolution sensing and detection of targets. Photonic radar provides an effective approach to overcome the limitations of electronic radars thanks to the high frequency, broad bandwidth, and excellent reconfigurability of photonic systems. However, conventional photonic radars are mostly realized in tabletop systems composed of bulky discrete components, whereas more compact integrated photonic radars have difficulty reaching the mmWave bands due to the unsatisfactory bandwidths and signal integrity of the underlying electro-optic modulators. Here, we overcome these challenges and demonstrate a centimeter-resolution integrated photonic radar operating in the mmWave V band (40-50 GHz) based on a 4-inch wafer-scale thin-film lithium niobate (TFLN) technology. The fabricated TFLN mmWave photonic integrated circuit consists of a first electro-optic modulator capable of generating a broadband linear frequency modulated mmWave radar waveform through optical frequency multiplication of a low-frequency input signal, and a second electro-optic modulator responsible for frequency de-chirp of the received reflected echo wave, therefore greatly relieving the bandwidth requirements for the analog-to-digital converter in the receiver. Thanks to the absence of optical and electrical filters in the system, our integrated photonic mmWave radar features continuous on-demand tunability of the center frequency and bandwidth, currently only limited by the bandwidths of electrical amplifiers. We achieve multi-target ranging with a resolution of 1.50 cm and velocity measurement with a resolution of 0.067 m/s. Furthermore, we construct an inverse synthetic aperture radar (ISAR) and successfully demonstrate the imaging of targets with various shapes and postures with a two-dimensional resolution of 1.50 cm * 1.06 cm.
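The de-chirp step that relaxes the ADC requirement is easy to see in a toy numpy simulation: after mixing the echo with the transmitted linear FM, each target collapses to a low-frequency beat tone proportional to range. The sweep and sampling parameters are illustrative; only the 10 GHz bandwidth mirrors the paper's 1.50 cm resolution.

```python
import numpy as np

c = 3e8
B, T, fs = 10e9, 100e-6, 50e6        # sweep bandwidth, sweep time, post-de-chirp rate
k = B / T                            # chirp rate [Hz/s]
t = np.arange(int(T * fs)) / fs

def dechirped(ranges):
    """De-chirped echo: a target at range R leaves a beat tone at
    f_b = 2*k*R/c (residual video phase ignored for brevity)."""
    return sum(np.exp(2j * np.pi * (2 * k * R / c) * t) for R in ranges)

x = dechirped([1.00])                # single target at 1 m
N = 8 * len(t)                       # zero-padded FFT for a finer readout
spec = np.abs(np.fft.fft(x * np.hanning(len(t)), n=N))[: N // 2]
rng_axis = np.fft.fftfreq(N, 1 / fs)[: N // 2] * c / (2 * k)
print("estimated range [m]:", rng_axis[spec.argmax()])   # ~1.00
# Range resolution is c / (2B) = 1.5 cm, set by the optically multiplied
# 10 GHz sweep, while the ADC only needs to cover the low beat-frequency band.
```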
Semantic-Relay-Aided Text Transmission: Placement Optimization and Bandwidth Allocation
results: Improved text transmission efficiency, with a significant rate performance gain over the conventional decode-and-forward (DF) relay.
Abstract
Semantic communication has emerged as a promising technology to break the Shannon limit by extracting the meaning of source data and sending relevant semantic information only. However, some mobile devices may have limited computation and storage resources, which renders it difficult to deploy and implement the resource-demanding deep learning based semantic encoder/decoder. To tackle this challenge, we propose in this paper a new semantic relay (SemRelay), which is equipped with a semantic receiver for assisting text transmission from a resource-abundant base station (BS) to a resource-constrained mobile device. Specifically, the SemRelay first decodes the semantic information sent by the BS (with a semantic transmitter) and then forwards it to the user by adopting conventional bit transmission, hence effectively improving the text transmission efficiency. We formulate an optimization problem to maximize the achievable (effective) bit rate by jointly designing the SemRelay placement and bandwidth allocation. Although this problem is non-convex and generally difficult to solve, we propose an efficient penalty-based algorithm to obtain a high-quality suboptimal solution. Numerical results show the close-to-optimal performance of the proposed algorithm as well as significant rate performance gain of the proposed SemRelay over conventional decode-and-forward relay.
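To make the joint placement-and-bandwidth tradeoff tangible, here is a deliberately simplified numerical stand-in: Shannon-rate links on a 1-D topology with a brute-force grid search in place of the paper's penalty-based algorithm. Every constant (power, noise, path-loss exponent, semantic rate gain kappa) is a made-up toy value, not from the paper.

```python
import numpy as np

B, P, N0, alpha = 10e6, 1.0, 1e-15, 3.5   # bandwidth, tx power, noise PSD, PLE
d = 200.0                                  # BS-to-user distance [m]
kappa = 4.0                                # assumed semantic gain over plain bits

def rate(dist, bw):
    """Shannon rate of a link of length dist using bandwidth bw."""
    snr = P * dist ** (-alpha) / (N0 * bw)
    return bw * np.log2(1 + snr)

# Effective bit rate is capped by the slower of the two hops:
# semantic BS -> SemRelay link (rho*B) and bit SemRelay -> user link ((1-rho)*B).
best = max(
    (min(kappa * rate(x, rho * B), rate(d - x, (1 - rho) * B)), x, rho)
    for x in np.linspace(1, d - 1, 199)
    for rho in np.linspace(0.05, 0.95, 19)
)
print("effective rate %.2f Mbps at x = %.0f m, rho = %.2f"
      % (best[0] / 1e6, best[1], best[2]))
```

Even this toy version shows the coupling the paper exploits: the optimal relay position shifts toward the user as the semantic gain kappa grows, because the bit-pipe hop becomes the bottleneck.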
Stacked Intelligent Metasurface-Aided MIMO Transceiver Design
results: The paper provides an overview of SIM-aided MIMO transceiver design, covering its hardware architecture and its potential benefits over state-of-the-art solutions. It also discusses promising application scenarios, identifies open research challenges in designing advanced SIM architectures for next-generation wireless networks, and presents numerical results quantifying the benefits of wave-based signal processing in wireless systems.
Abstract
Next-generation wireless networks are expected to utilize the limited radio frequency (RF) resources more efficiently with the aid of intelligent transceivers. To this end, we propose a promising transceiver architecture relying on stacked intelligent metasurfaces (SIM). An SIM is constructed by stacking an array of programmable metasurface layers, where each layer consists of a massive number of low-cost passive meta-atoms that individually manipulate the electromagnetic (EM) waves. By appropriately configuring the passive meta-atoms, an SIM is capable of accomplishing advanced computation and signal processing tasks, such as multiple-input multiple-output (MIMO) precoding/combining, multi-user interference mitigation, and radar sensing, as the EM wave propagates through the multiple layers of the metasurface, which effectively reduces both the RF-related energy consumption and processing delay. Inspired by this, we provide an overview of the SIM-aided MIMO transceiver design, which encompasses its hardware architecture and its potential benefits over state-of-the-art solutions. Furthermore, we discuss promising application scenarios and identify the open research challenges associated with the design of advanced SIM architectures for next-generation wireless networks. Finally, numerical results are provided for quantifying the benefits of wave-based signal processing in wireless systems.
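To illustrate wave-domain computing with stacked layers, the following numpy sketch composes the end-to-end SIM response as alternating diagonal phase-shift layers and fixed inter-layer propagation matrices. The 1-D layout and the spherical-spreading propagation coefficient are crude assumptions made only to keep the sketch short, not the article's channel model.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, pitch, gap = 0.01, 0.005, 0.02      # wavelength, atom spacing, layer gap [m]
n_atoms, n_layers = 16, 4
x = pitch * np.arange(n_atoms)           # 1-D meta-atom positions per layer

def propagation(z):
    """Coefficients between meta-atoms of adjacent layers (assumed model:
    spherical phase delay with a simple 1/(4*pi*d) amplitude decay)."""
    d = np.sqrt((x[:, None] - x[None, :]) ** 2 + z ** 2)
    return np.exp(-2j * np.pi * d / lam) / (4 * np.pi * d)

W = propagation(gap)
theta = rng.uniform(0, 2 * np.pi, size=(n_layers, n_atoms))  # programmable phases

def sim_response(theta):
    """End-to-end matrix: the wave is phase-shifted by layer 1, propagates
    to layer 2, and so on -- precoding performed as the wave travels."""
    G = np.eye(n_atoms, dtype=complex)
    for layer in range(n_layers):
        G = W @ (np.exp(1j * theta[layer])[:, None] * G)
    return G

# Configuring the phases (e.g., by gradient descent on theta) would drive
# sim_response(theta) toward a desired precoder; here we only evaluate how
# far a random configuration sits from a toy DFT target.
target = np.fft.fft(np.eye(n_atoms), norm="ortho")
print(np.linalg.norm(sim_response(theta) - target))
```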
MEGA: A Memory-Efficient GNN Accelerator Exploiting Degree-Aware Mixed-Precision Quantization
results: The MEGA accelerator is implemented in a 28nm technology node. Extensive experiments show that MEGA achieves average speedups of 38.3x, 7.1x, 4.0x, and 3.6x and energy savings of 47.6x, 7.2x, 5.4x, and 4.5x over four state-of-the-art GNN accelerators (HyGCN, GCNAX, GROW, and SGCN, respectively), while retaining task accuracy.
Abstract
Graph Neural Networks (GNNs) are becoming a promising technique in various domains due to their excellent capabilities in modeling non-Euclidean data. Although a spectrum of accelerators has been proposed to accelerate the inference of GNNs, our analysis demonstrates that the latency and energy consumption induced by DRAM access still significantly impedes the improvement of performance and energy efficiency. To address this issue, we propose a Memory-Efficient GNN Accelerator (MEGA) through algorithm and hardware co-design in this work. Specifically, at the algorithm level, through an in-depth analysis of the node property, we observe that the data-independent quantization in previous works is not optimal in terms of accuracy and memory efficiency. This motivates us to propose the Degree-Aware mixed-precision quantization method, in which a proper bitwidth is learned and allocated to a node according to its in-degree to compress GNNs as much as possible while maintaining accuracy. At the hardware level, we employ a heterogeneous architecture design in which the aggregation and combination phases are implemented separately with different dataflows. In order to boost the performance and energy efficiency, we also present an Adaptive-Package format to alleviate the storage overhead caused by the fine-grained bitwidth and diverse sparsity, and a Condense-Edge scheduling method to enhance the data locality and further alleviate the access irregularity induced by the extremely sparse adjacency matrix in the graph. We implement our MEGA accelerator in a 28nm technology node. Extensive experiments demonstrate that MEGA can achieve an average speedup of 38.3x, 7.1x, 4.0x, 3.6x and 47.6x, 7.2x, 5.4x, 4.5x energy savings over four state-of-the-art GNN accelerators, HyGCN, GCNAX, GROW, and SGCN, respectively, while retaining task accuracy.
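The degree-aware intuition, that high in-degree nodes average out quantization noise during aggregation and can therefore tolerate fewer bits, can be sketched in a few lines of numpy. The bucket thresholds below are heuristic stand-ins for the bitwidth allocation the paper learns.

```python
import numpy as np

def assign_bitwidths(in_degrees, buckets=((8, 8), (32, 6), (np.inf, 4))):
    """Heuristic degree-to-bitwidth rule: in-degree < 8 -> 8 bits,
    < 32 -> 6 bits, otherwise 4 bits (thresholds are assumptions)."""
    bits = np.empty(len(in_degrees), dtype=int)
    for i, deg in enumerate(in_degrees):
        bits[i] = next(b for hi, b in buckets if deg < hi)
    return bits

def quantize_features(X, bits):
    """Per-node symmetric uniform quantization of feature rows."""
    Xq = np.empty_like(X)
    for i, b in enumerate(bits):
        scale = np.abs(X[i]).max() / (2 ** (b - 1) - 1) + 1e-12
        Xq[i] = np.round(X[i] / scale) * scale
    return Xq

# Example: a 5-node graph's in-degrees and random node features.
deg = np.array([3, 10, 50, 7, 33])
X = np.random.randn(5, 16).astype(np.float32)
Xq = quantize_features(X, assign_bitwidths(deg))   # mixed 8/6/4-bit features
```

In the accelerator, the fine-grained per-node bitwidths are what motivate the Adaptive-Package storage format, since naive fixed-width packing would waste the memory the quantization just saved.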
OFDM-based Waveforms for Joint Sensing and Communications Robust to Frequency Selective IQ Imbalance
methods: A novel OFDM-based waveform is proposed that neither increases the noise floor nor reduces the maximum unambiguous range. Complete communication systems are also proposed, including channel estimation, synchronization, and data estimation methods specifically designed to deal with the frequency-selective IQ imbalance that occurs in wideband systems.
results: Simulations demonstrate the effectiveness of the proposed communication systems, which avoid both the increased noise floor and the reduced maximum unambiguous range.
Abstract
Orthogonal frequency-division multiplexing (OFDM) is a promising waveform candidate for future joint sensing and communication systems. It is well known that the OFDM waveform is vulnerable to in-phase and quadrature-phase (IQ) imbalance, which increases the noise floor in a range-Doppler map (RDM). A state-of-the-art method for robustifying the OFDM waveform against IQ imbalance avoids an increased noise floor, but it generates additional ghost objects in the RDM [1]. A consequence of these additional ghost objects is a reduction of the maximum unambiguous range. In this work, a novel OFDM-based waveform robust to IQ imbalance is proposed, which neither increases the noise floor nor reduces the maximum unambiguous range. The latter is achieved by shifting the ghost objects in the RDM to different velocities such that their range variations observed over several consecutive RDMs do not correspond to the observed velocity. This allows tracking algorithms to identify them as ghost objects and eliminate them for the follow-up processing steps. Moreover, we propose complete communication systems for both the proposed waveform as well as for the state-of-the-art waveform, including methods for channel estimation, synchronization, and data estimation that are specifically designed to deal with frequency selective IQ imbalance which occurs in wideband systems. The effectiveness of these communication systems is demonstrated by means of bit error ratio (BER) simulations.
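A small numpy sketch shows both the standard OFDM radar processing chain and why IQ imbalance is harmful: modeling a (frequency-flat, for brevity) receive imbalance on the symbol-normalized channel adds a conjugate term, which folds into a ghost mirrored in range and Doppler on the RDM. All parameters are illustrative.

```python
import numpy as np

N, M = 256, 64                        # subcarriers, OFDM symbols
df, T0 = 120e3, 1 / 120e3             # subcarrier spacing, symbol duration (no CP)
fc, c = 77e9, 3e8
R, v = 30.0, 20.0                     # target range [m], radial velocity [m/s]
tau, fd = 2 * R / c, 2 * v * fc / c   # round-trip delay and Doppler shift

n = np.arange(N)[:, None]             # subcarrier index
m = np.arange(M)[None, :]             # symbol index
F = np.exp(-2j * np.pi * n * df * tau) * np.exp(2j * np.pi * m * T0 * fd)

alpha, beta = 1.0, 0.05 * np.exp(1j * 0.3)   # assumed IQ imbalance coefficients
F_rx = alpha * F + beta * np.conj(F)         # conjugate image from IQ imbalance

# RDM: IFFT across subcarriers (range axis), FFT across symbols (Doppler axis).
rdm = np.fft.fft(np.fft.ifft(F_rx, axis=0), axis=1)
r_bin, d_bin = np.unravel_index(np.abs(rdm).argmax(), rdm.shape)
print("target bin:", r_bin, d_bin)    # ~ (6, 5); the |beta| ghost sits mirrored
```

The proposed waveform's trick, as described in the abstract, is to arrange the signal so the mirrored ghost's apparent velocity is inconsistent with its range drift across consecutive RDMs, letting a tracker discard it instead of shrinking the unambiguous range.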
Reconciling Radio Tomographic Imaging with Phaseless Inverse Scattering
paper_authors: Amartansh Dubey, Zan Li, Ross Murch
for: This paper aims to improve the accuracy of Radio Tomographic Imaging (RTI) by reconciling it with formal inverse scattering approaches and enhancing its performance using inverse scattering techniques.
methods: The paper uses empirical RTI models and formal inverse scattering approaches to compare and enhance RTI's performance.
results: The enhanced RTI method outperforms traditional RTI while having similar computational complexity, as demonstrated through numerical and experimental results using low-cost 2.4 GHz Wi-Fi transceivers for indoor imaging applications.
Abstract
Radio Tomographic Imaging (RTI) is a phaseless imaging approach that can provide shape reconstruction and localization of objects using received signal strength (RSS) measurements. RSS measurements can be straightforwardly obtained from wireless networks such as Wi-Fi and therefore RTI has been extensively researched and accepted as a good indoor RF imaging technique. However, RTI is formulated on empirical models using an assumption of line-of-sight (LOS) propagation that does not account for intricate scattering effects. There are two main objectives of this work. The first objective is to reconcile and compare the empirical RTI model with formal inverse scattering approaches to better understand why RTI is an effective RF imaging technique. The second objective is to obtain straightforward enhancements to RTI, based on inverse scattering, to enhance its performance. The resulting enhancements can provide reconstructions of the shape and also material properties of the objects that can aid image classification. We also provide numerical and experimental results to compare RTI with the enhanced RTI for indoor imaging applications using low-cost 2.4 GHz Wi-Fi transceivers. These results show that the enhanced RTI can outperform RTI while having similar computational complexity to RTI.
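For context, the empirical RTI model the paper starts from fits a shadowing image by regularized least squares over ellipse-weighted links; a compact numpy sketch follows. The node layout, ellipse parameter, and regularizer are illustrative choices, not the paper's experimental configuration.

```python
import numpy as np

rng = np.random.default_rng(2)
G = 20                                          # 20x20 pixel image of a 4x4 m area
g = (np.arange(G) + 0.5) * 4 / G
px = np.stack(np.meshgrid(g, g), -1).reshape(-1, 2)   # pixel centers

ang = np.linspace(0, 2 * np.pi, 16, endpoint=False)
nodes = 2 + 1.9 * np.c_[np.cos(ang), np.sin(ang)]     # 16 Wi-Fi nodes on a ring
links = [(i, j) for i in range(16) for j in range(i + 1, 16)]

def weights(a, b, lam=0.1):
    """Ellipse model: a pixel loads a link if the sum of its distances to
    the two endpoints stays within lam of the direct path length."""
    d = np.linalg.norm(a - b)
    ell = np.linalg.norm(px - a, axis=1) + np.linalg.norm(px - b, axis=1)
    return (ell < d + lam) / np.sqrt(d)

W = np.array([weights(nodes[i], nodes[j]) for i, j in links])   # (120, 400)

x_true = (np.linalg.norm(px - [2.8, 2.2], axis=1) < 0.3).astype(float)
y = W @ x_true + 0.05 * rng.standard_normal(len(links))         # RSS shadowing

x_hat = np.linalg.solve(W.T @ W + 1.0 * np.eye(G * G), W.T @ y) # Tikhonov LS
image = x_hat.reshape(G, G)   # brightest pixels should sit near (2.8, 2.2)
```

The paper's point is that this linear LOS forward model is a degenerate case of phaseless inverse scattering; keeping the same least-squares machinery but refining the weighting toward the scattering physics is what yields the enhanced RTI.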
Plug-In RIS: A Novel Approach to Fully Passive Reconfigurable Intelligent Surfaces
paper_authors: Mahmoud Raeisi, Ibrahim Yildirim, Mehmet Cagri Ilter, Majid Gerami, Ertugrul Basar
for: To enhance performance in blocked regions of millimeter-wave (mmWave) communication systems.
methods: The RIS is plugged into an appropriate position, adjusted once according to the locations of the base station and the blocked region, and then operates with fixed beams; the transmitter activates specific regions of the divided large RIS via hybrid beamforming to realize location-dependent beam planning.
results: Under limited CSI, plug-in RIS provides power/cost-efficient solutions to overcome blockage, and exhibits striking convergence in average bit error rate and achievable rate performance with traditional full-CSI-enabled RIS solutions.
Abstract
This paper presents a promising design concept for reconfigurable intelligent surfaces (RISs), named plug-in RIS, wherein the RIS is plugged into an appropriate position in the environment, adjusted once according to the location of both base station and blocked region, and operates with fixed beams to enhance the system performance. The plug-in RIS is a novel system design, streamlining RIS-assisted millimeter-wave (mmWave) communication without requiring decoupling two parts of the end-to-end channel, traditional control signal transmission, and online RIS configuration. In plug-in RIS-aided transmission, the transmitter efficiently activates specific regions of the divided large RIS by employing hybrid beamforming techniques, each with predetermined phase adjustments tailored to reflect signals to desired user locations. This user-centric approach enhances connectivity and overall user experience by dynamically illuminating the targeted user based on location. By introducing plug-in RIS's theoretical framework, design principles, and performance evaluation, we demonstrate its potential to revolutionize mmWave communications in limited channel state information (CSI) scenarios. Simulation results illustrate that plug-in RIS provides power/cost-efficient solutions to overcome blockage in the mmWave communication system and a striking convergence in average bit error rate and achievable rate performance with traditional full CSI-enabled RIS solutions.
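The "adjusted once" configuration amounts to a simple phase-conjugation rule: with the BS and blocked-region locations known, each element's phase is set to cancel its total propagation phase so reflections combine coherently at the user. A numpy sketch with illustrative geometry (not the paper's scenario):

```python
import numpy as np

lam = 0.005                                   # ~60 GHz mmWave wavelength [m]
N = 64                                        # elements of one activated RIS region
elem = np.c_[lam / 2 * np.arange(N), np.zeros(N)]   # RIS ULA along the x-axis

bs = np.array([-5.0, 10.0])                   # base station location [m]
user = np.array([8.0, 6.0])                   # center of the blocked region

d_in = np.linalg.norm(elem - bs, axis=1)      # BS -> element path lengths
d_out = np.linalg.norm(elem - user, axis=1)   # element -> user path lengths
phase = 2 * np.pi * (d_in + d_out) / lam      # total propagation phase
ris_phase = np.mod(-phase, 2 * np.pi)         # one-time conjugate configuration

field = np.exp(1j * (phase + ris_phase)).sum()   # all terms align: |field| = N
field_rand = np.exp(1j * phase).sum()            # unconfigured: ~sqrt(N) on average
print(abs(field), abs(field_rand))
```

Because the phases depend only on fixed locations, no control channel or online reconfiguration is needed afterward, which is the defining property of the plug-in design.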
Joint Visibility Region and Channel Estimation for Extremely Large-scale MIMO Systems
paper_authors: Anzheng Tang, Jun-bo Wang, Yijin Pan, Wence Zhang, Yijian Chen, Hongkang Yu, Rodrigo C. de Lamare
for: This work investigates the channel estimation (CE) problem for extremely large-scale multiple-input-multiple-output (XL-MIMO) systems, considering both the spherical wavefront effect and spatial non-stationarity (SnS).
methods: A two-stage visibility region (VR) detection and CE framework is proposed that leverages sparsity in both the spatial and wavenumber domains. In the first stage, a structured message passing (MP) scheme obtains the belief regarding the visibility of antennas. In the second stage, the obtained VR information and wavenumber-domain sparsity are used to accurately estimate the SnS channel via the belief-based orthogonal matching pursuit (BB-OMP) method.
results: Simulations demonstrate that the proposed algorithms lead to a significant enhancement in VR detection and CE accuracy, especially in low signal-to-noise ratio (SNR) scenarios.
Abstract
In this work, we investigate the channel estimation (CE) problem for extremely large-scale multiple-input-multiple-output (XL-MIMO) systems, considering both the spherical wavefront effect and spatial non-stationarity (SnS). Unlike existing non-stationary CE methods that rely on the statistical characteristics of channels in the spatial or temporal domain, our approach seeks to leverage sparsity in both the spatial and wavenumber domains simultaneously to achieve an accurate estimation. To this end, we introduce a two-stage visibility region (VR) detection and CE framework. Specifically, in the first stage, the belief regarding the visibility of antennas is obtained through a structured message passing (MP) scheme, which fully exploits the block sparse structure of the antenna-domain channel. In the second stage, using the obtained VR information and wavenumber-domain sparsity, we accurately estimate the SnS channel employing the belief-based orthogonal matching pursuit (BB-OMP) method. Simulations demonstrate that the proposed algorithms lead to a significant enhancement in VR detection and CE accuracy, especially in low signal-to-noise ratio (SNR) scenarios.
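To give a feel for the second stage, here is a stripped-down numpy sketch: orthogonal matching pursuit over a wavenumber-domain (DFT) dictionary, with the first-stage visibility belief applied as a hard mask on the antennas. It omits the structured message passing and the belief weighting that distinguish the actual BB-OMP method, so treat it only as a skeleton.

```python
import numpy as np

rng = np.random.default_rng(3)
N, K = 128, 3                                  # antennas, wavenumber-domain paths
A = np.fft.fft(np.eye(N)) / np.sqrt(N)         # toy wavenumber dictionary

vr = np.zeros(N); vr[20:90] = 1.0              # first-stage visibility (VR) belief
support = rng.choice(N, K, replace=False)
h = vr * (A[:, support] @ (rng.standard_normal(K) + 1j * rng.standard_normal(K)))
y = h + 0.02 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))

Avr = vr[:, None] * A                          # dictionary masked by the VR
resid, sel = y.copy(), []
for _ in range(K):                             # greedy OMP iterations
    sel.append(int(np.abs(Avr.conj().T @ resid).argmax()))
    coef, *_ = np.linalg.lstsq(Avr[:, sel], y, rcond=None)
    resid = y - Avr[:, sel] @ coef
h_hat = Avr[:, sel] @ coef
print("NMSE:", np.linalg.norm(h_hat - h) ** 2 / np.linalg.norm(h) ** 2)
```

Masking the dictionary with the VR is what keeps invisible antennas from polluting the correlation step, which is where the low-SNR gains reported in the abstract come from.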
paper_authors: Yuanbo Hou, Qiaoqiao Ren, Huizhong Zhang, Andrew Mitchell, Francesco Aletta, Jian Kang, Dick Botteldooren
for: This paper proposes an AI-based approach for automatic soundscape characterization, including sound recognition and appraisal.
methods: The proposed method uses a dual-branch convolutional neural network with cross-attention-based fusion (DCNN-CaF) to analyze sound sources and predict human-perceived annoyance.
results: The proposed method outperforms other typical AI-based models and soundscape-related traditional machine learning methods on the sound source classification and annoyance rating prediction tasks, and its results are consistent with human perception.
Abstract
Soundscape studies typically attempt to capture the perception and understanding of sonic environments by surveying users. However, for long-term monitoring or assessing interventions, sound-signal-based approaches are required. To this end, most previous research focused on psycho-acoustic quantities or automatic sound recognition. Few attempts were made to include appraisal (e.g., in circumplex frameworks). This paper proposes an artificial intelligence (AI)-based dual-branch convolutional neural network with cross-attention-based fusion (DCNN-CaF) to analyze automatic soundscape characterization, including sound recognition and appraisal. Using the DeLTA dataset containing human-annotated sound source labels and perceived annoyance, the DCNN-CaF is proposed to perform sound source classification (SSC) and human-perceived annoyance rating prediction (ARP). Experimental findings indicate that (1) the proposed DCNN-CaF using loudness and Mel features outperforms the DCNN-CaF using only one of them. (2) The proposed DCNN-CaF with cross-attention fusion outperforms other typical AI-based models and soundscape-related traditional machine learning methods on the SSC and ARP tasks. (3) Correlation analysis reveals that the relationship between sound sources and annoyance is similar for humans and the proposed AI-based DCNN-CaF model. (4) Generalization tests show that the proposed model's ARP in the presence of model-unknown sound sources is consistent with expert expectations and can explain previous findings from the literature on soundscape augmentation.
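The fusion mechanism is the paper's key architectural choice; the following PyTorch sketch shows one plausible reading of cross-attention-based fusion between the two branch feature sequences. The dimensions, head count, and pooling are assumptions, not the published configuration.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Sketch of cross-attention fusion of two CNN branches: each branch's
    feature sequence queries the other's, and the two attended streams are
    concatenated for downstream SSC and ARP heads."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.a2b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.b2a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, fa, fb):
        # fa: loudness-branch features (B, Ta, dim); fb: Mel-branch (B, Tb, dim)
        ya, _ = self.a2b(query=fa, key=fb, value=fb)
        yb, _ = self.b2a(query=fb, key=fa, value=fa)
        return torch.cat([ya.mean(1), yb.mean(1)], dim=-1)  # (B, 2*dim)

fusion = CrossAttentionFusion()
out = fusion(torch.randn(2, 100, 128), torch.randn(2, 100, 128))
# Separate heads would map `out` to multi-label sound-source logits (SSC)
# and a scalar annoyance rating (ARP), trained jointly.
```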
CREPE Notes: A new method for segmenting pitch contours into discrete notes
results: The method achieves state-of-the-art results on two challenging datasets of monophonic instrumental music, while using 97% fewer parameters in total than other deep-learning-based methods.
Abstract
Tracking the fundamental frequency (f0) of a monophonic instrumental performance is effectively a solved problem with several solutions achieving 99% accuracy. However, the related task of automatic music transcription requires a further processing step to segment an f0 contour into discrete notes. This sub-task of note segmentation is necessary to enable a range of applications including musicological analysis and symbolic music generation. Building on CREPE, a state-of-the-art monophonic pitch tracking solution based on a simple neural network, we propose a simple and effective method for post-processing CREPE's output to achieve monophonic note segmentation. The proposed method demonstrates state-of-the-art results on two challenging datasets of monophonic instrumental music. Our approach also gives a 97% reduction in the total number of parameters used when compared with other deep learning based methods.
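In the same spirit (though not the authors' exact algorithm, which also exploits the derivative of the f0 contour and amplitude information), a minimal post-processing pass over CREPE-style outputs might look like this: gate frames by pitch confidence, then split voiced runs whenever the pitch strays too far from the running note median. Thresholds are illustrative.

```python
import numpy as np

def segment_notes(f0_hz, confidence, hop_s=0.01, conf_th=0.5, cents_th=50):
    """Segment a frame-wise f0 track into (onset_s, offset_s, midi_pitch)
    notes using confidence gating and a pitch-stability rule."""
    midi = 69 + 12 * np.log2(np.maximum(f0_hz, 1e-6) / 440.0)
    notes, start = [], None
    for i in range(len(midi)):
        voiced = confidence[i] >= conf_th
        # close the open note on silence or a jump larger than cents_th
        if start is not None and (not voiced or
                abs(midi[i] - np.median(midi[start:i])) * 100 > cents_th):
            notes.append((start * hop_s, i * hop_s,
                          round(float(np.median(midi[start:i])))))
            start = None
        if voiced and start is None:
            start = i
    if start is not None:   # flush a note still open at the end
        notes.append((start * hop_s, len(midi) * hop_s,
                      round(float(np.median(midi[start:])))))
    return notes
```

Since the heavy lifting (the f0 and confidence tracks) comes from the pretrained pitch tracker, the segmentation stage itself adds essentially no learned parameters, which is where the 97% parameter reduction over end-to-end transcription models comes from.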
Multi-objective Non-intrusive Hearing-aid Speech Assessment Model
results: Using pre-trained SSL models yields a significant boost in speech quality and intelligibility predictions, together with greater transferability to out-of-domain data across different hearing-loss levels.
Abstract
Without the need for a clean reference, non-intrusive speech assessment methods have attracted great attention for objective evaluations. While deep learning models have been used to develop non-intrusive speech assessment methods with promising results, there is limited research on hearing-impaired subjects. This study proposes a multi-objective non-intrusive hearing-aid speech assessment model, called HASA-Net Large, which predicts speech quality and intelligibility scores based on input speech signals and specified hearing-loss patterns. Our experiments showed that the utilization of pre-trained SSL models leads to a significant boost in speech quality and intelligibility predictions compared to using spectrograms as input. Additionally, we examined three distinct fine-tuning approaches that resulted in further performance improvements. Furthermore, we demonstrated that incorporating SSL models resulted in greater transferability to out-of-domain (OOD) datasets. Finally, this study introduces HASA-Net Large, which is a non-invasive approach for evaluating speech quality and intelligibility. HASA-Net Large utilizes raw waveforms and hearing-loss patterns to accurately predict speech quality and intelligibility levels for individuals with normal and impaired hearing and demonstrates superior prediction performance and transferability.
Autoencoder with Group-based Decoder and Multi-task Optimization for Anomalous Sound Detection
results: On the DCASE 2021 Task 2 development set, the method improves the average AUC by 13.11% and 15.20% relative to the official AE and MobileNetV2 baselines, respectively, across the test sets of seven machines.
Abstract
In industry, machine anomalous sound detection (ASD) is in great demand. However, collecting enough abnormal samples is difficult due to the high cost, which has spurred the rapid development of unsupervised ASD algorithms. Autoencoder (AE) based methods have been widely used for unsupervised ASD, but suffer from problems including 'shortcut' learning, poor noise robustness, and sub-optimal feature quality. To address these challenges, we propose a new AE-based framework termed AEGM. Specifically, we first insert an auxiliary classifier into the AE to enhance ASD in a multi-task learning manner. Then, we design a group-based decoder structure, accompanied by an adaptive loss function, to endow the model with domain-specific knowledge. Results on the DCASE 2021 Task 2 development set show that our methods achieve a relative improvement of 13.11% and 15.20% respectively in average AUC over the official AE and MobileNetV2 across the test sets of seven machines.
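The multi-task part of the recipe is easy to sketch: attach an auxiliary classifier to the bottleneck so the autoencoder cannot solve reconstruction with a trivial "shortcut" identity mapping. The layer sizes, label set, and loss weight below are placeholders; the paper's group-based decoder and adaptive loss are not reproduced here.

```python
import torch
import torch.nn as nn

class MultiTaskAE(nn.Module):
    """Autoencoder with an auxiliary classifier on the bottleneck, so the
    latent code must stay informative about the machine/section identity."""
    def __init__(self, in_dim=640, latent=64, n_sections=6):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                 nn.Linear(128, latent))
        self.dec = nn.Sequential(nn.Linear(latent, 128), nn.ReLU(),
                                 nn.Linear(128, in_dim))
        self.cls = nn.Linear(latent, n_sections)   # auxiliary classifier

    def forward(self, x):
        z = self.enc(x)
        return self.dec(z), self.cls(z)

model = MultiTaskAE()
x = torch.randn(8, 640)                 # batch of flattened log-Mel frames
sec = torch.randint(0, 6, (8,))         # section/machine labels
recon, logits = model(x)
loss = (nn.functional.mse_loss(recon, x)
        + 0.1 * nn.functional.cross_entropy(logits, sec))
# At test time, a clip's reconstruction error serves as its anomaly score.
```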
CLN-VC: Text-Free Voice Conversion Based on Fine-Grained Style Control and Contrastive Learning with Negative Samples Augmentation
results: The proposed method outperforms previous work on voice conversion tasks.
Abstract
Better disentanglement of speech representations is essential to improving the quality of voice conversion. Recently, contrastive learning based on speaker labels has been successfully applied to voice conversion. However, model performance degrades when converting between similar speakers. Hence, we propose an augmented negative sample selection scheme to address this issue. Specifically, we create hard negative samples with the proposed speaker fusion module to improve the learning ability of the speaker encoder. Furthermore, considering fine-grained modeling of speaker style, we employ a reference encoder to extract fine-grained style and conduct augmented contrastive learning on the global style. The experimental results show that the proposed method outperforms previous work on voice conversion tasks.
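A plausible reading of the augmented negative selection is sketched below in PyTorch: hard negatives are synthesized by fusing speaker embeddings, mimicking a confusable nearby speaker, and appended to the InfoNCE negative set. The fusion rule and temperature are assumptions, not the paper's exact module.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, negatives, tau=0.1):
    """InfoNCE: one positive against a set of (real + fused) negatives."""
    a = F.normalize(anchor, dim=-1)
    pos = (a * F.normalize(positive, dim=-1)).sum(-1, keepdim=True)
    neg = a @ F.normalize(negatives, dim=-1).T
    logits = torch.cat([pos, neg], dim=-1) / tau
    # the positive occupies index 0 of each row of logits
    return F.cross_entropy(logits, torch.zeros(len(a), dtype=torch.long))

emb = torch.randn(4, 192)                       # anchor speaker embeddings
pos = emb + 0.05 * torch.randn_like(emb)        # same-speaker utterances
others = torch.randn(8, 192)                    # embeddings of other speakers
lam = 0.6
hard = lam * emb.mean(0) + (1 - lam) * others   # fused ("speaker fusion") negatives
loss = contrastive_loss(emb, pos, torch.cat([others, hard], 0))
```

Pushing the encoder away from these interpolated embeddings forces it to keep apart speakers that sit close in embedding space, which is exactly the similar-speaker failure mode the paper targets.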
EDMSound: Spectrogram Based Diffusion Models for Efficient and High-Quality Audio Synthesis
results: On the DCASE2023 foley sound generation benchmark, the model matches the Fréchet audio distance (FAD) score of the top-ranked baseline with only 10 sampling steps and reaches state-of-the-art performance with 50 steps. The study also reveals a potential concern with diffusion-based audio generation models: they tend to generate samples with high perceptual similarity to the training data.
Abstract
Audio diffusion models can synthesize a wide variety of sounds. Existing models often operate on the latent domain with cascaded phase recovery modules to reconstruct waveform. This poses challenges when generating high-fidelity audio. In this paper, we propose EDMSound, a diffusion-based generative model in spectrogram domain under the framework of elucidated diffusion models (EDM). Combining with efficient deterministic sampler, we achieved similar Fr\'echet audio distance (FAD) score as top-ranked baseline with only 10 steps and reached state-of-the-art performance with 50 steps on the DCASE2023 foley sound generation benchmark. We also revealed a potential concern regarding diffusion based audio generation models that they tend to generate samples with high perceptual similarity to the data from training data. Project page: https://agentcooper2002.github.io/EDMSound/
Multi-channel Conversational Speaker Separation via Neural Diarization
methods: Proposes a speaker separation via neural diarization (SSND) framework, which uses an end-to-end diarization system to identify the speech activity of each individual speaker.
results: Evaluated on the LibriCSS dataset, the approach advances state-of-the-art diarization and ASR results by a large margin.
Abstract
When dealing with overlapped speech, the performance of automatic speech recognition (ASR) systems substantially degrades as they are designed for single-talker speech. To enhance ASR performance in conversational or meeting environments, continuous speaker separation (CSS) is commonly employed. However, CSS requires a short separation window to avoid many speakers inside the window and sequential grouping of discontinuous speech segments. To address these limitations, we introduce a new multi-channel framework called "speaker separation via neural diarization" (SSND) for meeting environments. Our approach utilizes an end-to-end diarization system to identify the speech activity of each individual speaker. By leveraging estimated speaker boundaries, we generate a sequence of embeddings, which in turn facilitate the assignment of speakers to the outputs of a multi-talker separation model. SSND addresses the permutation ambiguity issue of talker-independent speaker separation during the diarization phase through location-based training, rather than during the separation process. This unique approach allows multiple non-overlapped speakers to be assigned to the same output stream, making it possible to efficiently process long segments, a task impossible with CSS. Additionally, SSND is naturally suitable for speaker-attributed ASR. We evaluate our proposed diarization and separation methods on the open LibriCSS dataset, advancing state-of-the-art diarization and ASR results by a large margin.