eess.IV - 2023-11-28

Dynamic Change of Amplitude for OCT Functional Imaging

  • paper_url: http://arxiv.org/abs/2311.17090
  • repo_url: None
  • paper_authors: Yang Jianlong, Zhang Haoran, Liu Chang, Gu Chengfu
  • for: This paper reviews the principles and applications of optical coherence tomography (OCT) functional imaging and envisions its future development.
  • methods: The paper surveys functional imaging techniques based on optical field properties other than amplitude, including Doppler OCT, optical coherence elastography, polarization-sensitive OCT, and visible-light OCT, as well as frontier amplitude-dynamics techniques such as dynamic light scattering OCT, dynamic OCT, and OCT thermometry.
  • results: The paper summarizes the advantages and challenges of these techniques for diagnosing and assessing disease, and outlines their future development.
    Abstract Optical coherence tomography (OCT) is capable of non-destructively obtaining cross-sectional information of samples with micrometer spatial resolution, which plays an important role in ophthalmology and endovascular medicine. Measuring OCT amplitude can obtain three-dimensional structural information of the sample, such as the layered structure of the retina, but is of limited use for functional information such as tissue specificity, blood flow, and mechanical properties. OCT functional imaging techniques based on other optical field properties including phase, polarization state, and wavelength have emerged, such as Doppler OCT, optical coherence elastography, polarization-sensitive OCT, and visible-light OCT. Among them, functional imaging techniques based on dynamic changes of amplitude have significant robustness and complexity advantages, and achieved significant clinical success in label-free blood flow imaging. In addition, dynamic light scattering OCT for 3D blood flow velocity measurement, dynamic OCT with the ability to display label-free tissue/cell specificity, and OCT thermometry for monitoring the temperature field of thermophysical treatments are the frontiers in OCT functional imaging. In this paper, the principles and applications of the above technologies are summarized, the remaining technical challenges are analyzed, and the future development is envisioned.

eess.SP - 2023-11-28

RIS-Enhanced MIMO Channels in Urban Environments: Experimental Insights

  • paper_url: http://arxiv.org/abs/2311.16985
  • repo_url: None
  • paper_authors: James Rains, Anvar Tukmanov, Qammer Abbasi, Muhammad Imran
  • for: The study examines whether the smart radio environment paradigm can measurably enhance the performance of contemporary urban macrocells.
  • methods: The study measures the impact of a reconfigurable intelligent surface (RIS) on a real-world sub-6 GHz MIMO frequency-domain channel, employing a nature-inspired beam search algorithm to maximize channel gain at user positions, which reveals a potential 50% increase in channel capacity in certain circumstances.
  • results: Analysis shows that introducing a RIS in these settings can adversely affect the spatial characteristics of the channel. The team has released the RIS prototype schematics, Gerber files, and source code to aid future experimental efforts of the wireless research community.
    Abstract Can the smart radio environment paradigm measurably enhance the performance of contemporary urban macrocells? In this study, we explore the impact of reconfigurable intelligent surfaces (RISs) on a real-world sub-6 GHz MIMO channel. A rooftop-mounted macrocell antenna has been adapted to enable frequency domain channel measurements to be ascertained. A nature-inspired beam search algorithm has been employed to maximize channel gain at user positions, revealing a potential 50% increase in channel capacity in certain circumstances. Analysis reveals, however, that the spatial characteristics of the channel can be adversely affected through the introduction of a RIS in these settings. The RIS prototype schematics, Gerber files, and source code have been made available to aid in future experimental efforts of the wireless research community.
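The paper's nature-inspired beam search is not detailed here, so the following is only a minimal sketch of one such optimizer: a genetic algorithm over 1-bit RIS element phases on a toy cascaded channel. The channel values, population settings, and the `channel_gain` objective are illustrative assumptions, not the paper's measured data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy cascaded channel: direct path h_d, BS->RIS path g, RIS->user path h_r.
# All values are synthetic placeholders, not the paper's measurements.
N = 64                                     # number of RIS elements (assumed)
h_d = 0.1 * (rng.normal() + 1j * rng.normal())
g = rng.normal(size=N) + 1j * rng.normal(size=N)
h_r = rng.normal(size=N) + 1j * rng.normal(size=N)

def channel_gain(bits):
    """Received power for a 1-bit RIS configuration (0 -> 0 rad, 1 -> pi)."""
    phases = np.where(bits == 1, -1.0, 1.0)        # e^{j*pi} = -1
    return np.abs(h_d + np.sum(g * phases * h_r)) ** 2

# Simple genetic algorithm: tournament selection, uniform crossover, mutation.
pop = rng.integers(0, 2, size=(32, N))
for _ in range(200):
    fitness = np.array([channel_gain(ind) for ind in pop])
    pairs = rng.integers(0, len(pop), size=(len(pop), 2))
    parents = pop[np.where(fitness[pairs[:, 0]] > fitness[pairs[:, 1]],
                           pairs[:, 0], pairs[:, 1])]
    mask = rng.integers(0, 2, size=pop.shape).astype(bool)
    children = np.where(mask, parents, np.roll(parents, 1, axis=0))
    flips = rng.random(pop.shape) < 0.02           # bit-flip mutation
    pop = np.where(flips, 1 - children, children)

best = max(pop, key=channel_gain)
print(f"best gain {channel_gain(best):.2f} vs all-zero {channel_gain(np.zeros(N, int)):.2f}")
```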

HARQ Retransmissions in C-V2X: A BSM Latency Analysis

  • paper_url: http://arxiv.org/abs/2311.16983
  • repo_url: None
  • paper_authors: Abdurrahman Fouda, Randall Berry, Ivan Vukovic
  • For: The paper studies the transmission latency and reliability of periodic basic safety messages (BSMs) in cellular vehicle-to-everything (C-V2X) systems with hybrid automatic repeat request (HARQ) retransmissions and semi-persistent scheduling (SPS) in C-V2X transmission mode 4.
  • Methods: The paper uses extensive system-level simulations that closely follow the SPS process to evaluate BSM transmission latency and reliability with HARQ retransmissions, and additionally provides an analytical model for the tail behavior of the BSM latency distribution.
  • Results: The study reveals the impact of several deployment settings (e.g., bandwidth configurations and vehicle density) on BSM transmission latency and reliability with HARQ retransmissions in C-V2X systems.
    Abstract Cellular vehicular-to-everything (C-V2X) systems offer the potential for improving road safety, in part through the exchange of periodic basic safety messages (BSMs) between nearby vehicles. The reliability and latency of these messages is a key metric. Hybrid automatic repeat request (HARQ) retransmissions are one technique used to this end. However, HARQ may come at the expense of consuming the limited available wireless resources, especially in highly congested scenarios. This paper studies BSM transmission latency and reliability when HARQ retransmissions are used with the semi-persistent scheduling (SPS) in C-V2X transmission mode 4. We do so through extensive system-level simulations that closely follow the SPS process. Furthermore, we provide an analytical model for the tail behavior of the BSM latency distribution with HARQ retransmissions that is a good approximation to the simulation results. Our study reveals the impact of several deployment settings (e.g., bandwidth configurations and vehicle density).
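To make the latency-tail idea concrete, here is a small Monte Carlo sketch under strongly simplifying assumptions (i.i.d. per-transmission losses and a fixed delay per HARQ retransmission); the paper's analytical model and SPS-level simulations are far more detailed, and all constants below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative parameters, not taken from the paper.
p_loss = 0.3          # per-(re)transmission decoding-failure probability
harq_delay_ms = 1.0   # extra latency added by each HARQ retransmission
max_retx = 3          # HARQ retransmission budget
n_msgs = 1_000_000

# Attempts until first success (geometric); a BSM is delivered only if it
# succeeds within the retransmission budget.
attempts = rng.geometric(1.0 - p_loss, size=n_msgs)
delivered = attempts <= max_retx + 1
latency = np.where(delivered,
                   1.0 + (attempts - 1) * harq_delay_ms,   # 1 ms base delay (assumed)
                   np.inf)                                 # lost -> infinite latency

# With i.i.d. losses the tail decays geometrically:
# P(latency > 1 + k ms) = p_loss ** (k + 1) for k < max_retx.
for k in range(max_retx):
    t = 1.0 + k * harq_delay_ms
    print(f"P(latency > {t:.0f} ms): empirical {(latency > t).mean():.4f}, "
          f"analytical {p_loss ** (k + 1):.4f}")
print(f"delivery reliability: {delivered.mean():.4f}")
```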

Study of BSM Inter-Packet Gap Tails in C-V2X Networks

  • paper_url: http://arxiv.org/abs/2311.16904
  • repo_url: None
  • paper_authors: Abdurrahman Fouda, Randall Berry, Ivan Vukovic
  • For: This paper investigates the tail behavior of inter-packet gaps (IPGs) and information age (IA) distributions in C-V2X mode 4, a decentralized resource allocation method based on semi-persistent scheduling (SPS).
  • Methods: The study employs high-fidelity system-level simulations to evaluate the performance of interleaved one-shot SPS transmissions and proposes an accurate analytical model to characterize the IPG tail behavior of C-V2X BSM transmissions.
  • Results: The numerical results demonstrate significant improvement in the IPG and IA tail distributions in various simulation scenarios, and the proposed analytical model is validated by matching the asymptotic slopes of the IPG distribution in different BSM transmission modes.
    Abstract Cellular vehicle-to-everything (C-V2X) enables safety-critical connected vehicular service by exchanging basic safety messages (BSMs) among nearby vehicular users (VUEs). Timely transmission of BSMs is crucial to avoid stale information at VUEs. However, successive packet losses can lead to large inter-packet gaps (IPGs), reducing the BSMs' reliability. This paper investigates the tail behavior of IPG and information age (IA) distributions in C-V2X mode 4, a decentralized resource allocation method based on semi-persistent scheduling (SPS). We study the improvements and trade-offs introduced by SAE one-shot transmission to decrease the number of successive BSM losses at destination VUEs. The study employs high-fidelity system-level simulations that closely follow the SPS process of C-V2X mode 4 to evaluate the performance of interleaved one-shot SPS transmissions. The numerical results demonstrate significant improvement in the IPG and IA tail distributions in various simulation scenarios. Additionally, we propose an accurate analytical model to characterize the IPG tail behavior of C-V2X BSM transmissions. The proposed model is validated by comparing its results with those obtained using the system-level simulations. Our validation shows that the proposed model generates analytical results that coincide with the asymptotic slopes of IPG distribution in different BSM transmission modes.

Localization of a Passive Source with a Sensor Network based Experimental Molecular Communication Platform

  • paper_url: http://arxiv.org/abs/2311.16848
  • repo_url: None
  • paper_authors: Fatih Gulec, Damla Yagmur Koda, Baris Atakan, Andrew W. Eckford
  • for: This work aims to estimate the location of a molecular transmitter (TX), as needed when monitoring air pollutants released from an unknown source.
  • methods: The study uses a novel experimental platform comprising a clustered sensor network (SN) with 24 sensor nodes and evaporating ethanol molecules as the passive TX. In SNCLA, a Gaussian plume model is employed to derive the location estimator; parameters such as the transmitted mass, wind velocity, detection time, and actual concentration are calculated or estimated from the signals measured by the SN and serve as inputs to the estimator.
  • results: Numerical results show that SNCLA performs better under stronger winds. The experimental data show that evaporated molecules do not propagate homogeneously through the SN because of the wind. Moreover, statistical analysis of the measured data shows that the sensed signals follow a log-normal distribution, while the additive noise follows a Student's t-distribution, in contrast to the Gaussian assumption in the literature.
    Abstract In a practical molecular communication scenario such as monitoring air pollutants released from an unknown source, it is essential to estimate the location of the molecular transmitter (TX). This paper presents a novel Sensor Network-based Localization Algorithm (SNCLA) for passive transmission by using a novel experimental platform which mainly comprises a clustered sensor network (SN) with 24 sensor nodes and evaporating ethanol molecules as the passive TX. In SNCLA, a Gaussian plume model is employed to derive the location estimator. The parameters such as transmitted mass, wind velocity, detection time, and actual concentration are calculated or estimated from the measured signals via the SN to be employed as the input for the location estimator. The numerical results show that the performance of SNCLA is better for stronger winds in the medium. Our findings show that evaporated molecules do not propagate homogeneously through the SN due to the presence of the wind. In addition, our statistical analysis based on the measured experimental data shows that the sensed signals by the SN have a log-normal distribution, while the additive noise follows a Student's t-distribution in contrast to the Gaussian assumption in the literature.
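SNCLA's estimator is derived from a Gaussian plume model; as a rough illustration of the idea, the sketch below fits a source location to concentrations simulated from a simplified ground-level 2D plume with multiplicative log-normal measurement noise (matching the signal distribution the paper observes). The plume constants, sensor layout, and least-squares fit are assumptions for illustration, not the paper's derivation.

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(2)

def plume(sensor_xy, src_xy, Q=1.0, u=2.0, sigma=0.5):
    """Simplified ground-level 2D Gaussian plume with wind along +x.

    Concentration ~ Q / (2*pi*u*sigma_y^2) * exp(-y'^2 / (2*sigma_y^2)),
    with dispersion growing linearly downwind. Constants are illustrative.
    """
    dx = np.maximum(sensor_xy[:, 0] - src_xy[0], 1e-3)   # downwind distance
    dy = sensor_xy[:, 1] - src_xy[1]                     # crosswind offset
    sy = sigma * dx                                      # dispersion growth (assumed)
    return Q / (2 * np.pi * u * sy**2) * np.exp(-dy**2 / (2 * sy**2))

# 24 sensor nodes on a grid (mirroring the paper's SN size).
gx, gy = np.meshgrid(np.linspace(1, 6, 6), np.linspace(-2, 2, 4))
sensors = np.column_stack([gx.ravel(), gy.ravel()])

true_src = np.array([0.0, 0.5])
meas = plume(sensors, true_src) * rng.lognormal(0.0, 0.1, len(sensors))

# Least-squares fit of the source location to the measured concentrations.
fit = least_squares(lambda p: plume(sensors, p) - meas, x0=[-1.0, 0.0])
print("estimated source:", fit.x, " true:", true_src)
```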

A Short Overview of 6G V2X Communication Standards

  • paper_url: http://arxiv.org/abs/2311.16810
  • repo_url: None
  • paper_authors: Donglin Wang, Yann Nana Nganso, Hans D. Schotten
  • for: The 6G V2X communication system is designed to support the needs of connected autonomous cars and to improve user experiences, air quality, road safety, and transportation settings.
  • methods: The paper compares the applications of communication technologies such as Wi-Fi, LTE, 5G, and 6G, and focuses on new technologies for 6G V2X, including the brain-vehicle interface, blockchain-based V2X, and Machine Learning (ML).
  • results: The paper discusses the security challenges of 6G V2X and addresses the strengths, open challenges, development, and areas for further study in this field.
    Abstract We are on the verge of a new age of linked autonomous cars with unheard-of user experiences, dramatically improved air quality and road safety, extremely varied transportation settings, and a plethora of cutting-edge apps. A substantially improved Vehicle-to-Everything (V2X) communication network that can simultaneously support massive hyper-fast, ultra-reliable, and low-latency information exchange is necessary to achieve this ambitious goal. These needs of the upcoming V2X are expected to be satisfied by the Sixth Generation (6G) communication system. In this article, we start by introducing the history of V2X communications by giving details on the current, developing, and future developments. We compare the applications of communication technologies such as Wi-Fi, LTE, 5G, and 6G. We focus on the new technologies for 6G V2X which are brain-vehicle interface, blockchain-based V2X, and Machine Learning (ML). To achieve this, we provide a summary of the most recent ML developments in 6G vehicle networks. We discuss the security challenges of 6G V2X. We address the strengths, open challenges, development, and improving areas of further study in this field.

A Novel 3D Non-stationary Localization-assisted ISAC Channel Model

  • paper_url: http://arxiv.org/abs/2311.16798
  • repo_url: None
  • paper_authors: Runruo Yang, Yang Wu, Jie Huang, Cheng-Xiang Wang
  • for: This work targets integrated sensing and communication (ISAC), an emerging application scenario for sixth generation (6G) wireless communication systems.
  • methods: The paper proposes a novel three-dimensional (3D) non-stationary localization-assisted ISAC geometry-based stochastic model (GBSM); with the assistance of backscattering sensing, a particle filter estimates the locations of the first-bounce and last-bounce scatterers in the communication channel.
  • results: The simulated channel statistics of the proposed model show good agreement with ray tracing (RT) results, verifying the model's correctness. By exploiting the localization parameters of the scatterers, the proposed ISAC channel model can better map the real environment.
    Abstract Integrated sensing and communication (ISAC) has attracted wide attention as an emerging application scenario for the sixth generation (6G) wireless communication system. In this paper, a novel three-dimensional (3D) non-stationary localization-assisted ISAC geometry-based stochastic model (GBSM) is proposed. The locations of the first-bounce scatterer and last-bounce scatterer in the communication channel can be estimated by the particle filter with the assistance of backscattering sensing. The important channel statistical properties of the proposed channel model are simulated and compared with the ray tracing (RT) results, including the delay spread, azimuth angle of departure/arrival (AAoD/AAoA) spread, and elevation angle of departure/arrival (EAoD/EAoA) spread. The simulation results of the proposed channel model show a good agreement with the RT results, which proves the correctness of the proposed channel model. Utilizing the localization parameters of scatterers, the proposed ISAC channel model can better map the real environment.
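As an illustration of the particle-filter ingredient, the sketch below estimates a single static scatterer position from noisy bistatic range measurements with a bootstrap particle filter. The measurement model, noise level, and jitter are toy assumptions standing in for the paper's backscattering-sensing setup.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy geometry: fixed TX and RX, one static scatterer to localize.
tx = np.array([0.0, 0.0])
rx = np.array([10.0, 0.0])
scatterer = np.array([4.0, 3.0])
noise_std = 0.1

def bistatic_range(p):
    """TX -> scatterer -> RX path length for point(s) p."""
    return np.linalg.norm(p - tx, axis=-1) + np.linalg.norm(p - rx, axis=-1)

particles = rng.uniform([-5, -5], [15, 10], size=(2000, 2))
weights = np.full(len(particles), 1.0 / len(particles))

for _ in range(30):                       # one noisy measurement per step
    z = bistatic_range(scatterer) + rng.normal(0, noise_std)
    lik = np.exp(-0.5 * ((bistatic_range(particles) - z) / noise_std) ** 2)
    weights *= lik
    weights /= weights.sum()
    # Systematic resampling + small jitter (static state, jitter keeps diversity).
    u = (rng.random() + np.arange(len(particles))) / len(particles)
    idx = np.searchsorted(np.cumsum(weights), u)
    particles = particles[idx] + rng.normal(0, 0.05, particles.shape)
    weights = np.full(len(particles), 1.0 / len(particles))

print("estimate:", particles.mean(axis=0), " true:", scatterer)
```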

A General 3D Non-Stationary 5G Wireless Channel Model

  • paper_url: http://arxiv.org/abs/2311.16783
  • repo_url: None
  • paper_authors: Shangbin Wu, Cheng-Xiang Wang, el-Hadi M. Aggoune, Mohammed M. Alwakeel, Xiao-Hu You
  • for: This paper proposes a unified geometry-based stochastic model (GBSM) framework to capture the small-scale fading channel characteristics of fifth generation (5G) wireless communication systems.
  • methods: The model builds on the WINNER II and Saleh-Valenzuela (SV) channel models, accounts for array-time cluster evolution, and can be reduced to various simplified channel models by properly adjusting the model parameters.
  • results: The statistical properties of the proposed channel model are investigated, demonstrating its ability to capture the channel characteristics of various scenarios, with excellent fits to corresponding channel measurements.
    Abstract A novel unified framework of geometry-based stochastic models (GBSMs) for the fifth generation (5G) wireless communication systems is proposed in this paper. The proposed general 5G channel model aims at capturing small-scale fading channel characteristics of key 5G communication scenarios, such as massive multiple-input multiple-output (MIMO), high-speed train (HST), vehicle-to-vehicle (V2V), and millimeter wave (mmWave) communication scenarios. It is a three-dimensional (3D) non-stationary channel model based on the WINNER II and Saleh-Valenzuela (SV) channel models considering array-time cluster evolution. Moreover, it can easily be reduced to various simplified channel models by properly adjusting model parameters. Statistical properties of the proposed general 5G small-scale fading channel model are investigated to demonstrate its capability of capturing channel characteristics of various scenarios, with excellent fitting to some corresponding channel measurements.
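One building block named in the abstract is the Saleh-Valenzuela model, in which clusters and rays arrive as Poisson processes with exponentially decaying powers. The sketch below generates such an impulse response and computes the RMS delay spread, one of the statistics such models are validated on; all rate and decay constants are illustrative, not the paper's parameter sets.

```python
import numpy as np

rng = np.random.default_rng(4)

# Saleh-Valenzuela parameters (illustrative values).
Lambda, lam = 0.2, 2.0      # cluster / ray arrival rates (1/ns)
Gamma, gamma = 30.0, 5.0    # cluster / ray power decay constants (ns)
n_clusters, n_rays = 5, 20

delays, gains = [], []
T = 0.0
for _ in range(n_clusters):
    T += rng.exponential(1.0 / Lambda)        # Poisson cluster arrivals
    tau = 0.0
    for _ in range(n_rays):
        tau += rng.exponential(1.0 / lam)     # Poisson ray arrivals within cluster
        power = np.exp(-T / Gamma) * np.exp(-tau / gamma)
        phase = rng.uniform(0, 2 * np.pi)     # uniform random ray phase
        delays.append(T + tau)
        gains.append(np.sqrt(power) * np.exp(1j * phase))

delays, gains = np.array(delays), np.array(gains)
# RMS delay spread from the normalized power-delay profile.
p = np.abs(gains) ** 2 / np.sum(np.abs(gains) ** 2)
mean_tau = np.sum(p * delays)
rms_ds = np.sqrt(np.sum(p * (delays - mean_tau) ** 2))
print(f"paths: {len(delays)}, RMS delay spread: {rms_ds:.2f} ns")
```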

Active RIS Enhanced Spectrum Sensing for Cognitive Radio Networks

  • paper_url: http://arxiv.org/abs/2311.16568
  • repo_url: None
  • paper_authors: Jungang Ge, Ying-Chang Liang, Sumei Sun, Yonghong Zeng
  • for: Improving secondary-user spectrum sensing performance when the primary signal is weak
  • methods: Using an active reconfigurable intelligent surface (RIS) to assist signal detection, and optimizing the reflecting coefficient matrix to improve the detection probability
  • results: Compared to a traditional passive RIS, the active RIS improves detection accuracy when the underlying interference is relatively weak, whereas the passive RIS performs better in strong-interference scenarios because the same power budget can support far more passive reflecting elements for interference mitigation
    Abstract In opportunistic cognitive radio networks, when the primary signal is very weak compared to the background noise, the secondary user requires long sensing time to achieve a reliable spectrum sensing performance, leading to little remaining time for the secondary transmission. To tackle this issue, we propose an active reconfigurable intelligent surface (RIS) assisted spectrum sensing system, where the received signal strength from the interested primary user can be enhanced and underlying interference within the background noise can be mitigated as well. In comparison with the passive RIS, the active RIS can not only adapt the phase shift of each reflecting element but also amplify the incident signals. Notably, we study the reflecting coefficient matrix (RCM) optimization problem to improve the detection probability given a maximum tolerable false alarm probability and limited sensing time. Then, we show that the formulated problem can be equivalently transformed to a weighted mean square error minimization problem using the principle of the well-known weighted minimum mean square error (WMMSE) algorithm, and an iterative optimization approach is proposed to obtain the optimal RCM. In addition, to fairly compare passive RIS and active RIS, we study the required power budget of the RIS to achieve a target detection probability under a special case where the direct links are neglected and the RIS-related channels are line-of-sight. Via extensive simulations, the effectiveness of the WMMSE-based RCM optimization approach is demonstrated. Furthermore, the results reveal that the active RIS can outperform the passive RIS when the underlying interference within the background noise is relatively weak, whereas the passive RIS performs better in strong interference scenarios because the same power budget can support a vast number of passive reflecting elements for interference mitigation.
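The detection/false-alarm trade-off that the RCM optimization targets can be illustrated with the canonical energy detector: the threshold is set from the false-alarm constraint under noise-only samples, and the detection probability is then measured under a weak primary signal. The SNR, sample count, and Pfa below are assumed values, and this is the generic detector, not the paper's optimized system.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

N = 500            # sensing samples (assumed)
snr = 0.1          # post-RIS primary-signal SNR at the secondary user (linear, assumed)
pfa = 0.01         # maximum tolerable false alarm probability
trials = 100_000

# Under H0 the energy of N unit-variance complex noise samples satisfies
# 2 * sum|n|^2 ~ chi-square with 2N degrees of freedom, which sets the threshold.
thresh = stats.chi2.ppf(1 - pfa, df=2 * N) / 2.0

noise = (rng.normal(size=(trials, N)) + 1j * rng.normal(size=(trials, N))) / np.sqrt(2)
signal = np.sqrt(snr) * (rng.normal(size=(trials, N))
                         + 1j * rng.normal(size=(trials, N))) / np.sqrt(2)

energy_h0 = np.sum(np.abs(noise) ** 2, axis=1)
energy_h1 = np.sum(np.abs(noise + signal) ** 2, axis=1)
print(f"empirical Pfa = {(energy_h0 > thresh).mean():.4f} (target {pfa})")
print(f"detection probability Pd = {(energy_h1 > thresh).mean():.4f}")
```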

Energy Efficiency Optimization in Active Reconfigurable Intelligent Surface-Aided Integrated Sensing and Communication Systems

  • paper_url: http://arxiv.org/abs/2311.16433
  • repo_url: None
  • paper_authors: Junjie Ye, Mohamed Rihan, Peichang Zhang, Lei Huang, Stefano Buzzi, Zhen Chen
  • for: Improving the energy efficiency (EE) of integrated sensing and communication (ISAC) systems
  • methods: Using an active reconfigurable intelligent surface (RIS), whose amplification gains assist the ISAC system in improving its EE
  • results: Compared with both the passive-RIS and spectral-efficiency-optimization baselines, the proposed method achieves significant EE gains
    Abstract Energy efficiency (EE) is a challenging task in integrated sensing and communication (ISAC) systems, where high spectral efficiency and low energy consumption appear as conflicting requirements. Although passive reconfigurable intelligent surface (RIS) has emerged as a promising technology for enhancing the EE of the ISAC system, the multiplicative fading feature hinders its effectiveness. This paper proposes the use of active RIS with its amplification gains to assist the ISAC system for EE improvement. Specifically, we formulate an EE optimization problem in an active RIS-aided ISAC system under system power budgets, considering constraints on user communication quality of service and sensing signal-to-noise ratio (SNR). A novel alternating optimization algorithm is developed to address the highly non-convex problem by leveraging a combination of the generalized Rayleigh quotient optimization approach, semidefinite relaxation (SDR), and the majorization-minimization (MM) framework. Furthermore, to accelerate the algorithm and reduce computational complexity, we derive a semi-closed form for eigenvalue determination. Numerical results demonstrate the effectiveness of the proposed approach, showcasing significant improvements in EE compared to both passive RIS and spectrum efficiency optimization cases.

A Deep Q-Learning based, Base-Station Connectivity-Aware, Decentralized Pheromone Mobility Model for Autonomous UAV Networks

  • paper_url: http://arxiv.org/abs/2311.16409
  • repo_url: None
  • paper_authors: Shreyas Devaraju, Alexander Ihler, Sunil Kumar
  • for: This paper proposes an autonomous coordination approach that achieves high network connectivity and fast area coverage in networks of low-SWaP fixed-wing UAVs.
  • methods: The paper presents a distributed, node-degree and base-station connectivity-aware pheromone (BS-CAP) mobility model that coordinates UAV movements in a decentralized fashion without knowledge of the complete network topology, and further proposes a deep Q-learning policy based BS-CAP model (BSCAP-DQN) to tune the coverage-connectivity trade-off.
  • results: Simulations show that both proposed models achieve efficient area coverage and the desired node degree and BS connectivity, improving significantly over existing schemes.
    Abstract UAV networks consisting of low SWaP (size, weight, and power), fixed-wing UAVs are used in many applications, including area monitoring, search and rescue, surveillance, and tracking. Performing these operations efficiently requires a scalable, decentralized, autonomous UAV network architecture with high network connectivity. Whereas fast area coverage is needed for quickly sensing the area, strong node degree and base station (BS) connectivity are needed for UAV control and coordination and for transmitting sensed information to the BS in real time. However, the area coverage and connectivity exhibit a fundamental trade-off: maintaining connectivity restricts the UAVs' ability to explore. In this paper, we first present a node degree and BS connectivity-aware distributed pheromone (BS-CAP) mobility model to autonomously coordinate the UAV movements in a decentralized UAV network. This model maintains a desired connectivity among 1-hop neighbors and to the BS while achieving fast area coverage. Next, we propose a deep Q-learning policy based BS-CAP model (BSCAP-DQN) to further tune and improve the coverage and connectivity trade-off. Since it is not practical to know the complete topology of such a network in real time, the proposed mobility models work online, are fully distributed, and rely on neighborhood information. Our simulations demonstrate that both proposed models achieve efficient area coverage and desired node degree and BS connectivity, improving significantly over existing schemes.
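For intuition about the pheromone ingredient, the sketch below runs a toy digital-pheromone coverage model on a grid: UAVs deposit pheromone on visited cells, the pheromone evaporates over time, and each UAV greedily moves to the least-marked neighboring cell. Connectivity constraints and the DQN policy are omitted, and all constants are assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)

G = 20                                   # grid size (assumed)
pher = np.zeros((G, G))                  # "visited" pheromone map
evaporation, deposit = 0.98, 1.0
uavs = rng.integers(0, G, size=(6, 2))   # 6 UAV cell positions

moves = np.array([[0, 1], [0, -1], [1, 0], [-1, 0]])
for _ in range(500):
    pher *= evaporation                  # pheromone evaporates each step
    for i, pos in enumerate(uavs):
        pher[tuple(pos)] += deposit      # mark the current cell as covered
        # Move toward the neighboring cell with the least pheromone.
        cand = np.clip(pos + moves, 0, G - 1)
        scores = pher[cand[:, 0], cand[:, 1]] + 1e-6 * rng.random(4)  # tie-break
        uavs[i] = cand[np.argmin(scores)]

print(f"fraction of cells visited: {(pher > 0).mean():.2%}")
```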

cs.SD - 2023-11-27

Ultrasensitive Textile Strain Sensors Redefine Wearable Silent Speech Interfaces with High Machine Learning Efficiency

  • paper_url: http://arxiv.org/abs/2311.15683
  • repo_url: None
  • paper_authors: Chenyu Tang, Muzi Xu, Wentian Yi, Zibo Zhang, Edoardo Occhipinti, Chaoqun Dong, Dafydd Ravenscroft, Sung-Min Jung, Sanghyo Lee, Shuo Gao, Jong Min Kim, Luigi G. Occhipinti
  • for: This work aims to develop a practical, sensitive, and precise wearable Silent Speech Interface (SSI) for everyday communication applications.
  • methods: The researchers developed a biocompatible, durable textile choker with an embedded graphene-based strain sensor that accurately detects subtle throat movements. The sensor surpasses other strain sensors in sensitivity by 420% and greatly simplifies signal processing compared with traditional voice recognition methods.
  • results: A computationally efficient neural network, specifically a one-dimensional convolutional neural network with residual structures, decodes the speech signals, reducing computational load by 90% while achieving 95.25% accuracy on a 20-word lexicon and swiftly adapting to new users and words with minimal samples.
    Abstract Our research presents a wearable Silent Speech Interface (SSI) technology that excels in device comfort, time-energy efficiency, and speech decoding accuracy for real-world use. We developed a biocompatible, durable textile choker with an embedded graphene-based strain sensor, capable of accurately detecting subtle throat movements. This sensor, surpassing other strain sensors in sensitivity by 420%, simplifies signal processing compared to traditional voice recognition methods. Our system uses a computationally efficient neural network, specifically a one-dimensional convolutional neural network with residual structures, to decode speech signals. This network is energy and time-efficient, reducing computational load by 90% while achieving 95.25% accuracy for a 20-word lexicon and swiftly adapting to new users and words with minimal samples. This innovation demonstrates a practical, sensitive, and precise wearable SSI suitable for daily communication applications.
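The decoder is described as a 1D convolutional network with residual structures; the published architecture details are not reproduced here, so the following PyTorch sketch only illustrates that family of models, with assumed layer sizes, channel counts, and input length.

```python
import torch
import torch.nn as nn

class ResBlock1d(nn.Module):
    """Two 1-D convolutions with a skip connection (residual structure)."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(ch, ch, kernel_size=7, padding=3),
            nn.BatchNorm1d(ch), nn.ReLU(),
            nn.Conv1d(ch, ch, kernel_size=7, padding=3),
            nn.BatchNorm1d(ch))
    def forward(self, x):
        return torch.relu(x + self.body(x))

class SilentSpeechNet(nn.Module):
    """1-D residual CNN mapping a strain-signal window to word logits."""
    def __init__(self, n_words=20):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            ResBlock1d(32),
            nn.Conv1d(32, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            ResBlock1d(64),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(64, n_words))
    def forward(self, x):                      # x: (batch, 1, samples)
        return self.net(x)

model = SilentSpeechNet()
logits = model(torch.randn(8, 1, 1024))        # 8 windows of 1024 strain samples
print(logits.shape)                            # torch.Size([8, 20])
```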

Spatial Diarization for Meeting Transcription with Ad-Hoc Acoustic Sensor Networks

  • paper_url: http://arxiv.org/abs/2311.15597
  • repo_url: None
  • paper_authors: Tobias Gburrek, Joerg Schmalenstroeer, Reinhold Haeb-Umbach
  • for: This paper develops the front-end of a meeting transcription system that operates on signals recorded by an acoustic sensor network (ASN).
  • methods: The system blindly synchronizes the signals and computes time difference of arrival (TDOA) information, uses the TDOAs to estimate the speakers' activity even with multiple simultaneously active speakers, and uses this activity information to initialize a spatial mixture model, on the basis of which the individual speakers' signals are extracted via beamforming.
  • results: Experiments on real recordings from the LibriWASN data set show that the proposed system outperforms a system whose spatial mixture model does not use external diarization information.
    Abstract We propose a diarization system, that estimates "who spoke when" based on spatial information, to be used as a front-end of a meeting transcription system running on the signals gathered from an acoustic sensor network (ASN). Although the spatial distribution of the microphones is advantageous, exploiting the spatial diversity for diarization and signal enhancement is challenging, because the microphones' positions are typically unknown, and the recorded signals are initially unsynchronized in general. Here, we approach these issues by first blindly synchronizing the signals and then estimating time differences of arrival (TDOAs). The TDOA information is exploited to estimate the speakers' activity, even in the presence of multiple speakers being simultaneously active. This speaker activity information serves as a guide for a spatial mixture model, on which basis the individual speaker's signals are extracted via beamforming. Finally, the extracted signals are forwarded to a speech recognizer. Additionally, a novel initialization scheme for spatial mixture models based on the TDOA estimates is proposed. Experiments conducted on real recordings from the LibriWASN data set have shown that our proposed system is advantageous compared to a system using a spatial mixture model, which does not make use of external diarization information.
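TDOA estimation between a pair of microphones is commonly done with the GCC-PHAT cross-correlation; the sketch below shows that standard technique on a synthetic delayed signal. It illustrates the general method, not necessarily the exact estimator used in the paper.

```python
import numpy as np

rng = np.random.default_rng(7)

def gcc_phat(x, y, fs):
    """Delay of y relative to x (positive if y lags x) via GCC-PHAT."""
    n = 2 * max(len(x), len(y))                   # zero-pad against circular wrap
    X, Y = np.fft.rfft(x, n), np.fft.rfft(y, n)
    cross = np.conj(X) * Y
    cross /= np.abs(cross) + 1e-12                # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n)
    cc = np.concatenate([cc[-n // 2:], cc[:n // 2]])  # center zero lag
    lags = np.arange(-n // 2, n // 2)
    return lags[np.argmax(cc)] / fs

# Synthetic check: white noise delayed by 25 samples at 16 kHz plus sensor noise.
fs, delay = 16000, 25
s = rng.normal(size=8000)
x = s
y = np.concatenate([np.zeros(delay), s[:-delay]]) + 0.1 * rng.normal(size=8000)
print(f"estimated TDOA {gcc_phat(x, y, fs) * 1e3:.3f} ms, true {delay / fs * 1e3:.3f} ms")
```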

eess.AS - 2023-11-27

Voice Anonymization for All – Bias Evaluation of the Voice Privacy Challenge Baseline System

  • paper_url: http://arxiv.org/abs/2311.15804
  • repo_url: None
  • paper_authors: Anna Leschanowsky, Ünal Ege Gaznepoglu, Nils Peters
  • for: Voice anonymization can protect people's privacy, provided such systems work equally well across subgroups.
  • methods: Using evaluation data from the Voice Privacy Challenge, the study curates a novel benchmark dataset to assess performance disparities among speaker subgroups based on sex and dialect, and analyzes the impact of three anonymization systems and attack models on subgroup bias.
  • results: The study reveals significant performance variations and shows that subgroup bias intensifies with more advanced attacker capabilities, highlighting the need for inclusive benchmark datasets and comprehensive evaluation strategies that address subgroup bias in voice anonymization.
    Abstract In an age of voice-enabled technology, voice anonymization offers a solution to protect people's privacy, provided these systems work equally well across subgroups. This study investigates bias in voice anonymization systems within the context of the Voice Privacy Challenge. We curate a novel benchmark dataset to assess performance disparities among speaker subgroups based on sex and dialect. We analyze the impact of three anonymization systems and attack models on speaker subgroup bias and reveal significant performance variations. Notably, subgroup bias intensifies with advanced attacker capabilities, emphasizing the challenge of achieving equal performance across all subgroups. Our study highlights the need for inclusive benchmark datasets and comprehensive evaluation strategies that address subgroup bias in voice anonymization.

cs.CV - 2023-11-27

Small and Dim Target Detection in IR Imagery: A Review

  • paper_url: http://arxiv.org/abs/2311.16346
  • repo_url: None
  • paper_authors: Nikhil Kumar, Pravendra Singh
  • for: The main objective of this review is to survey the detection of small and dim targets in infrared (IR) imagery, which is challenging because the background is cluttered with unclear details and the IR signature of the scene changes over time with thermodynamic fluctuations.
  • methods: The review covers a range of methods, from conventional image processing to state-of-the-art deep learning approaches, organized into multi-frame and single-frame methods; single-frame methods span traditional image processing techniques as well as more advanced deep learning methods.
  • results: The review finds that deep learning approaches perform better than traditional image-processing-based approaches. It also provides a comprehensive compilation of available datasets and identifies the gaps and limitations of existing techniques, paving the way for future research and development in this area.
    Abstract While there has been significant progress in object detection using conventional image processing and machine learning algorithms, exploring small and dim target detection in the IR domain is a relatively new area of study. The majority of small and dim target detection methods are derived from conventional object detection algorithms, albeit with some alterations. The task of detecting small and dim targets in IR imagery is complex. This is because these targets often lack distinct features, the background is cluttered with unclear details, and the IR signatures of the scene can change over time due to fluctuations in thermodynamics. The primary objective of this review is to highlight the progress made in this field. This is the first review in the field of small and dim target detection in infrared imagery, encompassing various methodologies ranging from conventional image processing to cutting-edge deep learning-based approaches. The authors have also introduced a taxonomy of such approaches. There are two main types of approaches: methodologies using several frames for detection, and single-frame-based detection techniques. Single frame-based detection techniques encompass a diverse range of methods, spanning from traditional image processing-based approaches to more advanced deep learning methodologies. Our findings indicate that deep learning approaches perform better than traditional image processing-based approaches. In addition, a comprehensive compilation of various available datasets has also been provided. Furthermore, this review identifies the gaps and limitations in existing techniques, paving the way for future research and development in this area.

Spatially Adaptive Cloth Regression with Implicit Neural Representations

  • paper_url: http://arxiv.org/abs/2311.16344
  • repo_url: None
  • paper_authors: Lei Shu, Vinicius Azevedo, Barbara Solenthaler, Markus Gross
  • for: This paper addresses the accurate representation of fine-detailed cloth wrinkles in computer graphics, where the non-uniform structure of wrinkles otherwise demands intricate discretization strategies with high computational costs.
  • methods: The paper proposes a novel anisotropic cloth regression technique built on implicit neural representations of surfaces, combining a mesh-free sampling approach, which reduces the reliance on traditional mesh structures, with a novel adversarial training scheme that balances the sampling and simulation objectives.
  • results: Across various cloth-object interaction scenarios, and under the same memory constraints, the method consistently surpasses traditional discrete representations, particularly when modelling highly detailed localized wrinkles.
    Abstract The accurate representation of fine-detailed cloth wrinkles poses significant challenges in computer graphics. The inherently non-uniform structure of cloth wrinkles mandates the employment of intricate discretization strategies, which are frequently characterized by high computational demands and complex methodologies. Addressing this, the research introduced in this paper elucidates a novel anisotropic cloth regression technique that capitalizes on the potential of implicit neural representations of surfaces. Our first core contribution is an innovative mesh-free sampling approach, crafted to reduce the reliance on traditional mesh structures, thereby offering greater flexibility and accuracy in capturing fine cloth details. Our second contribution is a novel adversarial training scheme, which is designed meticulously to strike a harmonious balance between the sampling and simulation objectives. The adversarial approach ensures that the wrinkles are represented with high fidelity, while also maintaining computational efficiency. Our results showcase through various cloth-object interaction scenarios that our method, given the same memory constraints, consistently surpasses traditional discrete representations, particularly when modelling highly-detailed localized wrinkles.

Multi-3D-Models Registration-Based Augmented Reality (AR) Instructions for Assembly

  • paper_url: http://arxiv.org/abs/2311.16337
  • repo_url: None
  • paper_authors: Seda Tuzun Canadinc, Wei Yan
  • for: The paper introduces a novel, markerless, step-by-step, in-situ 3D Augmented Reality (AR) instruction method for small parts assembly.
  • methods: Deep-learning-trained 3D model-based registration realistically visualizes the rendered 3D assembly parts at the assembly location of the physical model, with user control of the assembly process; multiple assembly phases are combined with a step count to support accurate object recognition and precise visualization of each step.
  • results: Testing, heuristic evaluation, and qualitative analysis with users and experts show that BRICKxAR (M3D) provides robust 3D AR instructions and allows the assembly model to be handled.
    Abstract This paper introduces a novel, markerless, step-by-step, in-situ 3D Augmented Reality (AR) instruction method and its application - BRICKxAR (Multi 3D Models/M3D) - for small parts assembly. BRICKxAR (M3D) realistically visualizes rendered 3D assembly parts at the assembly location of the physical assembly model (Figure 1). The user controls the assembly process through a user interface. BRICKxAR (M3D) utilizes deep learning-trained 3D model-based registration. Object recognition and tracking become challenging as the assembly model updates at each step. Additionally, not every part in a 3D assembly may be visible to the camera during the assembly. BRICKxAR (M3D) combines multiple assembly phases with a step count to address these challenges. Thus, using fewer phases simplifies the complex assembly process while step count facilitates accurate object recognition and precise visualization of each step. A testing and heuristic evaluation of the BRICKxAR (M3D) prototype and qualitative analysis were conducted with users and experts in visualization and human-computer interaction. Providing robust 3D AR instructions and allowing the handling of the assembly model, BRICKxAR (M3D) has the potential to be used at different scales ranging from manufacturing assembly to construction.

Characterizing Video Question Answering with Sparsified Inputs

  • paper_url: http://arxiv.org/abs/2311.16311
  • repo_url: None
  • paper_authors: Shiyuan Huang, Robinson Piramuthu, Vicente Ordonez, Shih-Fu Chang, Gunnar A. Sigurdsson
  • for: This work aims to improve data efficiency in video question answering by selecting the most informative video frames while maintaining task performance.
  • methods: A Gumbel-based learnable selection module adaptively selects the best inputs, over multiple numbers of frames and other modalities, for the final task.
  • results: Experiments on public VideoQA benchmarks show only a 5.2%-5.8% loss of performance with only 10% of the video length, corresponding to 2-4 frames selected per video. Complementary behavior between visual and textual inputs is observed even under highly sparsified settings, suggesting the potential of improving data efficiency for video-and-language tasks.
    Abstract In Video Question Answering, videos are often processed as a full-length sequence of frames to ensure minimal loss of information. Recent works have demonstrated evidence that sparse video inputs are sufficient to maintain high performance. However, they usually discuss the case of single frame selection. In our work, we extend the setting to multiple numbers of inputs and other modalities. We characterize the task with different input sparsity and provide a tool for doing that. Specifically, we use a Gumbel-based learnable selection module to adaptively select the best inputs for the final task. In this way, we experiment over public VideoQA benchmarks and provide analysis on how sparsified inputs affect the performance. From our experiments, we have observed only 5.2%-5.8% loss of performance with only 10% of video lengths, which corresponds to 2-4 frames selected from each video. Meanwhile, we also observed the complementary behaviour between visual and textual inputs, even under highly sparsified settings, suggesting the potential of improving data efficiency for video-and-language tasks.
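A Gumbel-based learnable selection module can be realized as differentiable top-k selection with Gumbel noise and a straight-through estimator; the sketch below shows one such mechanism with a placeholder linear scorer. The paper's exact module may differ, and all shapes and the `scorer` are assumptions.

```python
import torch

def gumbel_topk_select(frame_feats, scorer, k, tau=1.0, hard=True):
    """frame_feats: (batch, T, D) frame features; returns (batch, k, D)."""
    logits = scorer(frame_feats).squeeze(-1)                 # (batch, T) saliency
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-10) + 1e-10)
    scores = (logits + gumbel) / tau                         # perturbed scores
    # Straight-through: hard top-k mask in the forward pass, softmax gradients.
    soft = scores.softmax(dim=-1)
    topk = scores.topk(k, dim=-1).indices                    # (batch, k)
    hard_mask = torch.zeros_like(soft).scatter(-1, topk, 1.0)
    mask = hard_mask + soft - soft.detach() if hard else soft
    sel = frame_feats * mask.unsqueeze(-1)                   # weight all frames
    # Gather only the k selected frames.
    return sel.gather(1, topk.unsqueeze(-1).expand(-1, -1, frame_feats.size(-1)))

scorer = torch.nn.Linear(256, 1)                             # per-frame score (assumed)
video = torch.randn(4, 32, 256)                              # 4 videos x 32 frames
picked = gumbel_topk_select(video, scorer, k=4)
print(picked.shape)                                          # torch.Size([4, 4, 256])
```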

Robust Self-calibration of Focal Lengths from the Fundamental Matrix

  • paper_url: http://arxiv.org/abs/2311.16304
  • repo_url: https://github.com/kocurvik/robust_self_calibration
  • paper_authors: Viktor Kocur, Daniel Kyselica, Zuzana Kúkelová
  • for: Estimating the focal lengths and principal points of two cameras from a given fundamental matrix is one of the basic problems in geometric computer vision.
  • methods: The paper proposes an efficient and robust iterative method that estimates the camera parameters from the fundamental matrix together with priors for the estimated parameters, plus a computationally efficient check of models generated within RANSAC.
  • results: The iterative method significantly improves the accuracy of the estimated focal lengths over the Bougnoux formula and other state-of-the-art methods, even with inaccurate priors, while the RANSAC model check improves the accuracy of the estimated models and reduces the total computational time.
    Abstract The problem of self-calibration of two cameras from a given fundamental matrix is one of the basic problems in geometric computer vision. Under the assumption of known principal points and square pixels, the well-known Bougnoux formula offers a means to compute the two unknown focal lengths. However, in many practical situations, the formula yields inaccurate results due to commonly occurring singularities. Moreover, the estimates are sensitive to noise in the computed fundamental matrix and to the assumed positions of the principal points. In this paper, we therefore propose an efficient and robust iterative method to estimate the focal lengths along with the principal points of the cameras given a fundamental matrix and priors for the estimated camera parameters. In addition, we study a computationally efficient check of models generated within RANSAC that improves the accuracy of the estimated models while reducing the total computational time. Extensive experiments on real and synthetic data show that our iterative method brings significant improvements in terms of the accuracy of the estimated focal lengths over the Bougnoux formula and other state-of-the-art methods, even when relying on inaccurate priors.
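As background for the problem setup, the sketch below recovers two focal lengths from a fundamental matrix by enforcing the essential-matrix property that E = K2^T F K1 has two equal non-zero singular values (a repeated singular value is a codimension-2 condition, so the solution is generically isolated). This is a generic numerical illustration, not the paper's iterative method or the closed-form Bougnoux formula; it assumes known principal points at the origin, and, as the paper advocates, a reasonable initialization (prior) matters.

```python
import numpy as np
from scipy.optimize import minimize

def K(f):
    """Calibration matrix with principal point at the origin (assumed known)."""
    return np.diag([f, f, 1.0])

def residual(log_f, F):
    f1, f2 = np.exp(log_f)                  # log-parameterization keeps f > 0
    E = K(f2).T @ F @ K(f1)
    s = np.linalg.svd(E, compute_uv=False)
    return ((s[0] - s[1]) / s[1]) ** 2      # zero iff E is a valid essential matrix

# Synthetic test: build F from known intrinsics and a random relative pose.
rng = np.random.default_rng(8)
f1_true, f2_true = 900.0, 1200.0
t = rng.normal(size=3); t /= np.linalg.norm(t)
R = np.linalg.qr(rng.normal(size=(3, 3)))[0]
if np.linalg.det(R) < 0:
    R = -R
tx = np.array([[0, -t[2], t[1]], [t[2], 0, -t[0]], [-t[1], t[0], 0]])
F = np.linalg.inv(K(f2_true)).T @ (tx @ R) @ np.linalg.inv(K(f1_true))

res = minimize(residual, x0=np.log([1000.0, 1000.0]), args=(F,), method="Nelder-Mead")
print("estimated focals:", np.exp(res.x), " true:", (f1_true, f2_true))
```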

Aligning Non-Causal Factors for Transformer-Based Source-Free Domain Adaptation

  • paper_url: http://arxiv.org/abs/2311.16294
  • repo_url: None
  • paper_authors: Sunandini Sanyal, Ashish Ramayee Asokan, Suvaansh Bhambri, Pradyumna YM, Akshay Kulkarni, Jogendra Nath Kundu, R Venkatesh Babu
  • for: The goal of this work is to improve target adaptation performance, particularly in the source-free domain adaptation setting.
  • methods: The paper proposes a framework that disentangles causal and non-causal factors and supports causal factor alignment by aligning the non-causal factors first, using a style classification task for global alignment of the non-causal factors; the strong shape bias of vision transformers, coupled with their multi-head attention, makes them a suitable architecture for realizing this disentanglement.
  • results: The method achieves state-of-the-art results on several domain adaptation benchmarks.
    Abstract Conventional domain adaptation algorithms aim to achieve better generalization by aligning only the task-discriminative causal factors between a source and target domain. However, we find that retaining the spurious correlation between causal and non-causal factors plays a vital role in bridging the domain gap and improving target adaptation. Therefore, we propose to build a framework that disentangles and supports causal factor alignment by aligning the non-causal factors first. We also investigate and find that the strong shape bias of vision transformers, coupled with its multi-head attention, make it a suitable architecture for realizing our proposed disentanglement. Hence, we propose to build a Causality-enforcing Source-Free Transformer framework (C-SFTrans) to achieve disentanglement via a novel two-stage alignment approach: a) non-causal factor alignment: non-causal factors are aligned using a style classification task which leads to an overall global alignment, b) task-discriminative causal factor alignment: causal factors are aligned via target adaptation. We are the first to investigate the role of vision transformers (ViTs) in a privacy-preserving source-free setting. Our approach achieves state-of-the-art results in several DA benchmarks.

VehicleGAN: Pair-flexible Pose Guided Image Synthesis for Vehicle Re-identification

  • paper_url: http://arxiv.org/abs/2311.16278
  • repo_url: None
  • paper_authors: Baolu Li, Ping Liu, Lan Fu, Jinlong Li, Jianwu Fang, Zhigang Xu, Hongkai Yu
  • for: 提高 Vehicle Re-ID 模型在实际世界中的性能,解决不同摄像头视角导致的特征空间混淆问题。
  • methods: 提出一种可以在不同摄像头中拍摄的车辆图像的大量合成方法,以增强特征识别。在实际世界中可能无法获得对应的对比数据,因此提出了首个不需要三元数据的可灵活对应 pose 导向图像生成方法(VehicleGAN)。
  • results: 实验结果表明,对于 VeRi-776 和 VehicleID 数据集,提出的 VehicleGAN 和 JML 可以提高 Vehicle Re-ID 模型的准确率和效果。
    Abstract Vehicle Re-identification (Re-ID) has been broadly studied in the last decade; however, the different camera view angle leading to confused discrimination in the feature subspace for the vehicles of various poses, is still challenging for the Vehicle Re-ID models in the real world. To promote the Vehicle Re-ID models, this paper proposes to synthesize a large number of vehicle images in the target pose, whose idea is to project the vehicles of diverse poses into the unified target pose so as to enhance feature discrimination. Considering that the paired data of the same vehicles in different traffic surveillance cameras might be not available in the real world, we propose the first Pair-flexible Pose Guided Image Synthesis method for Vehicle Re-ID, named as VehicleGAN in this paper, which works for both supervised and unsupervised settings without the knowledge of geometric 3D models. Because of the feature distribution difference between real and synthetic data, simply training a traditional metric learning based Re-ID model with data-level fusion (i.e., data augmentation) is not satisfactory, therefore we propose a new Joint Metric Learning (JML) via effective feature-level fusion from both real and synthetic data. Intensive experimental results on the public VeRi-776 and VehicleID datasets prove the accuracy and effectiveness of our proposed VehicleGAN and JML.

Self-Supervised Learning of Whole and Component-Based Semantic Representations for Person Re-Identification

  • paper_url: http://arxiv.org/abs/2311.17074
  • repo_url: None
  • paper_authors: Siyuan Huang, Yifan Zhou, Ram Prabhakar Kathirvel, Rama Chellappa, Chun Pong Lau
  • for: To improve person re-identification (ReID) performance and generalization across different ReID tasks.
  • methods: SemReID, a self-supervised ReID model, leverages Interactive Segmentation Models (ISMs) for adaptive part-based semantic extraction and further refines its semantic representation with techniques such as image masking and KoLeo regularization.
  • results: Evaluations across three types of ReID datasets (standard ReID, CC-ReID, and unconstrained ReID) show performance superior to state-of-the-art methods. The paper also introduces the new large-scale LUPerson-Part dataset with fine-grained part semantics to help ReID methods acquire robust performance.
    Abstract Interactive Segmentation Models (ISMs) like the Segment Anything Model have significantly improved various computer vision tasks, yet their application to Person Re-identification (ReID) remains limited. On the other hand, existing semantic pre-training models for ReID often have limitations like predefined parsing ranges or coarse semantics. Additionally, ReID and Clothes-Changing ReID (CC-ReID) are usually treated separately due to their different domains. This paper investigates whether utilizing precise human-centric semantic representation can boost the ReID performance and improve the generalization among various ReID tasks. We propose SemReID, a self-supervised ReID model that leverages ISMs for adaptive part-based semantic extraction, contributing to the improvement of ReID performance. SemReID additionally refines its semantic representation through techniques such as image masking and KoLeo regularization. Evaluation across three types of ReID datasets -- standard ReID, CC-ReID, and unconstrained ReID -- demonstrates superior performance compared to state-of-the-art methods. In addition, recognizing the scarcity of large person datasets with fine-grained semantics, we introduce the novel LUPerson-Part dataset to assist ReID methods in acquiring the fine-grained part semantics for robust performance.

SemiVL: Semi-Supervised Semantic Segmentation with Vision-Language Guidance

  • paper_url: http://arxiv.org/abs/2311.16241
  • repo_url: https://github.com/google-research/semivl
  • paper_authors: Lukas Hoyer, David Joseph Tan, Muhammad Ferjad Naeem, Luc Van Gool, Federico Tombari
  • For: 这个研究目的是将双极性模型(VLM)给半指定式semantic segmentation,以减少注解量。* Methods: 这个方法使用了VLM的丰富先天知识,并将其应用到半指定式semantic segmentation中,以学习更好的semantic decision boundary。它还引入了一个空间精确化策略,以便在标签效率下进行label-efficient learning。* Results: 这个方法在4个semantic segmentation dataset上进行评估,和之前的半指定式方法相比,它表现出了 significiant improvement(+13.5 mIoU在COCO上,+6.1 mIoU在Pascal VOC上)。
    Abstract In semi-supervised semantic segmentation, a model is trained with a limited number of labeled images along with a large corpus of unlabeled images to reduce the high annotation effort. While previous methods are able to learn good segmentation boundaries, they are prone to confuse classes with similar visual appearance due to the limited supervision. On the other hand, vision-language models (VLMs) are able to learn diverse semantic knowledge from image-caption datasets but produce noisy segmentation due to the image-level training. In SemiVL, we propose to integrate rich priors from VLM pre-training into semi-supervised semantic segmentation to learn better semantic decision boundaries. To adapt the VLM from global to local reasoning, we introduce a spatial fine-tuning strategy for label-efficient learning. Further, we design a language-guided decoder to jointly reason over vision and language. Finally, we propose to handle inherent ambiguities in class labels by providing the model with language guidance in the form of class definitions. We evaluate SemiVL on 4 semantic segmentation datasets, where it significantly outperforms previous semi-supervised methods. For instance, SemiVL improves the state-of-the-art by +13.5 mIoU on COCO with 232 annotated images and by +6.1 mIoU on Pascal VOC with 92 labels. Project page: https://github.com/google-research/semivl

GART: Gaussian Articulated Template Models

  • paper_url: http://arxiv.org/abs/2311.16099
  • repo_url: https://github.com/JiahuiLei/GART
  • paper_authors: Jiahui Lei, Yufu Wang, Georgios Pavlakos, Lingjie Liu, Kostas Daniilidis
  • for: An explicit, efficient, and expressive representation for capturing and rendering non-rigid articulated subjects from monocular videos.
  • methods: A mixture of moving 3D Gaussians explicitly approximates the subject's geometry and appearance, combining a categorical template model prior (SMPL, SMAL, etc.) with learnable forward skinning and generalizing to more complex non-rigid deformations via novel latent bones.
  • results: GART can be reconstructed from monocular videos via differentiable rendering in seconds or minutes and rendered in novel poses at more than 150 fps.
    Abstract We introduce Gaussian Articulated Template Model GART, an explicit, efficient, and expressive representation for non-rigid articulated subject capturing and rendering from monocular videos. GART utilizes a mixture of moving 3D Gaussians to explicitly approximate a deformable subject's geometry and appearance. It takes advantage of a categorical template model prior (SMPL, SMAL, etc.) with learnable forward skinning while further generalizing to more complex non-rigid deformations with novel latent bones. GART can be reconstructed via differentiable rendering from monocular videos in seconds or minutes and rendered in novel poses faster than 150fps.
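To make the learnable forward skinning of 3D Gaussians concrete, here is a minimal linear-blend-skinning sketch that poses Gaussian means and covariances with blended bone transforms; all tensor names are hypothetical, and GART additionally learns the skinning weights and introduces latent bones beyond this plain LBS step.

```python
import torch

def skin_gaussians(means, covs, weights, bone_transforms):
    """Pose canonical 3D Gaussians with linear blend skinning (sketch).

    means:           (N, 3)    canonical Gaussian centers
    covs:            (N, 3, 3) canonical covariances
    weights:         (N, B)    per-Gaussian skinning weights, rows sum to 1
    bone_transforms: (B, 4, 4) rigid transforms of the template bones
    """
    # Blend the bone transforms per Gaussian: (N, 4, 4)
    T = torch.einsum("nb,bij->nij", weights, bone_transforms)
    R, t = T[:, :3, :3], T[:, :3, 3]
    posed_means = torch.einsum("nij,nj->ni", R, means) + t
    # Rotating covariances with the blended rotation is an approximation,
    # since a convex blend of rotations is not exactly a rotation.
    posed_covs = R @ covs @ R.transpose(1, 2)
    return posed_means, posed_covs
```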

CG-HOI: Contact-Guided 3D Human-Object Interaction Generation

  • paper_url: http://arxiv.org/abs/2311.16097
  • repo_url: None
  • paper_authors: Christian Diller, Angela Dai
  • For: 生成动态3D人物互动场景(HOI)的任务。* Methods: joint diffusion process和cross-attention模型人体和物体的运动,以及在推理过程中使用Contact为导航。* Results: 可以生成真实和物理可能的人物互动序列,并且可以根据给定的物体轨迹来生成相应的人体运动,表明了人物之间的互dependent学习。
    Abstract We propose CG-HOI, the first method to address the task of generating dynamic 3D human-object interactions (HOIs) from text. We model the motion of both human and object in an interdependent fashion, as semantically rich human motion rarely happens in isolation without any interactions. Our key insight is that explicitly modeling contact between the human body surface and object geometry can be used as strong proxy guidance, both during training and inference. Using this guidance to bridge human and object motion enables generating more realistic and physically plausible interaction sequences, where the human body and corresponding object move in a coherent manner. Our method first learns to model human motion, object motion, and contact in a joint diffusion process, inter-correlated through cross-attention. We then leverage this learned contact for guidance during inference synthesis of realistic, coherent HOIs. Extensive evaluation shows that our joint contact-based human-object interaction approach generates realistic and physically plausible sequences, and we show two applications highlighting the capabilities of our method. Conditioned on a given object trajectory, we can generate the corresponding human motion without re-training, demonstrating strong human-object interdependency learning. Our approach is also flexible, and can be applied to static real-world 3D scene scans.

Animatable Gaussians: Learning Pose-dependent Gaussian Maps for High-fidelity Human Avatar Modeling

  • paper_url: http://arxiv.org/abs/2311.16096
  • repo_url: https://github.com/lizhe00/animatablegaussians
  • paper_authors: Zhe Li, Zerong Zheng, Lizhen Wang, Yebin Liu
  • for: Modeling animatable human avatars from RGB videos.
  • methods: Prior work represents 3D humans with MLP-based neural radiance fields (NeRF), which struggle to regress pose-dependent garment details. Animatable Gaussians instead learns a parametric template from the input video, parameterizes it on front and back canonical Gaussian maps where each pixel encodes a 3D Gaussian, and uses a StyleGAN-based CNN to predict pose-dependent Gaussian maps; a pose projection strategy improves generalization to novel poses.
  • results: The method creates lifelike avatars with dynamic, realistic, and generalized appearances, outperforming other state-of-the-art approaches. Code: https://github.com/lizhe00/AnimatableGaussians
    Abstract Modeling animatable human avatars from RGB videos is a long-standing and challenging problem. Recent works usually adopt MLP-based neural radiance fields (NeRF) to represent 3D humans, but it remains difficult for pure MLPs to regress pose-dependent garment details. To this end, we introduce Animatable Gaussians, a new avatar representation that leverages powerful 2D CNNs and 3D Gaussian splatting to create high-fidelity avatars. To associate 3D Gaussians with the animatable avatar, we learn a parametric template from the input videos, and then parameterize the template on two front \& back canonical Gaussian maps where each pixel represents a 3D Gaussian. The learned template is adaptive to the wearing garments for modeling looser clothes like dresses. Such template-guided 2D parameterization enables us to employ a powerful StyleGAN-based CNN to learn the pose-dependent Gaussian maps for modeling detailed dynamic appearances. Furthermore, we introduce a pose projection strategy for better generalization given novel poses. Overall, our method can create lifelike avatars with dynamic, realistic and generalized appearances. Experiments show that our method outperforms other state-of-the-art approaches. Code: https://github.com/lizhe00/AnimatableGaussians

Street TryOn: Learning In-the-Wild Virtual Try-On from Unpaired Person Images

  • paper_url: http://arxiv.org/abs/2311.16094
  • repo_url: None
  • paper_authors: Aiyu Cui, Jay Mahajan, Viraj Shah, Preeti Gomathinayagam, Svetlana Lazebnik
  • for: This paper focuses on virtual try-on for in-the-wild scenes, specifically street scenes, and aims to fill the gap in current research by introducing a new benchmark and a novel method that can learn without paired data.
  • methods: The proposed method combines a DensePose warping correction method with diffusion-based inpainting controlled by pose and semantic segmentation to achieve robust performance across shop and street domains.
  • results: Experiments demonstrate competitive performance on standard studio try-on tasks and state-of-the-art (SOTA) performance on street try-on and cross-domain try-on tasks.
    Abstract Virtual try-on has become a popular research topic, but most existing methods focus on studio images with a clean background. They can achieve plausible results for this studio try-on setting by learning to warp a garment image to fit a person's body from paired training data, i.e., garment images paired with images of people wearing the same garment. Such data is often collected from commercial websites, where each garment is demonstrated both by itself and on several models. By contrast, it is hard to collect paired data for in-the-wild scenes, and therefore, virtual try-on for casual images of people against cluttered backgrounds is rarely studied. In this work, we fill the gap in the current virtual try-on research by (1) introducing a Street TryOn benchmark to evaluate performance on street scenes and (2) proposing a novel method that can learn without paired data, from a set of in-the-wild person images directly. Our method can achieve robust performance across shop and street domains using a novel DensePose warping correction method combined with diffusion-based inpainting controlled by pose and semantic segmentation. Our experiments demonstrate competitive performance for standard studio try-on tasks and SOTA performance for street try-on and cross-domain try-on tasks.

Self-correcting LLM-controlled Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.16090
  • repo_url: None
  • paper_authors: Tsung-Han Wu, Long Lian, Joseph E. Gonzalez, Boyi Li, Trevor Darrell
  • for: Improving the accuracy and reliability of diffusion models in text-to-image generation.
  • methods: The Self-correcting LLM-controlled Diffusion (SLD) framework generates an image from the input prompt, assesses its alignment with the prompt, and self-corrects the inaccuracies in the generated image, turning generation into an iterative closed-loop process steered by an LLM controller.
  • results: Experiments show SLD can rectify a majority of incorrect generations, particularly in generative numeracy, attribute binding, and spatial relationships; by simply adjusting the LLM's instructions, SLD can also perform image editing, bridging text-to-image generation and image editing pipelines.
    Abstract Text-to-image generation has witnessed significant progress with the advent of diffusion models. Despite the ability to generate photorealistic images, current text-to-image diffusion models still often struggle to accurately interpret and follow complex input text prompts. In contrast to existing models that aim to generate images only with their best effort, we introduce Self-correcting LLM-controlled Diffusion (SLD). SLD is a framework that generates an image from the input prompt, assesses its alignment with the prompt, and performs self-corrections on the inaccuracies in the generated image. Steered by an LLM controller, SLD turns text-to-image generation into an iterative closed-loop process, ensuring correctness in the resulting image. SLD is not only training-free but can also be seamlessly integrated with diffusion models behind API access, such as DALL-E 3, to further boost the performance of state-of-the-art diffusion models. Experimental results show that our approach can rectify a majority of incorrect generations, particularly in generative numeracy, attribute binding, and spatial relationships. Furthermore, by simply adjusting the instructions to the LLM, SLD can perform image editing tasks, bridging the gap between text-to-image generation and image editing pipelines. We will make our code available for future research and applications.

MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model

  • paper_url: http://arxiv.org/abs/2311.16498
  • repo_url: https://github.com/magic-research/magic-animate
  • paper_authors: Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Hanshu Yan, Jia-Wei Liu, Chenxu Zhang, Jiashi Feng, Mike Zheng Shou
  • for: Human image animation: generating a video of a reference identity following a given motion sequence. Existing frame-warping approaches struggle to maintain temporal consistency and to preserve the reference identity.
  • methods: MagicAnimate, a diffusion-based framework, first develops a video diffusion model to encode temporal information, then introduces a novel appearance encoder to preserve frame-to-frame appearance coherence, and finally applies a simple video fusion technique for smooth transitions in long animations.
  • results: The method outperforms baselines on two benchmarks, surpassing the strongest baseline by over 38% in video fidelity on the challenging TikTok dancing dataset.
    Abstract This paper studies the human image animation task, which aims to generate a video of a certain reference identity following a particular motion sequence. Existing animation works typically employ the frame-warping technique to animate the reference image towards the target motion. Despite achieving reasonable results, these approaches face challenges in maintaining temporal consistency throughout the animation due to the lack of temporal modeling and poor preservation of reference identity. In this work, we introduce MagicAnimate, a diffusion-based framework that aims at enhancing temporal consistency, preserving reference image faithfully, and improving animation fidelity. To achieve this, we first develop a video diffusion model to encode temporal information. Second, to maintain the appearance coherence across frames, we introduce a novel appearance encoder to retain the intricate details of the reference image. Leveraging these two innovations, we further employ a simple video fusion technique to encourage smooth transitions for long video animation. Empirical results demonstrate the superiority of our method over baseline approaches on two benchmarks. Notably, our approach outperforms the strongest baseline by over 38% in terms of video fidelity on the challenging TikTok dancing dataset. Code and model will be made available.
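The abstract only names a "simple video fusion technique" for long animations. One plausible minimal version, assuming the animation is generated in overlapping sliding windows whose overlapping frames are averaged (the segment layout and stride are assumptions, not the paper's stated scheme):

```python
import torch

def fuse_overlapping_segments(segments, stride):
    """Average overlapping video segments into one long sequence (sketch).

    segments: list of (L, C, H, W) tensors generated window-by-window,
              with consecutive windows starting `stride` frames apart.
    """
    L = segments[0].shape[0]
    total = stride * (len(segments) - 1) + L
    out = torch.zeros(total, *segments[0].shape[1:])
    count = torch.zeros(total, 1, 1, 1)
    for i, seg in enumerate(segments):
        s = i * stride
        out[s : s + L] += seg
        count[s : s + L] += 1
    return out / count                     # overlap regions are averaged
```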

DiffSLVA: Harnessing Diffusion Models for Sign Language Video Anonymization

  • paper_url: http://arxiv.org/abs/2311.16060
  • repo_url: https://github.com/jeffery9707/diffslva
  • paper_authors: Zhaoyang Xia, Carol Neidle, Dimitris N. Metaxas
  • for: Providing a practical method for sign language video anonymization, with real-world benefits for the Deaf and Hard-of-Hearing communities.
  • methods: The approach uses pre-trained large-scale diffusion models with ControlNet driven by low-level HED (Holistically-Nested Edge Detection) edges, bypassing precise pose estimation, and adds a specialized module to capture facial expressions, which carry critical linguistic information in signed languages.
  • results: Signer anonymization experiments show the method can replace the signer's appearance while preserving the essential linguistic content of the original video.
    Abstract Since American Sign Language (ASL) has no standard written form, Deaf signers frequently share videos in order to communicate in their native language. However, since both hands and face convey critical linguistic information in signed languages, sign language videos cannot preserve signer privacy. While signers have expressed interest, for a variety of applications, in sign language video anonymization that would effectively preserve linguistic content, attempts to develop such technology have had limited success, given the complexity of hand movements and facial expressions. Existing approaches rely predominantly on precise pose estimations of the signer in video footage and often require sign language video datasets for training. These requirements prevent them from processing videos 'in the wild,' in part because of the limited diversity present in current sign language video datasets. To address these limitations, our research introduces DiffSLVA, a novel methodology that utilizes pre-trained large-scale diffusion models for zero-shot text-guided sign language video anonymization. We incorporate ControlNet, which leverages low-level image features such as HED (Holistically-Nested Edge Detection) edges, to circumvent the need for pose estimation. Additionally, we develop a specialized module dedicated to capturing facial expressions, which are critical for conveying essential linguistic information in signed languages. We then combine the above methods to achieve anonymization that better preserves the essential linguistic content of the original signer. This innovative methodology makes possible, for the first time, sign language video anonymization that could be used for real-world applications, which would offer significant benefits to the Deaf and Hard-of-Hearing communities. We demonstrate the effectiveness of our approach with a series of signer anonymization experiments.
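The building blocks named above (a pre-trained diffusion model plus ControlNet conditioned on HED edges) can be approximated per frame with the Hugging Face diffusers and controlnet_aux libraries. The checkpoints, file names, and prompt below are illustrative; the actual system adds a dedicated facial-expression module and temporal handling that this sketch omits.

```python
import torch
from controlnet_aux import HEDdetector
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# HED edges preserve hand and face contours without any pose estimation
hed = HEDdetector.from_pretrained("lllyasviel/Annotators")
frame = load_image("signer_frame.png")          # hypothetical input frame
edges = hed(frame)

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-hed", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# The text prompt specifies the replacement (anonymized) appearance
anon = pipe(
    "a photo of a person signing, studio lighting",
    image=edges, num_inference_steps=30,
).images[0]
anon.save("anonymized_frame.png")
```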

Seeing Beyond Cancer: Multi-Institutional Validation of Object Localization and 3D Semantic Segmentation using Deep Learning for Breast MRI

  • paper_url: http://arxiv.org/abs/2311.16213
  • repo_url: None
  • paper_authors: Arda Pekis, Vignesh Kannan, Evandros Kaklamanos, Anu Antony, Snehal Patel, Tyler Earnest
  • for: breast cancer staging, prognosis, and surgical planning
  • methods: semantic segmentation using 2D object detectors and 3D U-nets, pre-trained on ImageNet and COCO, and operated on MIP images
  • results: superior Dice score on tumor segmentation while maintaining competitive performance on other studied tissues across multiple institutions
    Abstract The clinical management of breast cancer depends on an accurate understanding of the tumor and its anatomical context to adjacent tissues and landmark structures. This context may be provided by semantic segmentation methods; however, previous works have been largely limited to a singular focus on the tumor alone and rarely other tissue types. In contrast, we present a method that exploits tissue-tissue interactions to accurately segment every major tissue type in the breast including: chest wall, skin, adipose tissue, fibroglandular tissue, vasculature and tumor via standard-of-care Dynamic Contrast Enhanced MRI. Comparing our method to prior state-of-the-art, we achieved a superior Dice score on tumor segmentation while maintaining competitive performance on other studied tissues across multiple institutions. Briefly, our method proceeds by localizing the tumor using 2D object detectors, then segmenting the tumor and surrounding tissues independently using two 3D U-nets, and finally integrating these results while mitigating false positives by checking for anatomically plausible tissue-tissue contacts. The object detection models were pre-trained on ImageNet and COCO, and operated on MIP (maximum intensity projection) images in the axial and sagittal planes, establishing a 3D tumor bounding box. By integrating multiple relevant peri-tumoral tissues, our work enables clinical applications in breast cancer staging, prognosis and surgical planning.
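The object detectors above operate on maximum intensity projection (MIP) images. A minimal sketch of computing axial and sagittal MIPs from a CTA volume, assuming a (Z, Y, X) array layout:

```python
import numpy as np

def maximum_intensity_projections(volume: np.ndarray):
    """Collapse a CTA volume into axial and sagittal MIP images (sketch).

    volume: (Z, Y, X) array of Hounsfield units.
    """
    axial_mip = volume.max(axis=0)     # collapse slices -> (Y, X) image
    sagittal_mip = volume.max(axis=2)  # collapse left-right -> (Z, Y) image
    return axial_mip, sagittal_mip
```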

Segment Every Out-of-Distribution Object

  • paper_url: http://arxiv.org/abs/2311.16516
  • repo_url: None
  • paper_authors: Wenjie Zhao, Jia Li, Xin Dong, Yu Xiang, Yunhui Guo
  • for: Improving the real-world deployment of semantic segmentation models by detecting out-of-distribution (OoD) objects.
  • methods: The paper introduces S2M, which converts anomaly scores into prompts for a promptable segmentation model. Unlike traditional score-thresholding approaches, S2M directly segments the entire OoD object.
  • results: Experiments show S2M outperforms the state of the art by roughly 10% in IoU and 30% in mean F1 score on average across benchmarks.
    Abstract Semantic segmentation models, while effective for in-distribution categories, face challenges in real-world deployment due to encountering out-of-distribution (OoD) objects. Detecting these OoD objects is crucial for safety-critical applications. Existing methods rely on anomaly scores, but choosing a suitable threshold for generating masks presents difficulties and can lead to fragmentation and inaccuracy. This paper introduces a method to convert anomaly Score To segmentation Mask, called S2M, a simple and effective framework for OoD detection in semantic segmentation. Unlike assigning anomaly scores to pixels, S2M directly segments the entire OoD object. By transforming anomaly scores into prompts for a promptable segmentation model, S2M eliminates the need for threshold selection. Extensive experiments demonstrate that S2M outperforms the state-of-the-art by approximately 10% in IoU and 30% in mean F1 score, on average, across various benchmarks including Fishyscapes, Segment-Me-If-You-Can, and RoadAnomaly datasets.
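The core idea of turning anomaly scores into prompts for a promptable segmenter, rather than thresholding them into a mask, can be illustrated with an off-the-shelf SAM. S2M trains its own prompt pipeline, so the checkpoint and the single most-anomalous-point prompt below are simplifying assumptions:

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

def segment_ood_from_anomaly(image_rgb: np.ndarray, anomaly: np.ndarray,
                             checkpoint: str = "sam_vit_h.pth") -> np.ndarray:
    """Prompt SAM with the most anomalous pixel to get a whole-object mask."""
    sam = sam_model_registry["vit_h"](checkpoint=checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(image_rgb)
    y, x = np.unravel_index(np.argmax(anomaly), anomaly.shape)
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[x, y]]),   # SAM expects (x, y) order
        point_labels=np.array([1]),        # 1 marks a foreground point
        multimask_output=False,
    )
    return masks[0]
```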

Exploring Attribute Variations in Style-based GANs using Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.16052
  • repo_url: None
  • paper_authors: Rishubh Parihar, Prasanna Balaji, Raghav Magazine, Sarthak Vora, Tejan Karmali, Varun Jampani, R. Venkatesh Babu
  • for: Enabling diverse attribute editing, so that users can generate multiple plausible edits per attribute.
  • methods: The method trains a Denoising Diffusion Probabilistic Model (DDPM) over edit latent directions in the disentangled latent space of a pretrained GAN, obtained by embedding image pairs that differ in a single attribute.
  • results: Extensive qualitative and quantitative experiments across a range of datasets demonstrate the effectiveness of the approach, including applications to 3D editing of face attributes.
    Abstract Existing attribute editing methods treat semantic attributes as binary, resulting in a single edit per attribute. However, attributes such as eyeglasses, smiles, or hairstyles exhibit a vast range of diversity. In this work, we formulate the task of \textit{diverse attribute editing} by modeling the multidimensional nature of attribute edits. This enables users to generate multiple plausible edits per attribute. We capitalize on disentangled latent spaces of pretrained GANs and train a Denoising Diffusion Probabilistic Model (DDPM) to learn the latent distribution for diverse edits. Specifically, we train DDPM over a dataset of edit latent directions obtained by embedding image pairs with a single attribute change. This leads to latent subspaces that enable diverse attribute editing. Applying diffusion in the highly compressed latent space allows us to model rich distributions of edits within limited computational resources. Through extensive qualitative and quantitative experiments conducted across a range of datasets, we demonstrate the effectiveness of our approach for diverse attribute editing. We also showcase the results of our method applied for 3D editing of various face attributes.
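A minimal sketch of the setup described above, training a DDPM on GAN latent-space edit directions; the noise schedule, the `eps_model(x_t, t)` signature, and the shapes are assumptions following the standard DDPM objective:

```python
import torch
import torch.nn.functional as F

def ddpm_training_step(eps_model, directions, alphas_cumprod):
    """One denoising-diffusion training step on latent edit vectors (sketch).

    directions:     (B, D) batch of GAN latent-space edit directions
    alphas_cumprod: (T,)   precomputed cumulative noise schedule
    eps_model:      network predicting the added noise from (x_t, t)
    """
    B = directions.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,), device=directions.device)
    a_bar = alphas_cumprod[t].unsqueeze(1)                        # (B, 1)
    noise = torch.randn_like(directions)
    x_t = a_bar.sqrt() * directions + (1 - a_bar).sqrt() * noise  # forward diffusion
    return F.mse_loss(eps_model(x_t, t), noise)                   # predict the noise
```

Because the edit directions are low-dimensional compared to images, diffusion in this space is cheap, which matches the paper's point about modeling rich edit distributions within limited compute.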

SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution

  • paper_url: http://arxiv.org/abs/2311.16518
  • repo_url: https://github.com/cswry/seesr
  • paper_authors: Rongyuan Wu, Tao Yang, Lingchen Sun, Zhengqiang Zhang, Shuai Li, Lei Zhang
  • for: Improving semantic fidelity in real-world image super-resolution.
  • methods: A degradation-aware prompt extractor generates accurate soft and hard semantic prompts even under strong degradation, and during inference the LR image is integrated into the initial sampling noise to keep the T2I model from generating excessive random details.
  • results: The method reproduces more realistic image details while better preserving semantics.
    Abstract Owing to their powerful generative priors, pre-trained text-to-image (T2I) diffusion models have become increasingly popular in solving the real-world image super-resolution problem. However, as a consequence of the heavy quality degradation of input low-resolution (LR) images, the destruction of local structures can lead to ambiguous image semantics. As a result, the content of the reproduced high-resolution image may have semantic errors, deteriorating the super-resolution performance. To address this issue, we present a semantics-aware approach to better preserve the semantic fidelity of generative real-world image super-resolution. First, we train a degradation-aware prompt extractor, which can generate accurate soft and hard semantic prompts even under strong degradation. The hard semantic prompts refer to the image tags, aiming to enhance the local perception ability of the T2I model, while the soft semantic prompts compensate for the hard ones to provide additional representation information. These semantic prompts can encourage the T2I model to generate detailed and semantically accurate results. Furthermore, during the inference process, we integrate the LR images into the initial sampling noise to mitigate the diffusion model's tendency to generate excessive random details. The experiments show that our method can reproduce more realistic image details and better preserve the semantics.
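The trick of integrating the LR image into the initial sampling noise can be sketched as forward-diffusing the encoded LR latent to the final timestep instead of starting from pure Gaussian noise; SeeSR's exact blend may differ, so treat this as an illustration:

```python
import torch

def lr_guided_initial_noise(lr_latent: torch.Tensor,
                            alphas_cumprod: torch.Tensor) -> torch.Tensor:
    """Start sampling from a noised LR latent rather than pure noise (sketch).

    Keeping the low-frequency content of the input in x_T gives the
    sampler less room to hallucinate spurious random details.
    """
    a_bar_T = alphas_cumprod[-1]           # cumulative alpha at the last step
    noise = torch.randn_like(lr_latent)
    return a_bar_T.sqrt() * lr_latent + (1 - a_bar_T).sqrt() * noise
```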

Relightable 3D Gaussian: Real-time Point Cloud Relighting with BRDF Decomposition and Ray Tracing

  • paper_url: http://arxiv.org/abs/2311.16043
  • repo_url: None
  • paper_authors: Jian Gao, Chun Gu, Youtian Lin, Hao Zhu, Xun Cao, Li Zhang, Yao Yao
  • for: A differentiable point-based rendering framework for material and lighting decomposition from multi-view images, enabling editing, ray tracing, and real-time relighting of 3D point clouds.
  • methods: The scene is represented as relightable 3D Gaussian points, each carrying a normal direction, BRDF parameters, and incident light split into global and local components with view-dependent visibility; the scene is optimized via 3D Gaussian Splatting while BRDF and lighting are decomposed through physically based differentiable rendering.
  • results: The framework achieves better BRDF estimation and novel-view rendering than state-of-the-art material estimation approaches, and a bounding-volume-hierarchy-based point ray-tracing scheme enables efficient visibility baking for real-time rendering and relighting with accurate shadows.
    Abstract We present a novel differentiable point-based rendering framework for material and lighting decomposition from multi-view images, enabling editing, ray-tracing, and real-time relighting of the 3D point cloud. Specifically, a 3D scene is represented as a set of relightable 3D Gaussian points, where each point is additionally associated with a normal direction, BRDF parameters, and incident lights from different directions. To achieve robust lighting estimation, we further divide incident lights of each point into global and local components, as well as view-dependent visibilities. The 3D scene is optimized through the 3D Gaussian Splatting technique while BRDF and lighting are decomposed by physically-based differentiable rendering. Moreover, we introduce an innovative point-based ray-tracing approach based on the bounding volume hierarchy for efficient visibility baking, enabling real-time rendering and relighting of 3D Gaussian points with accurate shadow effects. Extensive experiments demonstrate improved BRDF estimation and novel view rendering results compared to state-of-the-art material estimation approaches. Our framework showcases the potential to revolutionize the mesh-based graphics pipeline with a relightable, traceable, and editable rendering pipeline solely based on point cloud. Project page:https://nju-3dv.github.io/projects/Relightable3DGaussian/.

Weakly-Supervised 3D Reconstruction of Clothed Humans via Normal Maps

  • paper_url: http://arxiv.org/abs/2311.16042
  • repo_url: None
  • paper_authors: Jane Wu, Diego Thomas, Ronald Fedkiw
  • for: Deep-learning-based 3D reconstruction of clothed humans using weak supervision from 2D normal maps only.
  • methods: Given a single RGB image or multi-view images, the network infers a signed distance function (SDF) on a tetrahedral mesh; Marching Tetrahedra uniquely computes a triangulated surface from the SDF, making differentiation (and thus backpropagation from normal-map supervision) straightforward.
  • results: The approach produces faithful 3D human reconstructions, and an optional multi-view loss further improves results.
    Abstract We present a novel deep learning-based approach to the 3D reconstruction of clothed humans using weak supervision via 2D normal maps. Given a single RGB image or multiview images, our network infers a signed distance function (SDF) discretized on a tetrahedral mesh surrounding the body in a rest pose. Subsequently, inferred pose and camera parameters are used to generate a normal map from the SDF. A key aspect of our approach is the use of Marching Tetrahedra to (uniquely) compute a triangulated surface from the SDF on the tetrahedral mesh, facilitating straightforward differentiation (and thus backpropagation). Thus, given only ground truth normal maps (with no volumetric information ground truth information), we can train the network to produce SDF values from corresponding RGB images. Optionally, an additional multiview loss leads to improved results. We demonstrate the efficacy of our approach for both network inference and 3D reconstruction.
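Since supervision comes only from 2D normal maps, the key quantity is the surface normal derived from the inferred SDF, i.e. its normalized gradient. A minimal autograd sketch, where `sdf_fn` is a hypothetical callable and the points would come from the extracted Marching Tetrahedra surface:

```python
import torch
import torch.nn.functional as F

def sdf_normals(sdf_fn, points: torch.Tensor) -> torch.Tensor:
    """Surface normals as the normalized SDF gradient (sketch).

    sdf_fn: callable mapping (N, 3) points to (N,) signed distances
    points: (N, 3) query points, e.g. vertices of the extracted surface
    """
    points = points.detach().requires_grad_(True)
    sdf = sdf_fn(points)
    grad = torch.autograd.grad(sdf.sum(), points, create_graph=True)[0]
    return F.normalize(grad, dim=-1)   # create_graph lets the normal loss backprop
```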

GaussianEditor: Editing 3D Gaussians Delicately with Text Instructions

  • paper_url: http://arxiv.org/abs/2311.16037
  • repo_url: None
  • paper_authors: Jiemin Fang, Junjie Wang, Xiaopeng Zhang, Lingxi Xie, Qi Tian
  • for: 3D scene editing, especially delicate, localized edits driven by text instructions.
  • methods: A systematic framework, GaussianEditor, extracts the region of interest (RoI) corresponding to the text instruction, aligns it to 3D Gaussians, and uses the Gaussian RoI to control the editing process.
  • results: Compared with previous methods, GaussianEditor edits 3D scenes more delicately and precisely while training much faster: about 20 minutes on a single V100 GPU, more than twice as fast as Instruct-NeRF2NeRF (45 minutes to 2 hours).
    Abstract Recently, impressive results have been achieved in 3D scene editing with text instructions based on a 2D diffusion model. However, current diffusion models primarily generate images by predicting noise in the latent space, and the editing is usually applied to the whole image, which makes it challenging to perform delicate, especially localized, editing for 3D scenes. Inspired by recent 3D Gaussian splatting, we propose a systematic framework, named GaussianEditor, to edit 3D scenes delicately via 3D Gaussians with text instructions. Benefiting from the explicit property of 3D Gaussians, we design a series of techniques to achieve delicate editing. Specifically, we first extract the region of interest (RoI) corresponding to the text instruction, aligning it to 3D Gaussians. The Gaussian RoI is further used to control the editing process. Our framework can achieve more delicate and precise editing of 3D scenes than previous methods while enjoying much faster training speed, i.e. within 20 minutes on a single V100 GPU, more than twice as fast as Instruct-NeRF2NeRF (45 minutes -- 2 hours).

GaitContour: Efficient Gait Recognition based on a Contour-Pose Representation

  • paper_url: http://arxiv.org/abs/2311.16497
  • repo_url: None
  • paper_authors: Yuxiang Guo, Anshul Shah, Jiang Liu, Rama Chellappa, Cheng Peng
  • for: Proposing a novel, point-based Contour-Pose representation for gait recognition that compactly expresses both body shape and body-part information.
  • methods: A local-to-global architecture, GaitContour, first extracts features from five body regions with a local transformer and then aggregates them into a global gait representation, greatly reducing the complexity of the attention operation.
  • results: Experiments show GaitContour performs significantly better than previous point-based methods while being far more efficient than silhouette-based methods; on challenging datasets with significant distractors it can even outperform silhouette-based methods.
    Abstract Gait recognition holds the promise to robustly identify subjects based on walking patterns instead of appearance information. In recent years, this field has been dominated by learning methods based on two principal input representations: dense silhouette masks or sparse pose keypoints. In this work, we propose a novel, point-based Contour-Pose representation, which compactly expresses both body shape and body parts information. We further propose a local-to-global architecture, called GaitContour, to leverage this novel representation and efficiently compute subject embedding in two stages. The first stage consists of a local transformer that extracts features from five different body regions. The second stage then aggregates the regional features to estimate a global human gait representation. Such a design significantly reduces the complexity of the attention operation and improves efficiency and performance simultaneously. Through large scale experiments, GaitContour is shown to perform significantly better than previous point-based methods, while also being significantly more efficient than silhouette-based methods. On challenging datasets with significant distractors, GaitContour can even outperform silhouette-based methods.

VLPrompt: Vision-Language Prompting for Panoptic Scene Graph Generation

  • paper_url: http://arxiv.org/abs/2311.16492
  • repo_url: None
  • paper_authors: Zijian Zhou, Miaojing Shi, Holger Caesar
  • for: Comprehensive image understanding that simultaneously segments objects and predicts relations among them (panoptic scene graph generation).
  • methods: The method leverages language information from large language models alongside visual information, fusing them through an attention-based prompter network for relation prediction.
  • results: It significantly outperforms previous state-of-the-art methods on the PSG dataset, alleviating the long-tail problem among relations that hampers real-world applications.
    Abstract Panoptic Scene Graph Generation (PSG) aims at achieving a comprehensive image understanding by simultaneously segmenting objects and predicting relations among objects. However, the long-tail problem among relations leads to unsatisfactory results in real-world applications. Prior methods predominantly rely on vision information or utilize limited language information, such as object or relation names, thereby overlooking the utility of language information. Leveraging the recent progress in Large Language Models (LLMs), we propose to use language information to assist relation prediction, particularly for rare relations. To this end, we propose the Vision-Language Prompting (VLPrompt) model, which acquires vision information from images and language information from LLMs. Then, through a prompter network based on attention mechanism, it achieves precise relation prediction. Our extensive experiments show that VLPrompt significantly outperforms previous state-of-the-art methods on the PSG dataset, proving the effectiveness of incorporating language information and alleviating the long-tail problem of relations.

Automated Measurement of Vascular Calcification in Femoral Endarterectomy Patients Using Deep Learning

  • paper_url: http://arxiv.org/abs/2311.16001
  • repo_url: https://github.com/pip-alireza/deepcalcscoring
  • paper_authors: Alireza Bagheri Rajeoni, Breanna Pederson, Daniel G. Clair, Susan M. Lessner, Homayoun Valafar
  • For: Developing a deep learning model for automated analysis of vascular calcification in patients with peripheral arterial disease (PAD) undergoing femoral endarterectomy surgery.
  • Methods: A deep neural network (DNN) segments the vascular system in computed tomographic angiogram (CTA) images and measures vascular calcification from the left renal artery to the patella.
  • Results: The DNN achieves 83.4% average Dice accuracy in segmenting arteries from the aorta to the patella, outperforming previous state-of-the-art methods, with a Mean Absolute Percentage Error (MAPE) of 9.5% and a correlation coefficient of 0.978 between automated and manual calcification scores.
    Abstract Atherosclerosis, a chronic inflammatory disease affecting the large arteries, presents a global health risk. Accurate analysis of diagnostic images, like computed tomographic angiograms (CTAs), is essential for staging and monitoring the progression of atherosclerosis-related conditions, including peripheral arterial disease (PAD). However, manual analysis of CTA images is time-consuming and tedious. To address this limitation, we employed a deep learning model to segment the vascular system in CTA images of PAD patients undergoing femoral endarterectomy surgery and to measure vascular calcification from the left renal artery to the patella. Utilizing proprietary CTA images of 27 patients undergoing femoral endarterectomy surgery provided by Prisma Health Midlands, we developed a Deep Neural Network (DNN) model to first segment the arterial system, starting from the descending aorta to the patella, and second, to provide a metric of arterial calcification. Our designed DNN achieved 83.4% average Dice accuracy in segmenting arteries from aorta to patella, advancing the state-of-the-art by 0.8%. Furthermore, our work is the first to present a robust statistical analysis of automated calcification measurement in the lower extremities using deep learning, attaining a Mean Absolute Percentage Error (MAPE) of 9.5% and a correlation coefficient of 0.978 between automated and manual calcification scores. These findings underscore the potential of deep learning techniques as a rapid and accurate tool for medical professionals to assess calcification in the abdominal aorta and its branches above the patella. The developed DNN model and related documentation in this project are available at GitHub page at https://github.com/pip-alireza/DeepCalcScoring.
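For reference, the Dice accuracy reported above is the standard overlap metric between predicted and manual segmentation masks; a minimal sketch:

```python
import numpy as np

def dice_score(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Dice coefficient between two binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return float((2 * intersection + eps) / (pred.sum() + target.sum() + eps))
```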

Adversarial Doodles: Interpretable and Human-drawable Attacks Provide Describable Insights

  • paper_url: http://arxiv.org/abs/2311.15994
  • repo_url: None
  • paper_authors: Ryoya Nara, Yusuke Matsui
  • for: The paper is written for researchers and practitioners in the field of computer vision and machine learning, specifically those interested in adversarial attacks and defenses.
  • methods: The paper proposes a new method called Adversarial Doodles, which uses black Bézier curves to generate interpretable adversarial examples that can provide insights into the mechanism of the target classifier. The method optimizes the doodled area and introduces random perspective transformation to obtain compact attacks that can be replicated by hand.
  • results: The paper demonstrates the effectiveness of Adversarial Doodles in fooling state-of-the-art deep neural network (DNN)-based image classification models. The authors show that the generated adversarial examples have interpretable shapes and provide describable insights into the relationship between the attacks and the classifier’s output. For example, the authors demonstrate that adding two strokes on the head of a bird image can cause the classifier to misclassify it as a butterfly.
    Abstract DNN-based image classification models are susceptible to adversarial attacks. Most previous adversarial attacks do not focus on the interpretability of the generated adversarial examples, and we cannot gain insights into the mechanism of the target classifier from the attacks. Therefore, we propose Adversarial Doodles, which have interpretable shapes. We optimize black b\'ezier curves to fool the target classifier by overlaying them onto the input image. By introducing random perspective transformation and regularizing the doodled area, we obtain compact attacks that cause misclassification even when humans replicate them by hand. Adversarial doodles provide describable and intriguing insights into the relationship between our attacks and the classifier's output. We utilize adversarial doodles and discover the bias inherent in the target classifier, such as "We add two strokes on its head, a triangle onto its body, and two lines inside the triangle on a bird image. Then, the classifier misclassifies the image as a butterfly."
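The attack optimizes black Bézier curves overlaid on the image, so evaluating the curves must stay differentiable with respect to their control points. A minimal sketch of cubic Bézier sampling (shapes and sample count are illustrative, and rasterizing the sampled points onto the image is omitted):

```python
import torch

def cubic_bezier(ctrl: torch.Tensor, n_samples: int = 64) -> torch.Tensor:
    """Sample points along cubic Bezier strokes (sketch).

    ctrl: (K, 4, 2) control points of K strokes -> (K, n_samples, 2) points.
    Because this is differentiable, the control points can be optimized
    directly against the target classifier's loss.
    """
    t = torch.linspace(0.0, 1.0, n_samples).view(1, -1, 1)       # (1, S, 1)
    p0, p1, p2, p3 = (ctrl[:, i : i + 1, :] for i in range(4))   # (K, 1, 2) each
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)           # Bernstein basis
```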

DiffAnt: Diffusion Models for Action Anticipation

  • paper_url: http://arxiv.org/abs/2311.15991
  • repo_url: None
  • paper_authors: Zeyun Zhong, Chengzhi Wu, Manuel Martin, Michael Voit, Juergen Gall, Jürgen Beyerer
  • for: Anticipating future actions under inherent uncertainty.
  • methods: Diffusion models capture the distribution of possible future actions, which are iteratively generated from standard Gaussian noise in the latent space, conditioned on the observed video, and then transitioned into the action space.
  • results: On four benchmark datasets (Breakfast, 50Salads, EpicKitchens, and EGTEA Gaze+), the method achieves results superior or comparable to state-of-the-art methods, demonstrating the effectiveness of the generative approach.
    Abstract Anticipating future actions is inherently uncertain. Given an observed video segment containing ongoing actions, multiple subsequent actions can plausibly follow. This uncertainty becomes even larger when predicting far into the future. However, the majority of existing action anticipation models adhere to a deterministic approach, neglecting to account for future uncertainties. In this work, we rethink action anticipation from a generative view, employing diffusion models to capture different possible future actions. In this framework, future actions are iteratively generated from standard Gaussian noise in the latent space, conditioned on the observed video, and subsequently transitioned into the action space. Extensive experiments on four benchmark datasets, i.e., Breakfast, 50Salads, EpicKitchens, and EGTEA Gaze+, are performed and the proposed method achieves superior or comparable results to state-of-the-art methods, showing the effectiveness of a generative approach for action anticipation. Our code and trained models will be published on GitHub.

Direct2.5: Diverse Text-to-3D Generation via Multi-view 2.5D Diffusion

  • paper_url: http://arxiv.org/abs/2311.15980
  • repo_url: None
  • paper_authors: Yuanxun Lu, Jingyang Zhang, Shiwei Li, Tian Fang, David McKinnon, Yanghai Tsin, Long Quan, Xun Cao, Yao Yao
  • for: Efficient, diverse, high-fidelity text-to-3D generation via multi-view 2.5D diffusion.
  • methods: A pre-trained 2D diffusion model is fine-tuned into a multi-view 2.5D diffusion that directly models the structural distribution of 3D data; the generated multi-view normal maps are fused into a consistent 3D model via a novel differentiable rasterization scheme, followed by a normal-conditioned multi-view image generation module for fast appearance synthesis.
  • results: Extensive experiments show the method generates diverse, mode-seeking-free, high-fidelity 3D content in only 10 seconds, without any SDS optimization as post-processing.
    Abstract Recent advances in generative AI have unveiled significant potential for the creation of 3D content. However, current methods either apply a pre-trained 2D diffusion model with the time-consuming score distillation sampling (SDS), or a direct 3D diffusion model trained on limited 3D data losing generation diversity. In this work, we approach the problem by employing a multi-view 2.5D diffusion fine-tuned from a pre-trained 2D diffusion model. The multi-view 2.5D diffusion directly models the structural distribution of 3D data, while still maintaining the strong generalization ability of the original 2D diffusion model, filling the gap between 2D diffusion-based and direct 3D diffusion-based methods for 3D content generation. During inference, multi-view normal maps are generated using the 2.5D diffusion, and a novel differentiable rasterization scheme is introduced to fuse the almost consistent multi-view normal maps into a consistent 3D model. We further design a normal-conditioned multi-view image generation module for fast appearance generation given the 3D geometry. Our method is a one-pass diffusion process and does not require any SDS optimization as post-processing. We demonstrate through extensive experiments that, our direct 2.5D generation with the specially-designed fusion scheme can achieve diverse, mode-seeking-free, and high-fidelity 3D content generation in only 10 seconds. Project page: https://nju-3dv.github.io/projects/direct25.

Text2Loc: 3D Point Cloud Localization from Natural Language

  • paper_url: http://arxiv.org/abs/2311.15977
  • repo_url: None
  • paper_authors: Yan Xia, Letian Shi, Zifeng Ding, João F. Henriques, Daniel Cremers
  • for: This work targets 3D point cloud localization from natural language descriptions and introduces a novel neural network, Text2Loc, that fully interprets the semantic relationship between points and text.
  • methods: Text2Loc follows a coarse-to-fine localization pipeline: text-submap global place recognition, then fine localization. Global place recognition captures relational dynamics among textual hints with a hierarchical transformer with max-pooling (HTM), while fine localization uses a novel matching-free method that removes the need for text-instance matching and is lighter, faster, and more accurate than previous methods.
  • results: Extensive experiments show that Text2Loc improves localization accuracy by up to 2x over the state of the art on the KITTI360Pose dataset.
    Abstract We tackle the problem of 3D point cloud localization based on a few natural linguistic descriptions and introduce a novel neural network, Text2Loc, that fully interprets the semantic relationship between points and text. Text2Loc follows a coarse-to-fine localization pipeline: text-submap global place recognition, followed by fine localization. In global place recognition, relational dynamics among each textual hint are captured in a hierarchical transformer with max-pooling (HTM), whereas a balance between positive and negative pairs is maintained using text-submap contrastive learning. Moreover, we propose a novel matching-free fine localization method to further refine the location predictions, which completely removes the need for complicated text-instance matching and is lighter, faster, and more accurate than previous methods. Extensive experiments show that Text2Loc improves the localization accuracy by up to 2x over the state-of-the-art on the KITTI360Pose dataset. We will make the code publicly available.
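The text-submap contrastive learning in the global place recognition stage can be sketched as a standard symmetric InfoNCE loss between paired text and submap embeddings; Text2Loc's exact loss and temperature are not given here, so these are assumptions:

```python
import torch
import torch.nn.functional as F

def text_submap_contrastive_loss(text_emb: torch.Tensor,
                                 submap_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: row i of each input is a positive pair (sketch)."""
    text_emb = F.normalize(text_emb, dim=-1)
    submap_emb = F.normalize(submap_emb, dim=-1)
    logits = text_emb @ submap_emb.t() / temperature      # (B, B) similarities
    labels = torch.arange(logits.shape[0], device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels)         # text -> submap
                  + F.cross_entropy(logits.t(), labels))  # submap -> text
```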

FALCON: Fairness Learning via Contrastive Attention Approach to Continual Semantic Scene Understanding in Open World

  • paper_url: http://arxiv.org/abs/2311.15965
  • repo_url: None
  • paper_authors: Thanh-Dat Truong, Utsav Prabhu, Bhiksha Raj, Jackson Cothren, Khoa Luu
  • for: Addressing fairness in continual learning to improve both the performance and the fairness of continual semantic segmentation models.
  • methods: A Fairness Learning via Contrastive Attention approach, comprising a new Fairness Contrastive Clustering loss and an attention-based visual grammar method for modeling the background shift problem and unknown classes.
  • results: Experiments show the proposed approach achieves state-of-the-art (SOTA) performance across different continual learning settings.
    Abstract Continual Learning in semantic scene segmentation aims to continually learn new unseen classes in dynamic environments while maintaining previously learned knowledge. Prior studies focused on modeling the catastrophic forgetting and background shift challenges in continual learning. However, fairness, another major challenge that causes unfair predictions leading to low performance among major and minor classes, still needs to be well addressed. In addition, prior methods have yet to model the unknown classes well, thus resulting in producing non-discriminative features among unknown classes. This paper presents a novel Fairness Learning via Contrastive Attention Approach to continual learning in semantic scene understanding. In particular, we first introduce a new Fairness Contrastive Clustering loss to address the problems of catastrophic forgetting and fairness. Then, we propose an attention-based visual grammar approach to effectively model the background shift problem and unknown classes, producing better feature representations for different unknown classes. Through our experiments, our proposed approach achieves State-of-the-Art (SOTA) performance on different continual learning settings of three standard benchmarks, i.e., ADE20K, Cityscapes, and Pascal VOC. It promotes the fairness of the continual semantic segmentation model.

From Pixels to Titles: Video Game Identification by Screenshots using Convolutional Neural Networks

  • paper_url: http://arxiv.org/abs/2311.15963
  • repo_url: https://github.com/fbreve/videogame
  • paper_authors: Fabricio Breve
  • for: investigate video game identification through single screenshots
  • methods: utilize five convolutional neural network (CNN) architectures and ImageNet pre-trained weights
  • results: achieve high accuracy in identifying game titles from screenshots, with EfficientNetB3 reaching a peak accuracy of 76.36% and demonstrating reduced convergence epochs.
    Abstract This paper investigates video game identification through single screenshots, utilizing five convolutional neural network (CNN) architectures (MobileNet, DenseNet, EfficientNetB0, EfficientNetB2, and EfficientNetB3) across 22 home console systems, spanning from Atari 2600 to PlayStation 5. Confirming the hypothesis, CNNs autonomously extract image features, enabling the identification of game titles from screenshots without additional features. Using ImageNet pre-trained weights, EfficientNetB3 achieves the highest average accuracy (74.51%), while DenseNet169 excels in 14 of the 22 systems. Employing alternative initial weights from another screenshots dataset boosts accuracy for EfficientNetB2 and EfficientNetB3, with the latter reaching a peak accuracy of 76.36% and demonstrating reduced convergence epochs from 23.7 to 20.5 on average. Overall, the combination of optimal architecture and weights attains 77.67% accuracy, primarily led by EfficientNetB3 in 19 systems. These findings underscore the efficacy of CNNs in video game identification through screenshots.
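The transfer-learning recipe described above — an ImageNet-pretrained backbone with a fresh classification head trained on screenshots — can be sketched as follows. This is a minimal illustration rather than the authors' code (see the repo above); the title count, input size, and optimizer settings are placeholder assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import EfficientNetB3

NUM_TITLES = 1000  # placeholder; the real number of game titles differs

# ImageNet-pretrained backbone, global-average-pooled, without the top layer.
backbone = EfficientNetB3(weights="imagenet", include_top=False,
                          input_shape=(300, 300, 3), pooling="avg")
head = layers.Dense(NUM_TITLES, activation="softmax")(backbone.output)
model = Model(backbone.input, head)

model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=25)
```

Swapping the ImageNet weights for weights pre-trained on another screenshots dataset, as the paper does, only changes the `weights=` argument to a saved checkpoint path.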

Deceptive-Human: Prompt-to-NeRF 3D Human Generation with 3D-Consistent Synthetic Images

  • paper_url: http://arxiv.org/abs/2311.16499
  • repo_url: https://github.com/danielshkao/deceptivehuman
  • paper_authors: Shiu-hong Kao, Xinhang Liu, Yu-Wing Tai, Chi-Keung Tang
  • for: generating high-quality, controllable 3D human NeRF models by capitalizing on off-the-shelf control diffusion models such as ControlNet
  • methods: a progressive refinement technique that elevates reconstruction quality, using high-quality synthetic human images generated through ControlNet with a view-consistent loss
  • results: photorealistic, multi-view-consistent human NeRF models that readily extend to multimodal inputs such as text prompts and additional data (3D meshes, poses, and seed images)
    Abstract This paper presents Deceptive-Human, a novel Prompt-to-NeRF framework capitalizing on state-of-the-art control diffusion models (e.g., ControlNet) to generate a high-quality controllable 3D human NeRF. Different from direct 3D generative approaches, e.g., DreamFusion and DreamHuman, Deceptive-Human employs a progressive refinement technique to elevate the reconstruction quality. This is achieved by utilizing high-quality synthetic human images generated through ControlNet with a view-consistent loss. Our method is versatile and readily extensible, accommodating multimodal inputs, including a text prompt and additional data such as 3D meshes, poses, and seed images. The resulting 3D human NeRF model empowers the synthesis of highly photorealistic novel views from 360-degree perspectives. The key to our Deceptive-Human for hallucinating multi-view consistent synthetic human images lies in our progressive finetuning strategy. This strategy involves iteratively enhancing views using the provided multimodal inputs at each intermediate step to improve the human NeRF model. Within this iterative refinement process, view-dependent appearances are systematically eliminated to prevent interference with the underlying density estimation. Extensive qualitative and quantitative experimental comparisons show that our Deceptive-Human models achieve state-of-the-art application quality.

Unleashing the Power of Prompt-driven Nucleus Instance Segmentation

  • paper_url: http://arxiv.org/abs/2311.15939
  • repo_url: https://github.com/windygoo/promptnucseg
  • paper_authors: Zhongyi Shui, Yunlong Zhang, Kai Yao, Chenglu Zhu, Yuxuan Sun, Lin Yang
  • for: automatic nuclei instance segmentation in histology images
  • methods: point prompter and a SAM (Segment Anything Model) fine-tuned to output the corresponding mask of the cued nucleus, with negative prompts for overlapping nuclei
  • results: sets a new state-of-the-art performance on three challenging benchmarks
    Abstract Nuclear instance segmentation in histology images is crucial for a broad spectrum of clinical applications. Current prevailing nuclear instance segmentation algorithms rely on regression of nuclei contours, distance maps, watershed markers, or a proxy nuclear representation of star-convex polygons. Consequently, these methods necessitate sophisticated post-processing operations to distinguish nuclei instances, which are commonly acknowledged to be error-prone and parameter-sensitive. Recently, the Segment Anything Model (SAM) has attracted huge attention within the domain of medical image segmentation due to its impressive generalization ability and promptable property. Nevertheless, its potential on nuclear instance segmentation remains largely underexplored. In this paper, we present a novel prompt-driven framework that consists of a point prompter and a SAM for automatic nuclei instance segmentation. Specifically, the prompter learns to generate a unique point prompt for each nucleus while the SAM is fine-tuned to output the corresponding mask of the cued nucleus. Furthermore, we propose to add adjacent nuclei as negative prompts to promote the model's ability to recognize overlapping nuclei. Without bells and whistles, our proposed method sets a new state-of-the-art performance on three challenging benchmarks. Our code is available at \url{https://github.com/windygoo/PromptNucSeg}.
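As a concrete picture of the prompting scheme, the sketch below segments one nucleus with the off-the-shelf `segment_anything` package: a positive point cues the target nucleus and a negative point marks an adjacent one. The image, checkpoint path, and coordinates are placeholders; in the paper the points come from a learned prompter and SAM itself is fine-tuned.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

image = np.zeros((256, 256, 3), dtype=np.uint8)  # stand-in for an H&E tile

# Path to a downloaded SAM ViT-B checkpoint.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)
predictor.set_image(image)

point_coords = np.array([[120, 88], [135, 70]])  # target nucleus, neighbor
point_labels = np.array([1, 0])                  # 1 = positive, 0 = negative

masks, scores, _ = predictor.predict(point_coords=point_coords,
                                     point_labels=point_labels,
                                     multimask_output=False)
```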

Optimal Transport Aggregation for Visual Place Recognition

  • paper_url: http://arxiv.org/abs/2311.15937
  • repo_url: https://github.com/serizba/salad
  • paper_authors: Sergio Izquierdo, Javier Civera
  • for: Visual Place Recognition (VPR): matching a query image against references from an extensive database of images from different places, relying solely on visual cues
  • methods: features extracted by a deep backbone are aggregated into a global descriptor; the soft assignment of local features to clusters is reformulated as an optimal transport problem, with a 'dustbin' cluster that discards uninformative features to improve descriptor quality
  • results: outperforms single-stage baselines on public VPR datasets and even surpasses two-stage methods with costly re-ranking, while requiring far less training time; code and models are available on GitHub
    Abstract The task of Visual Place Recognition (VPR) aims to match a query image against references from an extensive database of images from different places, relying solely on visual cues. State-of-the-art pipelines focus on the aggregation of features extracted from a deep backbone, in order to form a global descriptor for each image. In this context, we introduce SALAD (Sinkhorn Algorithm for Locally Aggregated Descriptors), which reformulates NetVLAD's soft-assignment of local features to clusters as an optimal transport problem. In SALAD, we consider both feature-to-cluster and cluster-to-feature relations and we also introduce a 'dustbin' cluster, designed to selectively discard features deemed non-informative, enhancing the overall descriptor quality. Additionally, we leverage and fine-tune DINOv2 as a backbone, which provides enhanced description power for the local features, and dramatically reduces the required training time. As a result, our single-stage method not only surpasses single-stage baselines in public VPR datasets, but also surpasses two-stage methods that add a re-ranking with significantly higher cost. Code and models are available at https://github.com/serizba/salad.
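As a toy illustration of the aggregation idea, the snippet below runs log-domain Sinkhorn iterations over feature-to-cluster similarity logits augmented with a dustbin column. Uniform marginals and a zero dustbin score are simplifying assumptions, not SALAD's learned formulation.

```python
import torch

def sinkhorn_assignment(scores, n_iters=20):
    """Log-domain Sinkhorn with uniform marginals over rows and columns."""
    n, k = scores.shape
    log_r = -torch.log(torch.tensor(float(n)))  # each feature carries mass 1/n
    log_c = -torch.log(torch.tensor(float(k)))  # each cluster receives mass 1/k
    z = scores
    for _ in range(n_iters):
        z = z - z.logsumexp(dim=1, keepdim=True) + log_r  # row normalization
        z = z - z.logsumexp(dim=0, keepdim=True) + log_c  # column normalization
    return z.exp()

local_logits = torch.randn(256, 64)            # 256 local features, 64 clusters
dustbin = torch.zeros(256, 1)                  # learnable score in practice
transport = sinkhorn_assignment(torch.cat([local_logits, dustbin], dim=1))
assignment = transport[:, :-1]                 # drop the dustbin before pooling
```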

ADM-Loc: Actionness Distribution Modeling for Point-supervised Temporal Action Localization

  • paper_url: http://arxiv.org/abs/2311.15916
  • repo_url: None
  • paper_authors: Elahe Vahdani, Yingli Tian
  • for: improving point-supervised temporal action detection, where only one frame per action instance is annotated in the training set
  • methods: ADM-Loc, a point-supervised action localization framework based on actionness distribution modeling; it fits a composite of Gaussian and uniform distributions to the action classification signals, improving the alignment between generated action proposals and ground-truth action instances
  • results: new state-of-the-art performance among point-supervised methods on the THUMOS14 and ActivityNet-v1.2 datasets
    Abstract This paper addresses the challenge of point-supervised temporal action detection, in which only one frame per action instance is annotated in the training set. Self-training aims to provide supplementary supervision for the training process by generating pseudo-labels (action proposals) from a base model. However, most current methods generate action proposals by applying manually designed thresholds to action classification probabilities and treating adjacent snippets as independent entities. As a result, these methods struggle to generate complete action proposals, exhibit sensitivity to fluctuations in action classification scores, and generate redundant and overlapping action proposals. This paper proposes a novel framework termed ADM-Loc, which stands for Actionness Distribution Modeling for point-supervised action Localization. ADM-Loc generates action proposals by fitting a composite distribution, comprising both Gaussian and uniform distributions, to the action classification signals. This fitting process is tailored to each action class present in the video and is applied separately for each action instance, ensuring the distinctiveness of their distributions. ADM-Loc significantly enhances the alignment between the generated action proposals and ground-truth action instances and offers high-quality pseudo-labels for self-training. Moreover, to model action boundary snippets, it enforces consistency in action classification scores during training by employing Gaussian kernels, supervised with the proposed loss functions. ADM-Loc outperforms the state-of-the-art point-supervised methods on THUMOS14 and ActivityNet-v1.2 datasets.
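The distribution-fitting step can be pictured with the toy sketch below: estimate a Gaussian center and width over per-snippet actionness scores resting on a uniform background, and take a two-sigma interval as the proposal. The moment-based fit is a simplified stand-in for the paper's actual per-instance optimization.

```python
import numpy as np

def fit_actionness(scores):
    """Moment-based fit of a Gaussian (action) over a uniform (background) floor."""
    t = np.arange(len(scores))
    floor = np.median(scores)              # uniform background level
    w = np.clip(scores - floor, 0.0, None)
    w = w / (w.sum() + 1e-8)               # normalized foreground weights
    mu = float((w * t).sum())              # Gaussian center
    sigma = float(np.sqrt((w * (t - mu) ** 2).sum())) + 1e-3
    return mu, sigma, floor

snippet_scores = 0.05 + np.exp(-0.5 * ((np.arange(100) - 40) / 6.0) ** 2)
mu, sigma, floor = fit_actionness(snippet_scores)
proposal = (mu - 2 * sigma, mu + 2 * sigma)    # candidate action boundaries
print(proposal)  # roughly (28, 52) for this synthetic signal
```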

Computer Vision for Carriers: PATRIOT

  • paper_url: http://arxiv.org/abs/2311.15914
  • repo_url: None
  • paper_authors: Ari Goodman, Gurpreet Singh, James Hing, Ryan O’Shea
  • for: improving the deck-tracking process aboard aircraft carriers, where automation is seen as a critical way to increase sortie generation rates
  • methods: passive sensing and computer vision algorithms perform deck tracking without installing hardware-based location sensing such as Global Positioning System (GPS) sensors
  • results: PATRIOT (Panoramic Asset Tracking of Real-Time Information for the Ouija Tabletop), a prototype that tracks the positions of aircraft, personnel, and support equipment quickly and accurately, reducing manual workload, improving efficiency and safety, and collecting data to improve logistics
    Abstract Deck tracking performed on carriers currently involves a team of sailors manually identifying aircraft and updating a digital user interface called the Ouija Board. Improvements to the deck tracking process would result in increased Sortie Generation Rates, and therefore applying automation is seen as a critical method to improve deck tracking. However, the requirements on a carrier ship do not allow for the installation of hardware-based location sensing technologies like Global Positioning System (GPS) sensors. PATRIOT (Panoramic Asset Tracking of Real-Time Information for the Ouija Tabletop) is a research effort and proposed solution to performing deck tracking with passive sensing and without the need for GPS sensors. PATRIOT is a prototype system which takes existing camera feeds, calculates aircraft poses, and updates a virtual Ouija board interface with the current status of the assets. PATRIOT would allow for faster, more accurate, and less laborious asset tracking for aircraft, people, and support equipment. PATRIOT is anticipated to benefit the warfighter by reducing cognitive workload, reducing manning requirements, collecting data to improve logistics, and enabling an automation gateway for future efforts to improve efficiency and safety. The authors have developed and tested algorithms to perform pose estimations of assets in real-time including OpenPifPaf, High-Resolution Network (HRNet), HigherHRNet (HHRNet), Faster R-CNN, and in-house developed encoder-decoder network. The software was tested with synthetic and real-world data and was able to accurately extract the pose of assets. Fusion, tracking, and real-world generality are planned to be improved to ensure a successful transition to the fleet.
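Among the models the paper evaluates is Faster R-CNN; the snippet below shows the generic torchvision inference pattern for such a detector. The COCO-pretrained weights and the 0.7 confidence threshold are illustrative stand-ins, not PATRIOT's trained models.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

frame = torch.rand(3, 720, 1280)   # stand-in for one flight-deck camera frame
with torch.no_grad():
    detections = model([frame])[0]

keep = detections["scores"] > 0.7  # arbitrary confidence threshold
print(detections["boxes"][keep], detections["labels"][keep])
```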

LIFT OFF: LoRaWAN Installation and Fiducial Tracking Operations for the Flightline of the Future

  • paper_url: http://arxiv.org/abs/2311.15912
  • repo_url: None
  • paper_authors: Ari Goodman, Ryan O’Shea
  • for: providing real-time situational awareness of asset positions on the flightline so that missions can be completed efficiently and requirements satisfied
  • methods: a hybrid framework with a machine-vision component (cameras detecting AprilTag decals on aircraft), a geolocation component (GPS sensors on support equipment and helmets), and a LoRaWAN long-range wide area network for data transfer
  • results: a real-time updating map of all tracked assets, using GPS sensors for people and support equipment and visual fiducials for aircraft
    Abstract Real-time situational awareness for the location of assets is critical to ensure missions are completed efficiently and requirements are satisfied. In many commercial settings, the application of global positioning system (GPS) sensors is appropriate to achieve timely knowledge of the position of people and equipment. However, GPS sensors are not appropriate for all situations due to flight clearance and operations security concerns. LIFT OFF: LoRaWAN Installation and Fiducial Tracking Operations for the Flightline of the Future proposes a hybrid framework solution to achieve real-time situational awareness for people, support equipment, and aircraft positions regardless of the environment. This framework included a machine-vision component, which involved setting up cameras to detect AprilTag decals that were installed on the sides of aircraft. The framework included a geolocation sensor component, which involved installing GPS sensors on support equipment and helmets. The framework also included creating a long-range wide area network (LoRaWAN) to transfer data and developing a user interface to display the data. The framework was tested at Naval Air Station Oceana Flightline, the United States Naval Test Pilot School, and at Naval Air Warfare Center Aircraft Division Lakehurst. LIFT OFF successfully provided a real-time updating map of all tracked assets using GPS sensors for people and support equipment and with visual fiducials for aircraft. The trajectories of the assets were recorded for logistical analysis and playback. Future follow-on work is anticipated to apply the technology to other environments including carriers and amphibious assault ships in addition to the flightline.
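The machine-vision component rests on AprilTag detection; a minimal sketch with the `pupil_apriltags` package follows. The tag family, camera intrinsics, and tag size are assumptions, as the paper does not name a detection library.

```python
import cv2
from pupil_apriltags import Detector

fx = fy = 1000.0          # assumed camera intrinsics, in pixels
cx, cy = 640.0, 360.0
TAG_SIZE_M = 0.30         # assumed edge length of the printed tag, in meters

detector = Detector(families="tag36h11")
frame = cv2.imread("flightline_frame.png", cv2.IMREAD_GRAYSCALE)

detections = detector.detect(frame, estimate_tag_pose=True,
                             camera_params=(fx, fy, cx, cy),
                             tag_size=TAG_SIZE_M)
for det in detections:
    print(det.tag_id, det.pose_t.ravel())  # tag identity and 3D translation
```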

Enhancing Perceptual Quality in Video Super-Resolution through Temporally-Consistent Detail Synthesis using Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.15908
  • repo_url: https://github.com/claudiom4sir/stablevsr
  • paper_authors: Claudio Rota, Marco Buzzelli, Joost van de Weijer
  • for: enhancing the perceptual quality of video super-resolution (VSR) using diffusion models (DMs)
  • methods: a Temporal Conditioning Module (TCM) with Temporal Texture Guidance, plus a Frame-wise Bidirectional Sampling strategy
  • results: higher perceptual quality of upscaled videos than existing state-of-the-art VSR methods
    Abstract In this paper, we address the problem of video super-resolution (VSR) using Diffusion Models (DM), and present StableVSR. Our method significantly enhances the perceptual quality of upscaled videos by synthesizing realistic and temporally-consistent details. We turn a pre-trained DM for single image super-resolution into a VSR method by introducing the Temporal Conditioning Module (TCM). TCM uses Temporal Texture Guidance, which provides spatially-aligned and detail-rich texture information synthesized in adjacent frames. This guides the generative process of the current frame toward high-quality and temporally-consistent results. We introduce a Frame-wise Bidirectional Sampling strategy to encourage the use of information from past to future and vice-versa. This strategy improves the perceptual quality of the results and the temporal consistency across frames. We demonstrate the effectiveness of StableVSR in enhancing the perceptual quality of upscaled videos compared to existing state-of-the-art methods for VSR. The code is available at https://github.com/claudiom4sir/StableVSR.

MetaDefa: Meta-learning based on Domain Enhancement and Feature Alignment for Single Domain Generalization

  • paper_url: http://arxiv.org/abs/2311.15906
  • repo_url: None
  • paper_authors: Can Sun, Hao Zheng, Zhigang Hu, Liu Yang, Meiguang Zheng, Bo Xu
  • for: a meta-learning-based single domain generalization (SDG) technique that addresses the mismatch between source and augmented domain distributions and the difficulty of separating domain-invariant from domain-related features
  • methods: background substitution and visual corruption techniques generate diverse and effective augmented domains; a multi-channel feature alignment module based on class activation maps and class-agnostic activation maps extracts transferable knowledge by focusing on similar target regions across the source and augmented feature spaces while suppressing dissimilar regions
  • results: significant generalization advantages on unknown multiple target domains across two publicly available datasets
    Abstract Single domain generalization (SDG) based on meta-learning has emerged as an effective technique for solving the domain-shift problem. However, the inadequate match of data distributions between source and augmented domains, and the difficulty of separating domain-invariant features from domain-related features, make it hard for SDG models to achieve strong generalization. Therefore, a novel meta-learning method based on domain enhancement and feature alignment (MetaDefa) is proposed to improve the model's generalization performance. First, background substitution and visual corruption techniques are used to generate diverse and effective augmented domains. Then, a multi-channel feature alignment module based on class activation maps and class-agnostic activation maps is designed to effectively extract adequate transferable knowledge. In this module, domain-invariant features can be fully explored by focusing on similar target regions between the source and augmented domain feature spaces and suppressing the feature representation of non-similar target regions. Extensive experiments on two publicly available datasets show that MetaDefa has significant generalization performance advantages in unknown multiple target domains.

Stability-Informed Initialization of Neural Ordinary Differential Equations

  • paper_url: http://arxiv.org/abs/2311.15890
  • repo_url: https://github.com/westny/neural-stability
  • paper_authors: Theodor Westny, Arman Mohammadi, Daniel Jung, Erik Frisk
  • for: studying the training of Neural Ordinary Differential Equations (neural ODEs), in particular the interplay between numerical integration techniques, stability regions, step size, and initialization
  • methods: trains neural ODEs with numerical integration techniques and analyzes how the solver's stability region affects training and prediction performance
  • results: a stability-informed parameter initialization technique, validated on several learning benchmarks and industrial applications
    Abstract This paper addresses the training of Neural Ordinary Differential Equations (neural ODEs), and in particular explores the interplay between numerical integration techniques, stability regions, step size, and initialization techniques. It is shown how the choice of integration technique implicitly regularizes the learned model, and how the solver's corresponding stability region affects training and prediction performance. From this analysis, a stability-informed parameter initialization technique is introduced. The effectiveness of the initialization method is displayed across several learning benchmarks and industrial applications.
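As a rough sketch of the idea (not the paper's exact rule), one can scale the output layer of the ODE right-hand side so that the Jacobian spectrum, multiplied by the step size, starts inside the stability region of an explicit solver. The radius-2 heuristic and the `torchdiffeq` solver below are assumptions.

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint  # assumed solver library

class ODEFunc(nn.Module):
    def __init__(self, dim, step_size, stability_radius=2.0):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(),
                                 nn.Linear(64, dim))
        # Shrink the output layer so h * ||J|| starts inside the stability
        # region (radius ~2 for low-order explicit Runge-Kutta methods).
        with torch.no_grad():
            gain = stability_radius / (step_size * 64.0)
            self.net[-1].weight.mul_(min(gain, 1.0))

    def forward(self, t, x):
        return self.net(x)

func = ODEFunc(dim=2, step_size=0.1)
trajectory = odeint(func, torch.randn(8, 2),
                    torch.linspace(0.0, 1.0, 11), method="rk4")
```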

EVCap: Retrieval-Augmented Image Captioning with External Visual-Name Memory for Open-World Comprehension

  • paper_url: http://arxiv.org/abs/2311.15879
  • repo_url: None
  • paper_authors: Jiaxuan Li, Duc Minh Vo, Akihiro Sugimoto, Hideki Nakayama
  • for: enabling large language model (LLM)-based image captioning to describe objects not seen in the training data
  • methods: an external visual--name memory (EVCap) built from objects' visuals and names provides an updatable object knowledge base; retrieved object names prompt the LLM through a lightweight, fast-to-train model
  • results: superior performance across benchmarks, including out-of-domain and commonsense-violating data, compared with other models of equivalent size
    Abstract Large language models (LLMs)-based image captioning has the capability of describing objects not explicitly observed in training data; yet novel objects occur frequently, necessitating the requirement of sustaining up-to-date object knowledge for open-world comprehension. Instead of relying on large amounts of data and scaling up network parameters, we introduce a highly effective retrieval-augmented image captioning method that prompts LLMs with object names retrieved from External Visual--name memory (EVCap). We build ever-changing object knowledge memory using objects' visuals and names, enabling us to (i) update the memory at a minimal cost and (ii) effortlessly augment LLMs with retrieved object names utilizing a lightweight and fast-to-train model. Our model, which was trained only on the COCO dataset, can be adapted to out-domain data without additional fine-tuning or retraining. Our comprehensive experiments conducted on various benchmarks and synthetic commonsense-violating data demonstrate that EVCap, comprising solely 3.97M trainable parameters, exhibits superior performance compared to other methods of equivalent model size scale. Notably, it achieves competitive performance against specialist SOTAs with an enormous number of parameters. Our code is available at https://jiaxuan-li.github.io/EVCap.
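The retrieval step can be sketched as a nearest-neighbor lookup over the visual--name memory, with the retrieved names spliced into the LLM prompt. The embedding dimension, memory contents, and prompt wording below are all placeholder assumptions.

```python
import numpy as np

# Hypothetical external visual--name memory: normalized object embeddings
# paired with object names; EVCap builds these from objects' visuals.
memory_keys = np.random.randn(1000, 512).astype(np.float32)
memory_keys /= np.linalg.norm(memory_keys, axis=1, keepdims=True)
memory_names = [f"object_{i}" for i in range(1000)]

def retrieve_object_names(image_emb, k=3):
    image_emb = image_emb / np.linalg.norm(image_emb)
    sims = memory_keys @ image_emb               # cosine similarities
    top = np.argsort(-sims)[:k]
    return [memory_names[i] for i in top]

names = retrieve_object_names(np.random.randn(512).astype(np.float32))
prompt = f"This image may contain: {', '.join(names)}. Describe the image."
```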

InterControl: Generate Human Motion Interactions by Controlling Every Joint

  • paper_url: http://arxiv.org/abs/2311.15864
  • repo_url: https://github.com/zhenzhiwang/intercontrol
  • paper_authors: Zhenzhi Wang, Jingbo Wang, Dahua Lin, Bo Dai
  • for: modeling human motion interactions involving an arbitrary number of people
  • methods: diffusion models with joint-contact control signals; a Large Language Model (LLM) Planner translates interaction descriptions into contact plans, which drive spatially controllable motion generation
  • results: InterControl enables flexible spatial control of every joint in every person at any time and generates accurate, plausible interactions, validated by extensive experiments on the HumanML3D and KIT-ML datasets
    Abstract Text-conditioned human motion generation models have achieved great progress by introducing diffusion models and corresponding control signals. However, interactions between humans are still underexplored. To model interactions of an arbitrary number of humans, we define interactions as human joint pairs that are either in contact or separated, and leverage a Large Language Model (LLM) Planner to translate interaction descriptions into contact plans. Based on the contact plans, interaction generation can be achieved by spatially controllable motion generation methods that take joint contacts as spatial conditions. We present a novel approach named InterControl for flexible spatial control of every joint in every person at any time, leveraging a motion diffusion model trained only on single-person data. We incorporate a motion ControlNet to generate coherent and realistic motions given sparse spatial control signals, and a loss-guidance module to precisely align any joint to the desired position in a classifier-guidance manner via Inverse Kinematics (IK). Extensive experiments on the HumanML3D and KIT-ML datasets demonstrate its effectiveness in versatile joint control. We also collect data on joint contact pairs with LLMs to show InterControl's ability in human interaction generation.

JSSL: Joint Supervised and Self-supervised Learning for MRI Reconstruction

  • paper_url: http://arxiv.org/abs/2311.15856
  • repo_url: None
  • paper_authors: George Yiasemis, Nikita Moriakov, Clara I. Sánchez, Jan-Jakob Sonke, Jonas Teuwen
  • for: improving the reconstruction quality of accelerated MRI in clinical scenarios where motion prevents acquiring fully sampled k-space data
  • methods: deep learning-based MRI reconstruction trained with a novel scheme that combines self-supervised learning on subsampled target data with supervised learning on proxy datasets that have fully sampled k-space
  • results: substantial improvements over conventional self-supervised training for motion-limited acquisitions, together with a practical "rule-of-thumb" for selecting the training approach
    Abstract Magnetic Resonance Imaging represents an important diagnostic modality; however, its inherently slow acquisition process poses challenges in obtaining fully sampled k-space data under motion in clinical scenarios such as abdominal, cardiac, and prostate imaging. In the absence of fully sampled acquisitions, which can serve as ground truth data, training deep learning algorithms in a supervised manner to predict the underlying ground truth image becomes an impossible task. To address this limitation, self-supervised methods have emerged as a viable alternative, leveraging available subsampled k-space data to train deep learning networks for MRI reconstruction. Nevertheless, these self-supervised approaches often fall short when compared to supervised methodologies. In this paper, we introduce JSSL (Joint Supervised and Self-supervised Learning), a novel training approach for deep learning-based MRI reconstruction algorithms aimed at enhancing reconstruction quality in scenarios where target dataset(s) containing fully sampled k-space measurements are unavailable. Our proposed method operates by simultaneously training a model in a self-supervised learning setting, using subsampled data from the target dataset(s), and in a supervised learning manner, utilizing data from other datasets, referred to as proxy datasets, where fully sampled k-space data is accessible. To demonstrate the efficacy of JSSL, we utilized subsampled prostate parallel MRI measurements as the target dataset, while employing fully sampled brain and knee k-space acquisitions as proxy datasets. Our results showcase a substantial improvement over conventional self-supervised training methods, thereby underscoring the effectiveness of our joint approach. We provide a theoretical motivation for JSSL and establish a practical "rule-of-thumb" for selecting the most appropriate training approach for deep MRI reconstruction.
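A schematic JSSL training step, under stated simplifying assumptions (a single-coil FFT forward model, L1 losses, equal loss weighting — placeholder choices rather than the paper's configuration), might look like this:

```python
import torch

def to_kspace(image):
    return torch.fft.fft2(image)  # single-coil forward model, for illustration

def jssl_step(model, proxy_batch, target_batch, lam=1.0):
    # Supervised branch: the proxy dataset has fully sampled ground truth.
    pred = model(proxy_batch["kspace_sub"])
    loss_sup = (pred - proxy_batch["image_gt"]).abs().mean()

    # Self-supervised branch: split the measured target k-space into two
    # disjoint subsets, reconstruct from one, supervise with the other.
    recon = model(target_batch["kspace_part_a"])
    residual = (to_kspace(recon) * target_batch["mask_b"]
                - target_batch["kspace_part_b"])
    return loss_sup + lam * residual.abs().mean()

# Tiny demo with zero-filled reconstruction standing in for the network.
img = torch.rand(1, 64, 64)
k = to_kspace(img)
mask_a = (torch.rand(64, 64) > 0.5).float()
model = lambda ksp: torch.fft.ifft2(ksp).real
proxy = {"kspace_sub": k * mask_a, "image_gt": img}
target = {"kspace_part_a": k * mask_a, "kspace_part_b": k * (1 - mask_a),
          "mask_b": 1 - mask_a}
print(jssl_step(model, proxy, target))
```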

SiTH: Single-view Textured Human Reconstruction with Image-Conditioned Diffusion

  • paper_url: http://arxiv.org/abs/2311.15855
  • repo_url: None
  • paper_authors: Hsuan-I Ho, Jie Song, Otmar Hilliges
  • for: creating lifelike, fully detailed 3D humans from a single image
  • methods: a novel pipeline that integrates an image-conditioned diffusion model into a 3D mesh reconstruction workflow
  • results: realistic, fully textured 3D humans generated from a diverse range of unseen images, validated by extensive experiments and user studies
    Abstract A long-standing goal of 3D human reconstruction is to create lifelike and fully detailed 3D humans from single images. The main challenge lies in inferring unknown human shapes, clothing, and texture information in areas not visible in the images. To address this, we propose SiTH, a novel pipeline that uniquely integrates an image-conditioned diffusion model into a 3D mesh reconstruction workflow. At the core of our method lies the decomposition of the ill-posed single-view reconstruction problem into hallucination and reconstruction subproblems. For the former, we employ a powerful generative diffusion model to hallucinate back appearances from the input images. For the latter, we leverage skinned body meshes as guidance to recover full-body texture meshes from the input and back-view images. Our designs enable training of the pipeline with only about 500 3D human scans while maintaining its generality and robustness. Extensive experiments and user studies on two 3D reconstruction benchmarks demonstrated the efficacy of our method in generating realistic, fully textured 3D humans from a diverse range of unseen images.

Single-Model and Any-Modality for Video Object Tracking

  • paper_url: http://arxiv.org/abs/2311.15851
  • repo_url: https://github.com/zongwei97/untrack
  • paper_authors: Zongwei Wu, Jilai Zheng, Xiangxuan Ren, Florin-Alexandru Vasluianu, Chao Ma, Danda Pani Paudel, Luc Van Gool, Radu Timofte
  • for: multi-modal video object tracking
  • methods: Un-Track, a unified transformer-based tracker with a single parameter set that learns a common latent space across modalities through low-rank factorization and reconstruction, trained only on RGB-X pairs
  • results: +8.1 absolute F-score gain on DepthTrack at a cost of only +2.14 (over 21.50) GFLOPs and +6.6M (over 93M) parameters; outperforms both SOTA unified trackers and modality-specific fine-tuned counterparts on five benchmarks with different modalities
    Abstract In the realm of video object tracking, auxiliary modalities such as depth, thermal, or event data have emerged as valuable assets to complement the RGB trackers. In practice, most existing RGB trackers learn a single set of parameters to use them across datasets and applications. However, a similar single-model unification for multi-modality tracking presents several challenges. These challenges stem from the inherent heterogeneity of inputs -- each with modality-specific representations, the scarcity of multi-modal datasets, and the absence of all the modalities at all times. In this work, we introduce Un-Track, a \underline{Un}ified Tracker of a single set of parameters for any modality. To handle any modality, our method learns their common latent space through low-rank factorization and reconstruction techniques. More importantly, we use only the RGB-X pairs to learn the common latent space. This unique shared representation seamlessly binds all modalities together, enabling effective unification and accommodating any missing modality, all within a single transformer-based architecture and without the need for modality-specific fine-tuning. Our Un-Track achieves +8.1 absolute F-score gain, on the DepthTrack dataset, by introducing only +2.14 (over 21.50) GFLOPs with +6.6M (over 93M) parameters, through a simple yet efficient prompting strategy. Extensive comparisons on five benchmark datasets with different modalities show that Un-Track surpasses both SOTA unified trackers and modality-specific finetuned counterparts, validating our effectiveness and practicality.
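The low-rank factorization idea can be pictured as a small residual bottleneck that maps any auxiliary modality's tokens into the frozen RGB tracker's latent space. The dimensions and rank below are illustrative, not Un-Track's actual design values.

```python
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """Rank-r bottleneck projecting any modality into a shared latent space."""
    def __init__(self, dim=768, rank=16):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)  # factorize ...
        self.up = nn.Linear(rank, dim, bias=False)    # ... then reconstruct

    def forward(self, tokens):
        return tokens + self.up(self.down(tokens))    # residual correction

depth_tokens = torch.randn(1, 196, 768)  # e.g., patch embeddings of a depth frame
shared = LowRankAdapter()(depth_tokens)  # now consumable by the frozen tracker
```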

Cell Maps Representation For Lung Adenocarcinoma Growth Patterns Classification In Whole Slide Images

  • paper_url: http://arxiv.org/abs/2311.15847
  • repo_url: None
  • paper_authors: Arwa Al-Rubaian, Gozde N. Gunesli, Wajd A. Althakfi, Ayesha Azam, Nasir Rajpoot, Shan E Ahmed Raza
  • for: a machine learning pipeline for classifying lung adenocarcinoma growth patterns, improving diagnosis and prognosis
  • methods: Hematoxylin and Eosin (H&E) whole slide images are converted into cell maps, which are then classified by a convolutional neural network
  • results: strong generalizability, with roughly 30% higher accuracy on unseen test sets than current state-of-the-art approaches
    Abstract Lung adenocarcinoma is a morphologically heterogeneous disease, characterized by five primary histologic growth patterns. The quantity of these patterns can be related to tumor behavior and has a significant impact on patient prognosis. In this work, we propose a novel machine learning pipeline capable of classifying tissue tiles into one of the five patterns or as non-tumor, with an Area Under the Receiver Operating Characteristic Curve (AUCROC) score of 0.97. Our model's strength lies in its comprehensive consideration of cellular spatial patterns, where it first generates cell maps from Hematoxylin and Eosin (H&E) whole slide images (WSIs), which are then fed into a convolutional neural network classification model. Exploiting these cell maps provides the model with robust generalizability to new data, achieving approximately 30% higher accuracy on unseen test-sets compared to current state of the art approaches. The insights derived from our model can be used to predict prognosis, enhancing patient outcomes.
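The paper's exact cell-map encoding is not spelled out here, but one plausible construction is a per-cell-type raster of nucleus centroids that the classification CNN then consumes; the sketch below is a hypothetical illustration of that idea.

```python
import numpy as np

def build_cell_map(centroids, cell_types, shape=(512, 512), n_types=6):
    """One-hot raster: channel t marks the centroids of cells of type t."""
    cell_map = np.zeros((*shape, n_types), dtype=np.float32)
    for (y, x), t in zip(centroids, cell_types):
        cell_map[int(y), int(x), t] = 1.0
    return cell_map

# Hypothetical output of a nucleus detector on one tile: positions and types.
centroids = [(100.5, 240.2), (101.0, 250.8), (300.4, 12.9)]
cell_types = [0, 0, 3]
tile = build_cell_map(centroids, cell_types)  # input to the classification CNN
```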

Learning with Noisy Low-Cost MOS for Image Quality Assessment via Dual-Bias Calibration

  • paper_url: http://arxiv.org/abs/2311.15846
  • repo_url: None
  • paper_authors: Lei Wang, Qingbo Wu, Desen Yuan, King Ngi Ngan, Hongliang Li, Fanman Meng, Linfeng Xu
  • for: learning image quality assessment (IQA) models while reducing the labor cost of subjective annotation
  • methods: treats low-cost mean opinion scores (LC-MOS, requiring only a few or even a single opinion score per image) as noisy observations of labor-abundant MOS (LA-MOS), and jointly estimates the subjective and model biases via expectation-maximization with a gated dual-bias calibration (GDBC) module
  • results: robust to different bias rates and annotation numbers on four popular IQA datasets, significantly outperforming other learning-based IQA models when only LC-MOS is available, and comparable to models learned with LA-MOS
    Abstract Learning-based image quality assessment (IQA) models have obtained impressive performance with the help of reliable subjective quality labels, where the mean opinion score (MOS) is the most popular choice. However, in view of the subjective bias of individual annotators, the labor-abundant MOS (LA-MOS) typically requires a large collection of opinion scores from multiple annotators for each image, which significantly increases the learning cost. In this paper, we aim to learn robust IQA models from low-cost MOS (LC-MOS), which only requires very few opinion scores or even a single opinion score for each image. More specifically, we consider the LC-MOS as the noisy observation of LA-MOS and enforce the IQA model learned from LC-MOS to approach the unbiased estimation of LA-MOS. In this way, we represent the subjective bias between LC-MOS and LA-MOS, and the model bias between IQA predictions learned from LC-MOS and LA-MOS (i.e., dual-bias), as two latent variables with unknown parameters. By means of expectation-maximization based alternating optimization, we can jointly estimate the parameters of the dual-bias, which suppresses the misleading effect of noisy LC-MOS via a gated dual-bias calibration (GDBC) module. To the best of our knowledge, this is the first exploration of robust IQA model learning from noisy low-cost labels. Theoretical analysis and extensive experiments on four popular IQA datasets show that the proposed method is robust toward different bias rates and annotation numbers and significantly outperforms other learning-based IQA models when only LC-MOS is available. Furthermore, we achieve performance comparable to models learned with LA-MOS.

Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation

  • paper_url: http://arxiv.org/abs/2311.15841
  • repo_url: None
  • paper_authors: Siteng Huang, Biao Gong, Yutong Feng, Xi Chen, Yuqian Fu, Yu Liu, Donglin Wang
  • for: action customization in text-to-image (T2I) generation: learning a co-existing action from limited data and generalizing it to unseen humans or even animals
  • methods: Action-Disentangled Identifier (ADI), an inversion-based method that learns action-specific identifiers from exemplar images; it expands the semantic conditioning space with layer-wise identifier tokens, and blocks the inversion of action-agnostic features by extracting gradient invariance from constructed sample triples and masking updates of irrelevant channels
  • results: outperforms existing baselines on ActionBench, a benchmark of diverse actions with meticulously selected samples, in both quantitative and qualitative evaluations
    Abstract This study focuses on a novel task in text-to-image (T2I) generation, namely action customization. The objective of this task is to learn the co-existing action from limited data and generalize it to unseen humans or even animals. Experimental results show that existing subject-driven customization methods fail to learn the representative characteristics of actions and struggle in decoupling actions from context features, including appearance. To overcome the preference for low-level features and the entanglement of high-level features, we propose an inversion-based method Action-Disentangled Identifier (ADI) to learn action-specific identifiers from the exemplar images. ADI first expands the semantic conditioning space by introducing layer-wise identifier tokens, thereby increasing the representational richness while distributing the inversion across different features. Then, to block the inversion of action-agnostic features, ADI extracts the gradient invariance from the constructed sample triples and masks the updates of irrelevant channels. To comprehensively evaluate the task, we present an ActionBench that includes a variety of actions, each accompanied by meticulously selected samples. Both quantitative and qualitative results show that our ADI outperforms existing baselines in action-customized T2I generation.

Syn3DWound: A Synthetic Dataset for 3D Wound Bed Analysis

  • paper_url: http://arxiv.org/abs/2311.15836
  • repo_url: None
  • paper_authors: Léo Lebrat, Rodrigo Santa Cruz, Remi Chierchia, Yulia Arzhaeva, Mohammad Ali Armin, Joshua Goldsmith, Jeremy Oorloff, Prithvi Reddy, Chuong Nguyen, Lars Petersson, Michelle Barakat-Johnson, Georgina Luscombe, Clinton Fookes, Olivier Salvado, David Ahmedt-Aristizabal
  • for: wound management, particularly for bedridden patients and the elderly
  • methods: modern image analysis providing accurate and precise measurements of wounds
  • results: Syn3DWound, an open-source dataset of high-fidelity simulated wounds with 2D and 3D annotations, together with baseline methods and a benchmarking framework for automated 3D morphometry analysis and 2D/3D wound segmentation
    Abstract Wound management poses a significant challenge, particularly for bedridden patients and the elderly. Accurate diagnosis and healing monitoring can significantly benefit from modern image analysis, providing accurate and precise measurements of wounds. Despite several existing techniques, the shortage of expansive and diverse training datasets remains a significant obstacle to constructing machine learning-based frameworks. This paper introduces Syn3DWound, an open-source dataset of high-fidelity simulated wounds with 2D and 3D annotations. We propose baseline methods and a benchmarking framework for automated 3D morphometry analysis and 2D/3D wound segmentation.

A-JEPA: Joint-Embedding Predictive Architecture Can Listen

  • paper_url: http://arxiv.org/abs/2311.15830
  • repo_url: None
  • paper_authors: Zhengcong Fei, Mingyuan Fan, Junshi Huang
  • for: applying the masked-modeling principle behind large foundational vision models to audio
  • methods: Audio-based Joint-Embedding Predictive Architecture (A-JEPA), a simple extension that learns from audio spectrograms by self-supervised prediction in a latent space
  • results: curriculum, time-frequency-aware masking improves representations of the highly correlated local structure of spectrograms, and regularized masking during fine-tuning boosts contextual semantic understanding and robustness; built on Vision Transformers, A-JEPA is highly scalable and sets new state-of-the-art performance on multiple audio and speech classification tasks
    Abstract This paper shows that the masked-modeling principle driving the success of large foundational vision models can be effectively applied to audio by making predictions in a latent space. We introduce the Audio-based Joint-Embedding Predictive Architecture (A-JEPA), a simple extension method for self-supervised learning from the audio spectrum. Following the design of I-JEPA, our A-JEPA encodes visible audio spectrogram patches with a curriculum masking strategy via a context encoder, and predicts the representations of regions sampled at well-designed locations. The target representations of those regions are extracted by the exponential moving average of the context encoder, \emph{i.e.}, the target encoder, on the whole spectrogram. We find it beneficial to transfer random block masking into time-frequency-aware masking in a curriculum manner, considering the highly correlated local time-frequency structure of audio spectrograms. To enhance contextual semantic understanding and robustness, we fine-tune the encoder with regularized masking on target datasets, instead of input dropping or zeroing. Empirically, when built with a Vision Transformer structure, we find A-JEPA to be highly scalable; it sets new state-of-the-art performance on multiple audio and speech classification tasks, outperforming other recent models that use externally supervised pre-training.
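The training signal can be sketched as follows: a context encoder embeds visible patches, a predictor regresses latent targets produced by an exponential-moving-average copy of the encoder, and only the online branch receives gradients. The toy MLP encoder, mean pooling, and momentum value are placeholder assumptions.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(256, 512), nn.GELU(), nn.Linear(512, 512))
predictor = nn.Linear(512, 512)
target_encoder = copy.deepcopy(encoder)   # updated by EMA, not by gradients
for p in target_encoder.parameters():
    p.requires_grad_(False)

def jepa_loss(visible_patches, masked_patches):
    context = encoder(visible_patches).mean(dim=1)           # (B, 512)
    with torch.no_grad():
        target = target_encoder(masked_patches).mean(dim=1)  # latent targets
    return F.mse_loss(predictor(context), target)

@torch.no_grad()
def ema_update(momentum=0.999):
    for p, tp in zip(encoder.parameters(), target_encoder.parameters()):
        tp.mul_(momentum).add_(p, alpha=1.0 - momentum)

loss = jepa_loss(torch.randn(4, 10, 256), torch.randn(4, 4, 256))
loss.backward()
ema_update()
```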

LLMGA: Multimodal Large Language Model based Generation Assistant

  • paper_url: http://arxiv.org/abs/2311.16500
  • repo_url: https://github.com/Zj-BinXia/LLMGA
  • paper_authors: Bin Xia, Shiyin Wang, Yingfan Tao, Yitong Wang, Jiaya Jia
  • for: a Multimodal Large Language Model-based Generation Assistant (LLMGA) that leverages the knowledge and reasoning abilities of LLMs to give users precise control over image generation and editing
  • methods: a two-stage training scheme: the MLLM is first trained to grasp the properties of image generation and editing and to produce detailed generation prompts, and Stable Diffusion (SD) is then optimized to align with those prompts; a reference-based restoration network alleviates texture, brightness, and contrast disparities between generated and preserved regions during editing
  • results: promising generative capabilities that enable wider applications in an interactive manner
    Abstract In this paper, we introduce a Multimodal Large Language Model-based Generation Assistant (LLMGA), leveraging the vast reservoir of knowledge and proficiency in reasoning, comprehension, and response inherent in Large Language Models (LLMs) to assist users in image generation and editing. Diverging from existing approaches where Multimodal Large Language Models (MLLMs) generate fixed-size embeddings to control Stable Diffusion (SD), our LLMGA provides a detailed language generation prompt for precise control over SD. This not only augments LLM context understanding but also reduces noise in generation prompts, yields images with more intricate and precise content, and elevates the interpretability of the network. To this end, we curate a comprehensive dataset comprising prompt refinement, similar image generation, inpainting $\&$ outpainting, and visual question answering. Moreover, we propose a two-stage training scheme. In the first stage, we train the MLLM to grasp the properties of image generation and editing, enabling it to generate detailed prompts. In the second stage, we optimize SD to align with the MLLM's generation prompts. Additionally, we propose a reference-based restoration network to alleviate texture, brightness, and contrast disparities between generated and preserved regions during image editing. Extensive results show that LLMGA has promising generative capabilities and can enable wider applications in an interactive manner.

C-SAW: Self-Supervised Prompt Learning for Image Generalization in Remote Sensing

  • paper_url: http://arxiv.org/abs/2311.15812
  • repo_url: None
  • paper_authors: Avigyan Bhattacharya, Mainak Singha, Ankit Jha, Biplab Banerjee
  • for: domain and class generalization in analyzing optical remote sensing images with the large-scale pre-trained vision-language model (VLM) CLIP
  • methods: C-SAW, which complements CLIP with a self-supervised loss in the visual space and a novel prompt-learning technique that emphasizes both visual domain and content-specific features; this targets CLIP's difficulty in capturing contextual image information, which is especially severe in remote sensing imagery, where land-cover classes exhibit well-defined contextual appearances
  • results: superior performance across multiple remote sensing benchmarks and different generalization tasks
    Abstract We focus on domain and class generalization problems in analyzing optical remote sensing images, using the large-scale pre-trained vision-language model (VLM), CLIP. While contrastively trained VLMs show impressive zero-shot generalization performance, their effectiveness is limited when dealing with diverse domains during training and testing. Existing prompt learning techniques overlook the importance of incorporating domain and content information into the prompts, which results in a drop in performance while dealing with such multi-domain data. To address these challenges, we propose a solution that ensures domain-invariant prompt learning while enhancing the expressiveness of visual features. We observe that CLIP's vision encoder struggles to identify contextual image information, particularly when image patches are jumbled up. This issue is especially severe in optical remote sensing images, where land-cover classes exhibit well-defined contextual appearances. To this end, we introduce C-SAW, a method that complements CLIP with a self-supervised loss in the visual space and a novel prompt learning technique that emphasizes both visual domain and content-specific features. We keep the CLIP backbone frozen and introduce a small set of projectors for both the CLIP encoders to train C-SAW contrastively. Experimental results demonstrate the superiority of C-SAW across multiple remote sensing benchmarks and different generalization tasks.

PIPE : Parallelized Inference Through Post-Training Quantization Ensembling of Residual Expansions

  • paper_url: http://arxiv.org/abs/2311.15806
  • repo_url: None
  • paper_authors: Edouard Yvinec, Arnaud Dapogny, Kevin Bailly
  • for: reducing the high inference cost of deep neural networks (DNNs) in computer vision and natural language processing through quantization
  • methods: PIPE, a quantization method that leverages residual error expansion together with group sparsity and an ensemble approximation for better parallelization
  • results: superior performance on every benchmarked application (from vision to NLP tasks), architecture (ConvNets, transformers), and bit-width (from int8 to ternary quantization)
    Abstract Deep neural networks (DNNs) are ubiquitous in computer vision and natural language processing, but suffer from high inference cost. This problem can be addressed by quantization, which consists in converting floating-point operations into a lower bit-width format. With growing concerns about privacy rights, we focus our efforts on data-free methods. However, such techniques suffer from their lack of adaptability to the target devices, as hardware typically only supports specific bit widths. Thus, to adapt to a variety of devices, a quantization method shall be flexible enough to find good accuracy v.s. speed trade-offs for every bit width and target device. To achieve this, we propose PIPE, a quantization method that leverages residual error expansion, along with group sparsity and an ensemble approximation for better parallelization. PIPE is backed by strong theoretical guarantees and achieves superior performance on every benchmarked application (from vision to NLP tasks), architecture (ConvNets, transformers) and bit-width (from int8 to ternary quantization).
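Residual error expansion itself is easy to picture: quantize the weights, then quantize the leftover error, so the expansions form an ensemble of branches that can be evaluated in parallel. The sketch below uses simple symmetric per-tensor quantization, a simplification of PIPE's full method.

```python
import numpy as np

def quantize(w, bits=8):
    """Symmetric per-tensor uniform quantization (simplified)."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

w = np.random.randn(128, 128).astype(np.float32)
q1 = quantize(w)       # first-order quantized weights
q2 = quantize(w - q1)  # second expansion: quantize the residual error

# The branches q1 and q2 can run in parallel; their sum tightens the fit.
print(np.abs(w - q1).mean(), np.abs(w - (q1 + q2)).mean())
```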

SOAC: Spatio-Temporal Overlap-Aware Multi-Sensor Calibration using Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2311.15803
  • repo_url: None
  • paper_authors: Quentin Herau, Nathan Piasco, Moussab Bennehar, Luis Roldão, Dzmitry Tsishkou, Cyrille Migniot, Pascal Vasseur, Cédric Demonceaux
  • for: robust and accurate spatio-temporal calibration of multiple sensors with different modalities, which is essential for the operational precision and stability of autonomous vehicles
  • methods: Neural Radiance Fields (NeRF) represent the different sensor modalities in a common volumetric representation; a partitioning approach based on each sensor's visible part of the scene formulates the calibration problem using only the overlapping areas, making calibration more robust, accurate, and less prone to failure
  • results: better accuracy and robustness than existing methods, validated on multiple established driving datasets with outdoor urban scenes
    Abstract In rapidly-evolving domains such as autonomous driving, the use of multiple sensors with different modalities is crucial to ensure high operational precision and stability. To correctly exploit the provided information by each sensor in a single common frame, it is essential for these sensors to be accurately calibrated. In this paper, we leverage the ability of Neural Radiance Fields (NeRF) to represent different sensors modalities in a common volumetric representation to achieve robust and accurate spatio-temporal sensor calibration. By designing a partitioning approach based on the visible part of the scene for each sensor, we formulate the calibration problem using only the overlapping areas. This strategy results in a more robust and accurate calibration that is less prone to failure. We demonstrate that our approach works on outdoor urban scenes by validating it on multiple established driving datasets. Results show that our method is able to get better accuracy and robustness compared to existing methods.

Mip-Splatting: Alias-free 3D Gaussian Splatting

  • paper_url: http://arxiv.org/abs/2311.16493
  • repo_url: https://github.com/autonomousvision/mip-splatting
  • paper_authors: Zehao Yu, Anpei Chen, Binbin Huang, Torsten Sattler, Andreas Geiger
  • for: Improving the fidelity and efficiency of 3D Gaussian Splatting across sampling rates.
  • methods: The method introduces a 3D smoothing filter that constrains the size of the 3D Gaussian primitives according to the maximal sampling frequency induced by the input views, and replaces the 2D dilation filter with a 2D Mip filter to eliminate aliasing and dilation artifacts (a sketch of the 3D smoothing filter follows the abstract below).
  • results: The approach is validated in a variety of scenarios, including training on single-scale images and testing at multiple scales.
    Abstract Recently, 3D Gaussian Splatting has demonstrated impressive novel view synthesis results, reaching high fidelity and efficiency. However, strong artifacts can be observed when changing the sampling rate, e.g., by changing focal length or camera distance. We find that the source for this phenomenon can be attributed to the lack of 3D frequency constraints and the usage of a 2D dilation filter. To address this problem, we introduce a 3D smoothing filter which constrains the size of the 3D Gaussian primitives based on the maximal sampling frequency induced by the input views, eliminating high-frequency artifacts when zooming in. Moreover, replacing 2D dilation with a 2D Mip filter, which simulates a 2D box filter, effectively mitigates aliasing and dilation issues. Our evaluation, including scenarios such as training on single-scale images and testing on multiple scales, validates the effectiveness of our approach.
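
The 3D smoothing filter described above bounds each Gaussian's extent by the maximal sampling frequency any training view induces at its position: under a pinhole model, a pixel at depth d and focal length f spans roughly d/f world units, so the sampling rate is f/d. The sketch below is a minimal rendering of that constraint; the constant `k` and the function names are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def max_sampling_rate(points, cam_centers, focals):
    """Per-point maximal sampling frequency over the training views:
    f / depth is the number of samples per world unit at that point."""
    rates = np.zeros(len(points))
    for c, f in zip(cam_centers, focals):
        depth = np.linalg.norm(points - c, axis=1)
        rates = np.maximum(rates, f / depth)
    return rates

def smooth_3d_scales(scales, rates, k=0.2):
    """3D smoothing: convolving a Gaussian with an isotropic Gaussian of
    std sigma_low adds sigma_low**2 to its variance, so every primitive's
    footprint stays above the Nyquist interval 1 / (2 * rate)."""
    sigma_low = k / (2.0 * rates)              # sub-Nyquist size floor
    return np.sqrt(scales ** 2 + sigma_low[:, None] ** 2)

# usage sketch: clamped = smooth_3d_scales(gaussian_scales, rates)
```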

Source-Free Domain Adaptation with Frozen Multimodal Foundation Model

  • paper_url: http://arxiv.org/abs/2311.16510
  • repo_url: None
  • paper_authors: Song Tang, Wenxin Su, Mao Ye, Xiatian Zhu
  • for: Adapting a source model to a target domain using only unlabeled target training data and a source model pre-trained on a supervised source domain (Source-Free Domain Adaptation).
  • methods: Instead of relying solely on pseudo labeling and/or auxiliary supervision, the paper explores off-the-shelf vision-language (ViL) multimodal models (e.g., CLIP) and proposes Distilling multimodal Foundation model (DIFO), which alternates between two steps during adaptation: (i) customizing the ViL model by maximizing its mutual information with the target model in a prompt-learning manner, and (ii) distilling the knowledge of this customized ViL model into the target model. Two effective regularization terms, most-likely category encouragement and predictive consistency, are further introduced.
  • results: Experiments show that DIFO significantly outperforms state-of-the-art alternatives.
    Abstract Source-Free Domain Adaptation (SFDA) aims to adapt a source model for a target domain, with only access to unlabeled target training data and the source model pre-trained on a supervised source domain. Relying on pseudo labeling and/or auxiliary supervision, conventional methods are inevitably error-prone. To mitigate this limitation, in this work we for the first time explore the potentials of off-the-shelf vision-language (ViL) multimodal models (e.g., CLIP) with rich whilst heterogeneous knowledge. We find that directly applying the ViL model to the target domain in a zero-shot fashion is unsatisfactory, as it is not specialized for this particular task but largely generic. To make it task specific, we propose a novel Distilling multimodal Foundation model (DIFO) approach. Specifically, DIFO alternates between two steps during adaptation: (i) Customizing the ViL model by maximizing the mutual information with the target model in a prompt learning manner, (ii) Distilling the knowledge of this customized ViL model to the target model. For more fine-grained and reliable distillation, we further introduce two effective regularization terms, namely most-likely category encouragement and predictive consistency. Extensive experiments show that DIFO significantly outperforms the state-of-the-art alternatives. Our source code will be released.

Relationship between Model Compression and Adversarial Robustness: A Review of Current Evidence

  • paper_url: http://arxiv.org/abs/2311.15782
  • repo_url: None
  • paper_authors: Svetlana Pavlitska, Hannes Grolig, J. Marius Zöllner
  • for: Reviewing the current evidence on how model compression interacts with the adversarial robustness of deep networks, given that increasing model capacity is a known way to enhance robustness.
  • methods: The review covers model compression techniques such as pruning and quantization, which reduce network size while preserving accuracy.
  • results: Several recent studies have investigated the relationship between model compression and adversarial robustness, with some experiments reporting contradictory results; the paper summarizes the available evidence and discusses possible explanations.
    Abstract Increasing the model capacity is a known approach to enhance the adversarial robustness of deep learning networks. On the other hand, various model compression techniques, including pruning and quantization, can reduce the size of the network while preserving its accuracy. Several recent studies have addressed the relationship between model compression and adversarial robustness, while some experiments have reported contradictory results. This work summarizes available evidence and discusses possible explanations for the observed effects.

Stable Segment Anything Model

  • paper_url: http://arxiv.org/abs/2311.15776
  • repo_url: https://github.com/fanq15/stable-sam
  • paper_authors: Qi Fan, Xin Tao, Lei Ke, Mingqiao Ye, Yuan Zhang, Pengfei Wan, Zhongyuan Wang, Yu-Wing Tai, Chi-Keung Tang
  • for: Improving the accuracy and stability of the Segment Anything Model (SAM) when given low-quality prompts.
  • methods: An analysis of SAM's segmentation stability shows that imprecise boxes and insufficient points bias the mask decoder towards the background or specific object parts. The paper therefore introduces learnable deformable offsets that let SAM adaptively shift its feature sampling locations towards the prompted target in a data-driven manner, leaving the original architecture and weights unchanged (a sketch of the deformable sampling follows the abstract below).
  • results: Extensive experiments on multiple datasets show improved segmentation accuracy and stability across a wide range of prompt qualities, while retaining SAM's promptable segmentation efficiency and generality, with minimal learnable parameters (0.08 M) and fast adaptation (1 training epoch).
    Abstract The Segment Anything Model (SAM) achieves remarkable promptable segmentation given high-quality prompts which, however, often require good skills to specify. To make SAM robust to casual prompts, this paper presents the first comprehensive analysis on SAM's segmentation stability across a diverse spectrum of prompt qualities, notably imprecise bounding boxes and insufficient points. Our key finding reveals that given such low-quality prompts, SAM's mask decoder tends to activate image features that are biased towards the background or confined to specific object parts. To mitigate this issue, our key idea consists of adjusting the sampling locations of image features using learnable deformable offsets, while the original SAM model architecture and weights remain unchanged. Consequently, our deformable sampling plugin (DSP) enables SAM to adaptively shift attention to the prompted target regions in a data-driven manner, facilitated by our effective robust training strategy (RTS). During inference, a dynamic routing plugin (DRP) is proposed that toggles SAM between the deformable and regular grid sampling modes, conditioned on the input prompt quality. Thus, our solution, termed Stable-SAM, is unique in focusing solely on adjusting feature sampling locations, which offers several advantages: 1) improving SAM's segmentation stability across a wide range of prompt qualities, while 2) retaining SAM's powerful promptable segmentation efficiency and generality, with 3) minimal learnable parameters (0.08 M) and fast adaptation (by 1 training epoch). Extensive experiments across multiple datasets validate the effectiveness and advantages of our approach, underscoring Stable-SAM as a more robust solution for segmenting anything. Codes will be released upon acceptance.
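
The abstract's key mechanism is resampling image features at learnable deformable offsets while the pretrained weights stay frozen. A minimal PyTorch sketch of such a plugin is below; the single-conv offset head and its zero initialization (so the plugin starts as identity sampling) are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableSamplingPlugin(nn.Module):
    """A small head predicts per-location 2D offsets and the feature map
    is re-read at the shifted positions; backbone weights stay frozen."""
    def __init__(self, channels):
        super().__init__()
        self.offset_head = nn.Conv2d(channels, 2, kernel_size=3, padding=1)
        nn.init.zeros_(self.offset_head.weight)  # start as identity sampling
        nn.init.zeros_(self.offset_head.bias)

    def forward(self, feats):                     # feats: (B, C, H, W)
        b, _, h, w = feats.shape
        offsets = self.offset_head(feats).permute(0, 2, 3, 1)  # (B, H, W, 2)
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=feats.device),
            torch.linspace(-1, 1, w, device=feats.device),
            indexing="ij")
        base = torch.stack((xs, ys), dim=-1).expand(b, -1, -1, -1)
        return F.grid_sample(feats, base + offsets, align_corners=True)
```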

Check, Locate, Rectify: A Training-Free Layout Calibration System for Text-to-Image Generation

  • paper_url: http://arxiv.org/abs/2311.15773
  • repo_url: https://github.com/SimM-T2I/SimM
  • paper_authors: Biao Gong, Siteng Huang, Yutong Feng, Shiwei Zhang, Yuyuan Li, Yu Liu
  • for: Proposing a training-free layout calibration system so that text-to-image generation better understands and satisfies the layout requirements in textual prompts.
  • methods: The system follows a "check-locate-rectify" pipeline: it first analyses the prompt to derive the target layout and compares it with intermediate outputs to automatically detect errors, then rectifies them by moving the located activations and making intra- and inter-map adjustments, all with negligible computational overhead.
  • results: A benchmark named SimMBench is introduced to cover a range of layout requirements, and both quantitative and qualitative results demonstrate that the proposed SimM effectively calibrates layout inconsistencies.
    Abstract Diffusion models have recently achieved remarkable progress in generating realistic images. However, challenges remain in accurately understanding and synthesizing the layout requirements in the textual prompts. To align the generated image with layout instructions, we present a training-free layout calibration system SimM that intervenes in the generative process on the fly during inference time. Specifically, following a "check-locate-rectify" pipeline, the system first analyses the prompt to generate the target layout and compares it with the intermediate outputs to automatically detect errors. Then, by moving the located activations and making intra- and inter-map adjustments, the rectification process can be performed with negligible computational overhead. To evaluate SimM over a range of layout requirements, we present a benchmark SimMBench that compensates for the lack of superlative spatial relations in existing datasets. And both quantitative and qualitative results demonstrate the effectiveness of the proposed SimM in calibrating the layout inconsistencies.

Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning

  • paper_url: http://arxiv.org/abs/2311.15769
  • repo_url: https://github.com/HJYao00/Side4Video
  • paper_authors: Huanjin Yao, Wenhao Wu, Zhiheng Li
  • for: Making more effective use of large pre-trained vision models, especially for video understanding tasks.
  • methods: The paper proposes Side4Video, which attaches a lightweight spatial-temporal side network to the frozen image model; the side network consumes multi-level spatial features and avoids backpropagation through the heavy pre-trained backbone, cutting training memory (a sketch of the side-network idea follows the abstract below).
  • results: The method achieves strong results on multiple video datasets, notably Something-Something V1&V2 (67.3% & 74.6%), Kinetics-400 (88.6%), MSR-VTT (52.3%), MSVD (56.1%), and VATEX (68.8%).
    Abstract Large pre-trained vision models achieve impressive success in computer vision. However, fully fine-tuning large models for downstream tasks, particularly in video understanding, can be prohibitively computationally expensive. Recent studies turn their focus towards efficient image-to-video transfer learning. Nevertheless, existing efficient fine-tuning methods lack attention to training memory usage and exploration of transferring a larger model to the video domain. In this paper, we present a novel Spatial-Temporal Side Network for memory-efficient fine-tuning large image models to video understanding, named Side4Video. Specifically, we introduce a lightweight spatial-temporal side network attached to the frozen vision model, which avoids the backpropagation through the heavy pre-trained model and utilizes multi-level spatial features from the original image model. Extremely memory-efficient architecture enables our method to reduce 75% memory usage than previous adapter-based methods. In this way, we can transfer a huge ViT-E (4.4B) for video understanding tasks which is 14x larger than ViT-L (304M). Our approach achieves remarkable performance on various video datasets across unimodal and cross-modal tasks (i.e., action recognition and text-video retrieval), especially in Something-Something V1&V2 (67.3% & 74.6%), Kinetics-400 (88.6%), MSR-VTT (52.3%), MSVD (56.1%) and VATEX (68.8%). We release our code at https://github.com/HJYao00/Side4Video.
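
A minimal PyTorch sketch of the side-network idea from the abstract: a small trainable branch reads multi-level features from the frozen image backbone (detached, so no gradients flow into it) and adds temporal fusion on top. The dimensions, the 3D-conv temporal module, and the classification head are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SpatialTemporalSideNetwork(nn.Module):
    """Lightweight side branch over frozen multi-level ViT features."""
    def __init__(self, feat_dim=768, side_dim=128, num_levels=4, num_classes=174):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(feat_dim, side_dim)
                                  for _ in range(num_levels))
        self.temporal = nn.Conv3d(side_dim, side_dim, (3, 1, 1), padding=(1, 0, 0))
        self.head = nn.Linear(side_dim, num_classes)

    def forward(self, level_feats):   # list of (B, T, H*W, C) frozen features
        x = 0
        for p, f in zip(self.proj, level_feats):
            x = x + p(f.detach())      # detach: no backprop into the backbone
        b, t, n, c = x.shape
        h = w = int(n ** 0.5)
        x = x.permute(0, 3, 1, 2).reshape(b, c, t, h, w)
        x = self.temporal(x).mean(dim=(2, 3, 4))   # temporal fusion + pooling
        return self.head(x)
```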

PyNanospacing: TEM image processing tool for strain analysis and visualization

  • paper_url: http://arxiv.org/abs/2311.15751
  • repo_url: None
  • paper_authors: Mehmet Ali Sarsil, Mubashir Mansoor, Mert Saracoglu, Servet Timur, Mustafa Urgen, Onur Ergen
  • for: Developing a Python tool for TEM image processing that handles a wide range of materials, including nanoparticles, 2D materials, pure crystals, and solid solutions.
  • methods: The tool converts local differences in interplanar spacings into contour maps, giving a visual representation of lattice compression and expansion (a sketch of the underlying idea follows the abstract below).
  • results: The resulting strain contour maps support in-depth, atomic-level exploration of strain engineering and its connection to material properties such as band gap, mechanical moduli, color, phonon and electronic density of states, and catalytic and surface behaviour.
    Abstract The diverse spectrum of material characteristics including band gap, mechanical moduli, color, phonon and electronic density of states, along with catalytic and surface properties are intricately intertwined with the atomic structure and the corresponding interatomic bond-lengths. This interconnection extends to the manifestation of interplanar spacings within a crystalline lattice. Analysis of these interplanar spacings and the comprehension of any deviations, whether it be lattice compression or expansion, commonly referred to as strain, hold paramount significance in unraveling various unknowns within the field. Transmission Electron Microscopy (TEM) is widely used to capture atomic-scale ordering, facilitating direct investigation of interplanar spacings. However, creating critical contour maps for visualizing and interpreting lattice stresses in TEM images remains a challenging task. Here we developed a Python code for TEM image processing that can handle a wide range of materials including nanoparticles, 2D materials, pure crystals and solid solutions. This algorithm converts local differences in interplanar spacings into contour maps allowing for a visual representation of lattice expansion and compression. The tool is very generic and can significantly aid in analyzing material properties using TEM images, allowing for a more in-depth exploration of the underlying science behind strain engineering via strain contour maps at the atomic level.
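
The abstract describes converting local interplanar-spacing differences into strain contour maps. Below is a minimal sketch of one way to do this, using a sliding-window FFT to find the dominant lattice frequency in each patch; the window size, DC masking, and strain definition are assumptions, not the tool's actual algorithm.

```python
import numpy as np
import matplotlib.pyplot as plt

def local_spacing(image, win=64, step=32):
    """Estimate local interplanar spacing (in pixels) from the dominant
    FFT peak inside sliding windows: spacing = win / peak frequency."""
    h, w = image.shape
    out = []
    for y in range(0, h - win, step):
        row = []
        for x in range(0, w - win, step):
            spec = np.abs(np.fft.fftshift(np.fft.fft2(image[y:y+win, x:x+win])))
            spec[win//2 - 2:win//2 + 3, win//2 - 2:win//2 + 3] = 0  # drop DC
            ky, kx = np.unravel_index(spec.argmax(), spec.shape)
            freq = np.hypot(ky - win // 2, kx - win // 2)
            row.append(win / max(freq, 1e-6))
        out.append(row)
    return np.array(out)

def strain_contours(image, d0, **kw):
    """Strain relative to a reference spacing d0: positive = expansion."""
    d = local_spacing(image, **kw)
    plt.contourf((d - d0) / d0, levels=20, cmap="RdBu_r")
    plt.colorbar(label="lattice strain")
    plt.show()
```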

SIRAN: Sinkhorn Distance Regularized Adversarial Network for DEM Super-resolution using Discriminative Spatial Self-attention

  • paper_url: http://arxiv.org/abs/2311.16490
  • repo_url: None
  • paper_authors: Subhajit Paul, Ashutosh Gupta
  • for: This paper aims to generate high-resolution Digital Elevation Models (DEMs) using high-resolution multi-spectral (MX) satellite imagery with the assistance of adversarial learning.
  • methods: The proposed method utilizes polarized self-attention of discriminator spatial maps and a Densely connected Multi-Residual Block (DMRB) module to improve the efficiency of gradient flow. Additionally, the objective function combines Sinkhorn distance with a traditional GAN loss to address vanishing gradient issues and improve numerical convergence (a Sinkhorn sketch follows the abstract below).
  • results: The proposed method performs better than other learning-based state-of-the-art methods, mitigating vanishing gradient issues and improving numerical convergence. The authors report both qualitative and quantitative comparisons against available state-of-the-art methods and generate several high-resolution DEMs covering terrains with diverse signatures.
    Abstract Digital Elevation Model (DEM) is an essential aspect in the remote sensing domain to analyze and explore different applications related to surface elevation information. In this study, we intend to address the generation of high-resolution DEMs using high-resolution multi-spectral (MX) satellite imagery by incorporating adversarial learning. To promptly regulate this process, we utilize the notion of polarized self-attention of discriminator spatial maps as well as introduce a Densely connected Multi-Residual Block (DMRB) module to assist in efficient gradient flow. Further, we present an objective function related to optimizing Sinkhorn distance with traditional GAN to improve the stability of adversarial learning. In this regard, we provide both theoretical and empirical substantiation of better performance in terms of vanishing gradient issues and numerical convergence. We demonstrate both qualitative and quantitative outcomes with available state-of-the-art methods. Based on our experiments on DEM datasets of Shuttle Radar Topographic Mission (SRTM) and Cartosat-1, we show that the proposed model performs preferably against other learning-based state-of-the-art methods. We also generate and visualize several high-resolution DEMs covering terrains with diverse signatures to show the performance of our model.
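
Since the objective above leans on a Sinkhorn-regularized distance, here is a minimal log-domain Sinkhorn sketch in PyTorch for the entropy-regularized transport cost between two point sets. Uniform marginals and the squared-Euclidean cost are assumptions; the paper's actual loss integrates such a term into its GAN objective.

```python
import math
import torch

def sinkhorn_distance(x, y, eps=0.05, iters=50):
    """Entropy-regularized optimal transport cost between two point
    sets with uniform weights, via log-domain Sinkhorn iterations."""
    cost = torch.cdist(x, y, p=2) ** 2          # (n, m) pairwise costs
    n, m = cost.shape
    log_mu = torch.full((n,), -math.log(n))     # uniform source marginal
    log_nu = torch.full((m,), -math.log(m))     # uniform target marginal
    K = -cost / eps
    u = torch.zeros(n)
    v = torch.zeros(m)
    for _ in range(iters):                      # alternating scalings
        u = log_mu - torch.logsumexp(K + v[None, :], dim=1)
        v = log_nu - torch.logsumexp(K + u[:, None], dim=0)
    plan = torch.exp(K + u[:, None] + v[None, :])
    return (plan * cost).sum()                  # regularized transport cost
```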

One More Step: A Versatile Plug-and-Play Module for Rectifying Diffusion Schedule Flaws and Enhancing Low-Frequency Controls

  • paper_url: http://arxiv.org/abs/2311.15744
  • repo_url: None
  • paper_authors: Minghui Hu, Jianbin Zheng, Chuanxia Zheng, Chaoyue Wang, Dacheng Tao, Tat-Jen Cham
  • for: Improving image quality and fidelity in diffusion models whose noise schedules leak residual signal at the final timestep.
  • methods: The paper proposes One More Step (OMS), which integrates a compact network and one simple yet effective extra step at inference time, reconciling the mismatch between training and inference while keeping the original model parameters unchanged (a worked check of the schedule flaw follows the abstract below).
  • results: OMS elevates image fidelity and harmonizes training and inference; once trained, a single OMS module can be shared by different pre-trained diffusion models operating in the same latent domain.
    Abstract It is well known that many open-released foundational diffusion models have difficulty in generating images that substantially depart from average brightness, despite such images being present in the training data. This is due to an inconsistency: while denoising starts from pure Gaussian noise during inference, the training noise schedule retains residual data even in the final timestep distribution, due to difficulties in numerical conditioning in mainstream formulation, leading to unintended bias during inference. To mitigate this issue, certain $\epsilon$-prediction models are combined with an ad-hoc offset-noise methodology. In parallel, some contemporary models have adopted zero-terminal SNR noise schedules together with $\mathbf{v}$-prediction, which necessitate major alterations to pre-trained models. However, such changes risk destabilizing a large multitude of community-driven applications anchored on these pre-trained models. In light of this, our investigation revisits the fundamental causes, leading to our proposal of an innovative and principled remedy, called One More Step (OMS). By integrating a compact network and incorporating an additional simple yet effective step during inference, OMS elevates image fidelity and harmonizes the dichotomy between training and inference, while preserving original model parameters. Once trained, various pre-trained diffusion models with the same latent domain can share the same OMS module.
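
The flaw the abstract describes, residual signal surviving in the final timestep distribution, can be checked numerically. The sketch below uses the common linear beta schedule (an assumption about the pretrained models in question) and shows that the terminal signal-to-noise ratio is small but nonzero.

```python
import numpy as np

# Standard linear beta schedule used by many pre-trained diffusion models.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

# SNR(t) = alpha_bar / (1 - alpha_bar); at t = T it would be 0 if the
# final latent were pure Gaussian noise, but it is not:
snr_T = alphas_bar[-1] / (1.0 - alphas_bar[-1])
print(f"alpha_bar_T = {alphas_bar[-1]:.2e}, terminal SNR = {snr_T:.2e}")
# alpha_bar_T ~ 4e-5 > 0: the last training step still leaks low-frequency
# signal (e.g., global brightness), which OMS compensates for by adding
# one extra learned step at inference.
```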

Machine Learning-Based Jamun Leaf Disease Detection: A Comprehensive Review

  • paper_url: http://arxiv.org/abs/2311.15741
  • repo_url: None
  • paper_authors: Auvick Chandra Bhowmik, Dr. Md. Taimur Ahad, Yousuf Rayhan Emon
  • for: Reviewing machine learning techniques for jamun leaf disease detection, with the goal of making diagnosis more efficient and accurate.
  • methods: The review covers a range of models, including Vision Transformer variants (TLMViT, SLViT, SE-ViT, IterationViT, Tiny-LeViT, IEM-ViT, GreenViT, PMViT) as well as Dense Convolutional Network (DenseNet), Residual Neural Network (ResNet)-50V2, EfficientNet, ensemble models, Convolutional Neural Networks (CNN), and the Locally Reversible Transformer.
  • results: These models have been evaluated on various datasets, demonstrating their real-world applicability.
    Abstract Jamun leaf diseases pose a significant threat to agricultural productivity, negatively impacting both yield and quality in the jamun industry. The advent of machine learning has opened up new avenues for tackling these diseases effectively. Early detection and diagnosis are essential for successful crop management. While no automated systems have yet been developed specifically for jamun leaf disease detection, various automated systems have been implemented for similar types of disease detection using image processing techniques. This paper presents a comprehensive review of machine learning methodologies employed for diagnosing plant leaf diseases through image classification, which can be adapted for jamun leaf disease detection. It meticulously assesses the strengths and limitations of various Vision Transformer models, including Transfer learning model and vision transformer (TLMViT), SLViT, SE-ViT, IterationViT, Tiny-LeViT, IEM-ViT, GreenViT, and PMViT. Additionally, the paper reviews models such as Dense Convolutional Network (DenseNet), Residual Neural Network (ResNet)-50V2, EfficientNet, Ensemble model, Convolutional Neural Network (CNN), and Locally Reversible Transformer. These machine-learning models have been evaluated on various datasets, demonstrating their real-world applicability. This review not only sheds light on current advancements in the field but also provides valuable insights for future research directions in machine learning-based jamun leaf disease detection and classification.

Optimization of Image Processing Algorithms for Character Recognition in Cultural Typewritten Documents

  • paper_url: http://arxiv.org/abs/2311.15740
  • repo_url: https://github.com/feup-infolab/archmine
  • paper_authors: Mariana Dias, Carla Teixeira Lopes
  • for: Evaluating the impact of image processing methods and parameter tuning on OCR applied to typewritten cultural heritage documents.
  • methods: The study casts parameter tuning as a multi-objective problem, minimizing Levenshtein edit distance and maximizing correctly identified words, and uses the non-dominated sorting genetic algorithm (NSGA-II) to tune the image processing parameters (a sketch of the objectives follows the abstract below).
  • results: The evaluation shows that parameterization by digital representation typology benefits the performance of image pre-processing algorithms in OCR, and that pre-processing helps most for typologies where recognition without it performs poorly. Specifically, Adaptive Thresholding, Bilateral Filter, and Opening are the best-performing algorithms for the theatre plays' covers, letters, and overall dataset, respectively, and should be applied before OCR to improve its performance.
    Abstract Linked Data is used in various fields as a new way of structuring and connecting data. Cultural heritage institutions have been using linked data to improve archival descriptions and facilitate the discovery of information. Most archival records have digital representations of physical artifacts in the form of scanned images that are non-machine-readable. Optical Character Recognition (OCR) recognizes text in images and translates it into machine-encoded text. This paper evaluates the impact of image processing methods and parameter tuning in OCR applied to typewritten cultural heritage documents. The approach uses a multi-objective problem formulation to minimize Levenshtein edit distance and maximize the number of words correctly identified with a non-dominated sorting genetic algorithm (NSGA-II) to tune the methods' parameters. Evaluation results show that parameterization by digital representation typology benefits the performance of image pre-processing algorithms in OCR. Furthermore, our findings suggest that employing image pre-processing algorithms in OCR might be more suitable for typologies where the text recognition task without pre-processing does not produce good results. In particular, Adaptive Thresholding, Bilateral Filter, and Opening are the best-performing algorithms for the theatre plays' covers, letters, and overall dataset, respectively, and should be applied before OCR to improve its performance.
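
The summary frames preprocessing tuning as a two-objective problem: minimize edit distance and maximize correctly identified words. Below is a minimal sketch of such an objective function, using OpenCV's adaptive thresholding and Tesseract as stand-ins for the paper's pipeline; the parameter encoding and the choice of preprocessing step are illustrative assumptions (both `opencv-python` and `pytesseract` need to be installed).

```python
import cv2
import pytesseract

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def objectives(params, image, ground_truth):
    """Two objectives for NSGA-II (both minimized): edit distance, and
    negative count of correctly found words. params = (block_size, C)."""
    block, c = params
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    binar = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                  cv2.THRESH_BINARY, block | 1, c)  # odd block
    text = pytesseract.image_to_string(binar)
    hits = sum(w in text.split() for w in ground_truth.split())
    return levenshtein(text, ground_truth), -hits
```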

GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?

  • paper_url: http://arxiv.org/abs/2311.15732
  • repo_url: https://github.com/whwu95/GPT4Vis
  • paper_authors: Wenhao Wu, Huanjin Yao, Mengxi Zhang, Yuxin Song, Wanli Ouyang, Jingdong Wang
  • for: This study presents no new method; instead it establishes a must-know baseline in light of recent Generative AI advances: using GPT-4 for visual understanding. It evaluates GPT-4's linguistic and visual capabilities in zero-shot visual recognition, asking whether GPT-4-generated rich textual descriptions can improve recognition without any training, and how well GPT-4 directly recognizes diverse visual content.
  • methods: An extensive series of experiments systematically quantifies GPT-4's performance across three modalities (images, videos, and point clouds) on 16 widely recognized benchmark datasets, reporting top-1 and top-5 accuracy (a sketch of description-augmented zero-shot classification follows the abstract below).
  • results: Leveraging GPT-4's advanced linguistic knowledge to generate rich descriptions markedly improves zero-shot recognition; in terms of visual proficiency, GPT-4V's average performance across the 16 datasets sits roughly between OpenAI-CLIP's ViT-L and EVA-CLIP's ViT-E. Code is available at https://github.com/whwu95/GPT4Vis.
    Abstract This paper does not present a novel method. Instead, it delves into an essential, yet must-know baseline in light of the latest advancements in Generative Artificial Intelligence (GenAI): the utilization of GPT-4 for visual understanding. Our study centers on the evaluation of GPT-4's linguistic and visual capabilities in zero-shot visual recognition tasks. Specifically, we explore the potential of its generated rich textual descriptions across various categories to enhance recognition performance without any training. Additionally, we evaluate its visual proficiency in directly recognizing diverse visual content. To achieve this, we conduct an extensive series of experiments, systematically quantifying the performance of GPT-4 across three modalities: images, videos, and point clouds. This comprehensive evaluation encompasses a total of 16 widely recognized benchmark datasets, providing top-1 and top-5 accuracy metrics. Our study reveals that leveraging GPT-4's advanced linguistic knowledge to generate rich descriptions markedly improves zero-shot recognition. In terms of visual proficiency, GPT-4V's average performance across 16 datasets sits roughly between the capabilities of OpenAI-CLIP's ViT-L and EVA-CLIP's ViT-E. We hope that this research will contribute valuable data points and experience for future studies. We release our code at https://github.com/whwu95/GPT4Vis.
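
A minimal sketch of the description-augmented zero-shot recognition described above: class scores come from averaging similarity against several rich per-class descriptions rather than a single class-name prompt. The two example descriptions are hypothetical stand-ins for GPT-4 output, and CLIP ViT-B/32 is used here only for brevity; each class must list the same number of descriptions for the reshape below.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical stand-ins for GPT-4-generated rich class descriptions.
descs = {
    "goldfish": ["a small orange fish with shiny metallic scales",
                 "a freshwater aquarium fish with flowing fins"],
    "tabby cat": ["a domestic cat with striped grey-brown fur",
                  "a cat with an M-shaped marking on its forehead"],
}

@torch.no_grad()
def classify(image: Image.Image) -> str:
    classes = list(descs)
    texts = [t for c in classes for t in descs[c]]
    inp = proc(text=texts, images=image, return_tensors="pt", padding=True)
    out = model(**inp)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    sims = (img @ txt.T).view(len(classes), -1).mean(dim=1)  # mean per class
    return classes[int(sims.argmax())]
```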

MARIS: Referring Image Segmentation via Mutual-Aware Attention Features

  • paper_url: http://arxiv.org/abs/2311.15727
  • repo_url: None
  • paper_authors: Mengxi Zhang, Yiming Liu, Xiangjun Yin, Huanjing Yue, Jingyu Yang
  • for: This paper targets Referring Image Segmentation (RIS), which aims to segment a specific region given a language expression prompt.
  • methods: The method builds on the Segment Anything Model (SAM) and introduces a mutual-aware attention mechanism with two parallel branches, Vision-Guided Attention and Language-Guided Attention, which bidirectionally model the relationship between visual and linguistic features; a Mask Decoder with a multi-modal query token provides explicit linguistic guidance (a sketch of the mutual-aware attention follows the abstract below).
  • results: Extensive experiments on three benchmark datasets show that the method outperforms previous RIS approaches.
    Abstract Referring image segmentation (RIS) aims to segment a particular region based on a language expression prompt. Existing methods incorporate linguistic features into visual features and obtain multi-modal features for mask decoding. However, these methods may segment the visually salient entity instead of the correct referring region, as the multi-modal features are dominated by the abundant visual context. In this paper, we propose MARIS, a referring image segmentation method that leverages the Segment Anything Model (SAM) and introduces a mutual-aware attention mechanism to enhance the cross-modal fusion via two parallel branches. Specifically, our mutual-aware attention mechanism consists of Vision-Guided Attention and Language-Guided Attention, which bidirectionally model the relationship between visual and linguistic features. Correspondingly, we design a Mask Decoder to enable explicit linguistic guidance for more consistent segmentation with the language expression. To this end, a multi-modal query token is proposed to integrate linguistic information and interact with visual information simultaneously. Extensive experiments on three benchmark datasets show that our method outperforms the state-of-the-art RIS methods. Our code will be publicly available.
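
A minimal sketch of the mutual-aware attention described above: two parallel cross-attention branches, one letting language attend to vision and one the reverse. The hidden size, head count, and residual fusion are assumptions; the paper's module additionally feeds a mask decoder through a multi-modal query token.

```python
import torch
import torch.nn as nn

class MutualAwareAttention(nn.Module):
    """Bidirectional cross-attention: vision-guided attention refines
    linguistic features with visual context; language-guided attention
    does the reverse. Both branches run in parallel."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.vis_guided = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lang_guided = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis, lang):      # vis: (B, N, D), lang: (B, L, D)
        lang2, _ = self.vis_guided(query=lang, key=vis, value=vis)
        vis2, _ = self.lang_guided(query=vis, key=lang, value=lang)
        return vis + vis2, lang + lang2  # residual fusion of both streams
```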

SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation

  • paper_url: http://arxiv.org/abs/2311.15707
  • repo_url: https://github.com/jiehonglin/sam-6d
  • paper_authors: Jiehong Lin, Lihua Liu, Dekun Lu, Kui Jia
  • for: This paper tackles zero-shot 6D object pose estimation in cluttered scenes, which demands strong model generalizability.
  • methods: The paper proposes SAM-6D, a framework that solves the task in two steps, instance segmentation and pose estimation, using two dedicated sub-networks operating on cluttered RGB-D images.
  • results: SAM-6D achieves strong 6D pose estimation on cluttered RGB-D images, outperforming existing methods on the seven core datasets of the BOP Benchmark for both instance segmentation and pose estimation of novel objects.
    Abstract Zero-shot 6D object pose estimation involves the detection of novel objects with their 6D poses in cluttered scenes, presenting significant challenges for model generalizability. Fortunately, the recent Segment Anything Model (SAM) has showcased remarkable zero-shot transfer performance, which provides a promising solution to tackle this task. Motivated by this, we introduce SAM-6D, a novel framework designed to realize the task through two steps, including instance segmentation and pose estimation. Given the target objects, SAM-6D employs two dedicated sub-networks, namely Instance Segmentation Model (ISM) and Pose Estimation Model (PEM), to perform these steps on cluttered RGB-D images. ISM takes SAM as an advanced starting point to generate all possible object proposals and selectively preserves valid ones through meticulously crafted object matching scores in terms of semantics, appearance and geometry. By treating pose estimation as a partial-to-partial point matching problem, PEM performs a two-stage point matching process featuring a novel design of background tokens to construct dense 3D-3D correspondence, ultimately yielding the pose estimates. Without bells and whistles, SAM-6D outperforms the existing methods on the seven core datasets of the BOP Benchmark for both instance segmentation and pose estimation of novel objects.

ArGue: Attribute-Guided Prompt Tuning for Vision-Language Models

  • paper_url: http://arxiv.org/abs/2311.16494
  • repo_url: None
  • paper_authors: Xinyu Tian, Shu Zou, Zhaoyuan Yang, Jing Zhang
  • for: Addressing the limitations of soft prompt tuning for vision-language (V&L) models under distribution shifts, improving their effectiveness on downstream tasks.
  • methods: The paper proposes Attribute-Guided Prompt Tuning (ArGue) with three contributions: 1) instead of directly appending soft prompts before class names, the model is aligned with primitive visual attributes generated by Large Language Models (LLMs), on the premise that high confidence in these attributes signals a capacity to discern correct class rationales; 2) attribute sampling removes disadvantageous attributes so that only semantically meaningful ones remain; 3) negative prompting explicitly enumerates class-agnostic attributes to surface spurious correlations, encouraging probability distributions that are highly orthogonal to these negative features.
  • results: In experiments, the method significantly outperforms state-of-the-art prompt tuning methods on both novel class prediction and out-of-distribution generalization tasks.
    Abstract Although soft prompt tuning is effective in efficiently adapting Vision-Language (V&L) models for downstream tasks, it shows limitations in dealing with distribution shifts. We address this issue with Attribute-Guided Prompt Tuning (ArGue), making three key contributions. 1) In contrast to the conventional approach of directly appending soft prompts preceding class names, we align the model with primitive visual attributes generated by Large Language Models (LLMs). We posit that a model's ability to express high confidence in these attributes signifies its capacity to discern the correct class rationales. 2) We introduce attribute sampling to eliminate disadvantageous attributes, thus only semantically meaningful attributes are preserved. 3) We propose negative prompting, explicitly enumerating class-agnostic attributes to activate spurious correlations and encourage the model to generate highly orthogonal probability distributions in relation to these negative features. In experiments, our method significantly outperforms current state-of-the-art prompt tuning methods on both novel class prediction and out-of-distribution generalization tasks.

Model-agnostic Body Part Relevance Assessment for Pedestrian Detection

  • paper_url: http://arxiv.org/abs/2311.15679
  • repo_url: None
  • paper_authors: Maurice Günder, Sneha Banerjee, Rafet Sifa, Christian Bauckhage
  • for: Explaining the predictions of deep learning models, specifically in the context of computer vision and object detection.
  • methods: The paper uses sampling-based explanation methods in the spirit of KernelSHAP to analyze model outputs, assessing body part relevance for pedestrian detection (a sketch follows the abstract below).
  • results: The paper presents a novel sampling-based method that is more efficient and robust for explainability analyses on large-scale datasets, demonstrated on a pedestrian detection task.
    Abstract Model-agnostic explanation methods for deep learning models are flexible regarding usability and availability. However, due to the fact that they can only manipulate input to see changes in output, they suffer from weak performance when used with complex model architectures. For models with large inputs as, for instance, in object detection, sampling-based methods like KernelSHAP are inefficient due to many computation-heavy forward passes through the model. In this work, we present a framework for using sampling-based explanation models in a computer vision context by body part relevance assessment for pedestrian detection. Furthermore, we introduce a novel sampling-based method similar to KernelSHAP that shows more robustness for lower sampling sizes and, thus, is more efficient for explainability analyses on large-scale datasets.
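
A minimal sketch of the sampling-based body-part relevance idea: randomly occlude part regions, record the detector's score, and fit a linear surrogate whose weights act as per-part relevance, in the spirit of KernelSHAP (uniform on/off sampling here omits the SHAP kernel weights). `score_fn` and the boolean `part_masks` are assumed inputs, not the paper's API.

```python
import numpy as np

def part_relevance(image, part_masks, score_fn, n_samples=256, seed=0):
    """Estimate per-part relevance by occluding random subsets of body
    parts (mean-fill) and regressing the detector score on the on/off
    pattern -- a simplified KernelSHAP-style estimate."""
    rng = np.random.default_rng(seed)
    p = len(part_masks)
    Z = rng.integers(0, 2, size=(n_samples, p))        # parts kept (1) or not
    fill = image.mean(axis=(0, 1))                     # mean-color baseline
    ys = []
    for z in Z:
        img = image.copy()
        for keep, mask in zip(z, part_masks):
            if not keep:
                img[mask] = fill                       # occlude this part
        ys.append(score_fn(img))                       # e.g. pedestrian score
    X = np.hstack([Z, np.ones((n_samples, 1))])        # add a bias column
    w, *_ = np.linalg.lstsq(X, np.array(ys), rcond=None)
    return w[:p]                                       # relevance per part
```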

HAVE-FUN: Human Avatar Reconstruction from Few-Shot Unconstrained Images

  • paper_url: http://arxiv.org/abs/2311.15672
  • repo_url: None
  • paper_authors: Xihe Yang, Xingyu Chen, Shaohui Wang, Daiheng Gao, Xiaoguang Han, Baoyuan Wang
  • for: This paper reconstructs human avatars from few-shot unconstrained photos.
  • methods: It integrates a skinning mechanism with deep marching tetrahedra (DMTet) to form a drivable tetrahedral representation for dynamic articulated data, and devises a two-phase optimization with few-shot reference and few-shot guidance to cope with the limited data.
  • results: The HaveFun framework supports avatar reconstruction, rendering, and animation; experiments on the developed benchmarks show substantially superior performance over other methods.
    Abstract For human avatar reconstruction, contemporary techniques commonly necessitate the acquisition of costly data and struggle to achieve satisfactory results from a small number of casual images. In this paper, we investigate this task from a few-shot unconstrained photo album. The reconstruction of human avatars from such data sources is challenging because of limited data amount and dynamic articulated poses. For handling dynamic data, we integrate a skinning mechanism with deep marching tetrahedra (DMTet) to form a drivable tetrahedral representation, which drives arbitrary mesh topologies generated by the DMTet for the adaptation of unconstrained images. To effectively mine instructive information from few-shot data, we devise a two-phase optimization method with few-shot reference and few-shot guidance. The former focuses on aligning avatar identity with reference images, while the latter aims to generate plausible appearances for unseen regions. Overall, our framework, called HaveFun, can undertake avatar reconstruction, rendering, and animation. Extensive experiments on our developed benchmarks demonstrate that HaveFun exhibits substantially superior performance in reconstructing the human body and hand. Project website: https://seanchenxy.github.io/HaveFunWeb/.

Deformation-Guided Unsupervised Non-Rigid Shape Matching

  • paper_url: http://arxiv.org/abs/2311.15668
  • repo_url: None
  • paper_authors: Aymen Merrouche, Joao Regateiro, Stefanie Wuhrer, Edmond Boyer
  • for: Unsupervised non-rigid shape matching, a fundamental step in many computer vision and graphics applications.
  • methods: The approach combines a hierarchical patch-based shape representation, matching shapes consistently from coarse to fine, with a patch-wise near-rigid deformation model that constrains the matching in 3D, yielding robustness to noise, including the topological noise of raw 3D scans.
  • results: On raw 3D scans the method obtains significantly better results than state-of-the-art approaches, while performing on par in standard test scenarios.
    Abstract We present an unsupervised data-driven approach for non-rigid shape matching. Shape matching identifies correspondences between two shapes and is a fundamental step in many computer vision and graphics applications. Our approach is designed to be particularly robust when matching shapes digitized using 3D scanners that contain fine geometric detail and suffer from different types of noise including topological noise caused by the coalescence of spatially close surface regions. We build on two strategies. First, using a hierarchical patch based shape representation we match shapes consistently in a coarse to fine manner, allowing for robustness to noise. This multi-scale representation drastically reduces the dimensionality of the problem when matching at the coarsest scale, rendering unsupervised learning feasible. Second, we constrain this hierarchical matching to be reflected in 3D by fitting a patch-wise near-rigid deformation model. Using this constraint, we leverage spatial continuity at different scales to capture global shape properties, resulting in matchings that generalize well to data with different deformations and noise characteristics. Experiments demonstrate that our approach obtains significantly better results on raw 3D scans than state-of-the-art methods, while performing on-par on standard test scenarios.

Technical Report for Argoverse Challenges on 4D Occupancy Forecasting

  • paper_url: http://arxiv.org/abs/2311.15660
  • repo_url: None
  • paper_authors: Pengfei Zheng, Kanokphan Lertniphonphan, Feng Chen, Siwei Chen, Bingchuan Sun, Jun Xie, Zhepeng Wang
  • for: This report addresses the 4D Occupancy Forecasting task of the Argoverse Challenges at the CVPR 2023 Workshop on Autonomous Driving (WAD).
  • methods: The solution combines a strong LiDAR-based Bird's Eye View (BEV) encoder with temporal fusion and a two-stage decoder consisting of a DETR head and a UNet decoder.
  • results: Tested on the Argoverse 2 sensor dataset to predict occupancy 3 seconds into the future, the solution achieves an 18% lower L1 error (3.57) than the baseline and took 1st place on the 4D Occupancy Forecasting task.
    Abstract This report presents our Le3DE2E_Occ solution for 4D Occupancy Forecasting in Argoverse Challenges at CVPR 2023 Workshop on Autonomous Driving (WAD). Our solution consists of a strong LiDAR-based Bird's Eye View (BEV) encoder with temporal fusion and a two-stage decoder, which combines a DETR head and a UNet decoder. The solution was tested on the Argoverse 2 sensor dataset to evaluate the occupancy state 3 seconds in the future. Our solution achieved 18% lower L1 Error (3.57) than the baseline and took 1st place on the 4D Occupancy Forecasting task in Argoverse Challenges at CVPR 2023.

Enhancing Diffusion Models with Text-Encoder Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2311.15657
  • repo_url: https://github.com/chaofengc/texforce
  • paper_authors: Chaofeng Chen, Annan Wang, Haoning Wu, Liang Liao, Wenxiu Sun, Qiong Yan, Weisi Lin
  • for: Improving the alignment between text prompts and generated images in diffusion models, and thereby image quality.
  • methods: Prior work refines the diffusion U-Net with human rewards via reinforcement learning or direct backpropagation, but mostly leaves the pretrained, frozen text encoder untouched. This paper instead finetunes the text encoder with reinforcement learning and low-rank adaptation, driven by task-specific rewards; the method is called TexForce (a LoRA sketch follows the abstract below).
  • results: Finetuning the text encoder improves diffusion model performance, and TexForce can simply be combined with existing finetuned U-Net models to obtain much better results without additional training; the method also adapts to diverse applications, including generating high-quality face and hand images.
    Abstract Text-to-image diffusion models are typically trained to optimize the log-likelihood objective, which presents challenges in meeting specific requirements for downstream tasks, such as image aesthetics and image-text alignment. Recent research addresses this issue by refining the diffusion U-Net using human rewards through reinforcement learning or direct backpropagation. However, many of them overlook the importance of the text encoder, which is typically pretrained and fixed during training. In this paper, we demonstrate that by finetuning the text encoder through reinforcement learning, we can enhance the text-image alignment of the results, thereby improving the visual quality. Our primary motivation comes from the observation that the current text encoder is suboptimal, often requiring careful prompt adjustment. While fine-tuning the U-Net can partially improve performance, it remains suffering from the suboptimal text encoder. Therefore, we propose to use reinforcement learning with low-rank adaptation to finetune the text encoder based on task-specific rewards, referred as \textbf{TexForce}. We first show that finetuning the text encoder can improve the performance of diffusion models. Then, we illustrate that TexForce can be simply combined with existing U-Net finetuned models to get much better results without additional training. Finally, we showcase the adaptability of our method in diverse applications, including the generation of high-quality face and hand images.
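
The method bullets mention finetuning the text encoder with reinforcement learning through low-rank adaptation. Below is a minimal LoRA wrapper sketch in PyTorch; how the reward gradient reaches these parameters (e.g., via a policy-gradient update on sampled images) is paper-specific and not shown, and the wrapped layer name in the usage comment is illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Low-rank adaptation of a frozen linear layer: only the rank-r
    factors A and B are trained, here driven by a task reward rather
    than a supervised loss (as in TexForce's RL finetuning)."""
    def __init__(self, base: nn.Linear, r=4, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False               # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # start at 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

# Hypothetical usage: wrap a projection inside the frozen text encoder.
# proj = LoRALinear(clip_text_encoder.final_proj)   # attribute name illustrative
```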

Mitigating Hallucination in Visual Language Models with Visual Supervision

  • paper_url: http://arxiv.org/abs/2311.16479
  • repo_url: None
  • paper_authors: Zhiyang Chen, Yousong Zhu, Yufei Zhan, Zhaowen Li, Chaoyang Zhao, Jinqiao Wang, Ming Tang
  • for: The paper aims to improve large vision-language models (LVLMs) by addressing hallucination in their responses.
  • methods: The authors generate image-text pairs with detailed relationship annotations from the panoptic scene graph dataset (PSG) and integrate SAM together with a mask prediction loss as auxiliary supervision; they also propose a new benchmark, RAH-Bench, to evaluate hallucination in LVLMs.
  • results: The approach yields a +8.4% enhancement over the original LLaVA model and widespread performance improvements across other models.
    Abstract Large vision-language models (LVLMs) suffer from hallucination a lot, generating responses that apparently contradict to the image content occasionally. The key problem lies in its weak ability to comprehend detailed content in a multi-modal context, which can be mainly attributed to two factors in training data and loss function. The vision instruction dataset primarily focuses on global description, and the auto-regressive loss function favors text modeling rather than image understanding. In this paper, we bring more detailed vision annotations and more discriminative vision models to facilitate the training of LVLMs, so that they can generate more precise responses without encounter hallucination. On one hand, we generate image-text pairs with detailed relationship annotations in panoptic scene graph dataset (PSG). These conversations pay more attention on detailed facts in the image, encouraging the model to answer questions based on multi-modal contexts. On the other hand, we integrate SAM and mask prediction loss as auxiliary supervision, forcing the LVLMs to have the capacity to identify context-related objects, so that they can generate more accurate responses, mitigating hallucination. Moreover, to provide a deeper evaluation on the hallucination in LVLMs, we propose a new benchmark, RAH-Bench. It divides vision hallucination into three different types that contradicts the image with wrong categories, attributes or relations, and introduces False Positive Rate as detailed sub-metric for each type. In this benchmark, our approach demonstrates an +8.4% enhancement compared to original LLaVA and achieves widespread performance improvements across other models.

PaintNeSF: Artistic Creation of Stylized Scenes with Vectorized 3D Strokes

  • paper_url: http://arxiv.org/abs/2311.15637
  • repo_url: None
  • paper_authors: Hao-Bin Duan, Miao Wang, Yan-Xun Li, Yong-Liang Yang
  • for: Generating stylized images of a 3D scene at arbitrary novel views.
  • methods: The method simulates the progressive painting process of human artwork with vector strokes: a palette of stylized 3D strokes is built from basic primitives and splines, and 3D scene stylization is treated as a multi-view reconstruction from these 3D stroke primitives, optimized with a differentiable renderer.
  • results: The approach effectively synthesizes 3D scenes with significant geometric and aesthetic stylization while maintaining a consistent appearance across different views.
    Abstract We present Paint Neural Stroke Field (PaintNeSF), a novel technique to generate stylized images of a 3D scene at arbitrary novel views from multi-view 2D images. Different from existing methods which apply stylization to trained neural radiance fields at the voxel level, our approach draws inspiration from image-to-painting methods, simulating the progressive painting process of human artwork with vector strokes. We develop a palette of stylized 3D strokes from basic primitives and splines, and consider the 3D scene stylization task as a multi-view reconstruction process based on these 3D stroke primitives. Instead of directly searching for the parameters of these 3D strokes, which would be too costly, we introduce a differentiable renderer that allows optimizing stroke parameters using gradient descent, and propose a training scheme to alleviate the vanishing gradient issue. The extensive evaluation demonstrates that our approach effectively synthesizes 3D scenes with significant geometric and aesthetic stylization while maintaining a consistent appearance across different views. Our method can be further integrated with style loss and image-text contrastive models to extend its applications, including color transfer and text-driven 3D scene drawing.

Only Positive Cases: 5-fold High-order Attention Interaction Model for Skin Segmentation Derived Classification

  • paper_url: http://arxiv.org/abs/2311.15625
  • repo_url: https://github.com/wurenkai/MHA-UNet
  • paper_authors: Renkai Wu, Yinghao Liu, Pengchen Liang, Qing Chang
  • for: A computer-aided diagnosis tool for skin lesions that lets dermatologists and patients intuitively understand the model's learning and prediction process.
  • methods: The paper proposes a multiple high-order attention interaction model (MHA-UNet) that infers the presence or absence of a lesion by explainable reasoning, without training on negative samples. It introduces a high-order attention interaction mechanism that brings squeeze attention to higher-order feature attention, and an MHAblock module that combines features of different orders.
  • results: Based on explainable reasoning over the interaction of MHAblock's 5 attention orders, classification experiments without negative samples reach a highest positive detection rate of 81.0% and a highest negative detection rate of 83.5%; comparisons with 13 medical segmentation models and external validation against 8 state-of-the-art models on three public datasets and a clinical dataset demonstrate state-of-the-art performance.
    Abstract Computer-aided diagnosis of skin diseases is an important tool. However, the interpretability of computer-aided diagnosis is currently poor. Dermatologists and patients cannot intuitively understand the learning and prediction process of neural networks, which will lead to a decrease in the credibility of computer-aided diagnosis. In addition, traditional methods need to be trained using negative samples in order to predict the presence or absence of a lesion, but medical data is often in short supply. In this paper, we propose a multiple high-order attention interaction model (MHA-UNet) for use in a highly explainable skin lesion segmentation task. MHA-UNet is able to obtain the presence or absence of a lesion by explainable reasoning without the need for training on negative samples. Specifically, we propose a high-order attention interaction mechanism that introduces squeeze attention to a higher level for feature attention. In addition, a multiple high-order attention interaction (MHAblock) module is proposed by combining the different features of different orders. For classifying the presence or absence of lesions, we conducted classification experiments on several publicly available datasets in the absence of negative samples, based on explainable reasoning about the interaction of 5 attention orders of MHAblock. The highest positive detection rate obtained from the experiments was 81.0% and the highest negative detection rate was 83.5%. For segmentation experiments, comparison experiments of the proposed method with 13 medical segmentation models and external validation experiments with 8 state-of-the-art models in three public datasets and our clinical dataset demonstrate the state-of-the-art performance of our model. The code is available from https://github.com/wurenkai/MHA-UNet.

Technical Report for Argoverse Challenges on Unified Sensor-based Detection, Tracking, and Forecasting

  • paper_url: http://arxiv.org/abs/2311.15615
  • repo_url: None
  • paper_authors: Zhepeng Wang, Feng Chen, Kanokphan Lertniphonphan, Siwei Chen, Jinyao Bao, Pengfei Zheng, Jinbao Zhang, Kaer Huang, Tao Zhang
  • for: This report presents a unified sensor-based solution for detection, tracking, and forecasting, developed for the Argoverse Challenges at the CVPR 2023 Workshop on Autonomous Driving (WAD).
  • methods: A unified network model integrates the three tasks of detection, tracking, and forecasting, adopting a strong Bird's Eye View (BEV) encoder with spatial and temporal fusion to generate a shared multi-task representation.
  • results: The solution was tested on the Argoverse 2 sensor dataset and achieved 1st place on the E2E Forecasting track for detection, tracking, and forecasting over 26 object categories.
    Abstract This report presents our Le3DE2E solution for unified sensor-based detection, tracking, and forecasting in Argoverse Challenges at CVPR 2023 Workshop on Autonomous Driving (WAD). We propose a unified network that incorporates three tasks, including detection, tracking, and forecasting. This solution adopts a strong Bird's Eye View (BEV) encoder with spatial and temporal fusion and generates unified representations for multi-tasks. The solution was tested in the Argoverse 2 sensor dataset to evaluate the detection, tracking, and forecasting of 26 object categories. We achieved 1st place in Detection, Tracking, and Forecasting on the E2E Forecasting track in Argoverse Challenges at CVPR 2023 WAD.

RetouchUAA: Unconstrained Adversarial Attack via Image Retouching

  • paper_url: http://arxiv.org/abs/2311.16478
  • repo_url: None
  • paper_authors: Mengda Xie, Yiling He, Meie Fang
  • for: RetouchUAA is designed to attack deep neural networks (DNNs) by exploiting image retouching styles, which are more realistic and interpretable than traditional attacks.
  • methods: RetouchUAA uses a custom-designed image retouching attack framework and a retouching style guidance module to generate realistic and interpretable perturbations. The framework linearizes images and models human retouching behavior, while the guidance module ensures the perturbations follow standard retouching styles.
  • results: RetouchUAA achieves nearly 100% white-box attack success against three DNNs on ImageNet and Place365, with a better trade-off between image naturalness, transferability, and defense robustness than baseline attacks.
    Abstract Deep Neural Networks (DNNs) are susceptible to adversarial examples. Conventional attacks generate controlled noise-like perturbations that fail to reflect real-world scenarios and are hard to interpret. In contrast, recent unconstrained attacks mimic natural image transformations occurring in the real world for perceptible but inconspicuous attacks, yet compromise realism by neglecting image post-processing and leaving the attack direction uncontrolled. In this paper, we propose RetouchUAA, an unconstrained attack that exploits a real-life perturbation: image retouching styles, highlighting its potential threat to DNNs. Compared to existing attacks, RetouchUAA offers several notable advantages. Firstly, RetouchUAA excels in generating interpretable and realistic perturbations through two key designs: the image retouching attack framework and the retouching style guidance module. The former, a custom-designed human-interpretable retouching framework for adversarial attack that linearizes images while modelling the local processing and retouching decision-making of human retouching behaviour, provides an explicit and reasonable pipeline for understanding the robustness of DNNs against retouching. The latter guides the adversarial image towards standard retouching styles, thereby ensuring its realism. Secondly, owing to the design of the retouching decision regularization and the persistent attack strategy, RetouchUAA also exhibits outstanding attack capability and defense robustness, posing a heavy threat to DNNs. Experiments on ImageNet and Place365 reveal that RetouchUAA achieves nearly 100% white-box attack success against three DNNs, while achieving a better trade-off between image naturalness, transferability and defense robustness than baseline attacks.

Animatable 3D Gaussian: Fast and High-Quality Reconstruction of Multiple Human Avatars

  • paper_url: http://arxiv.org/abs/2311.16482
  • repo_url: https://github.com/jimmyYliu/Animatable-3D-Gaussian
  • paper_authors: Yang Liu, Xiang Huang, Minghan Qin, Qinwei Lin, Haoqian Wang
  • for: Reconstructing high-quality drivable human avatars, whose training and rendering are otherwise expensive.
  • methods: 3D Gaussians are used to learn human pose and shape, extending 3D Gaussians to dynamic human scenes.
  • results: The method achieves high-quality reconstruction and novel view synthesis across viewpoints and poses, at considerably lower training and rendering cost.
    Abstract Neural radiance fields are capable of reconstructing high-quality drivable human avatars but are expensive to train and render. To reduce consumption, we propose Animatable 3D Gaussian, which learns human avatars from input images and poses. We extend 3D Gaussians to dynamic human scenes by modeling a set of skinned 3D Gaussians and a corresponding skeleton in canonical space and deforming 3D Gaussians to posed space according to the input poses. We introduce hash-encoded shape and appearance to speed up training and propose time-dependent ambient occlusion to achieve high-quality reconstructions in scenes containing complex motions and dynamic shadows. On both novel view synthesis and novel pose synthesis tasks, our method outperforms existing methods in terms of training time, rendering speed, and reconstruction quality. Our method can be easily extended to multi-human scenes and achieve comparable novel view synthesis results on a scene with ten people in only 25 seconds of training.

A manometric feature descriptor with linear-SVM to distinguish esophageal contraction vigor

  • paper_url: http://arxiv.org/abs/2311.15609
  • repo_url: None
  • paper_authors: Jialin Liu, Lu Yan, Xiaowei Liu, Yuzhuo Dai, Fanggen Lu, Yuanting Ma, Muzhou Hou, Zheng Wang
  • for: Clinical diagnosis and assessment of esophageal motility disorders.
  • methods: High-resolution manometry (HRM) measures the dynamic function of the esophagus, and image processing techniques predict esophageal contraction vigor.
  • results: Feature Extraction with Histogram of Gradients (FE-HOG) analyses esophageal swallow features, and a linear support vector machine (linear-SVM) performs the classification, achieving a high accuracy of 86.83%, above other common machine learning methods.
    Abstract In clinical practice, if a patient presents with nonmechanical obstructive dysphagia, esophageal chest pain, and gastroesophageal reflux symptoms, the physician will usually assess esophageal dynamic function. High-resolution manometry (HRM) is a clinically common technique for detecting esophageal dynamic function comprehensively and objectively. However, after the results of HRM are obtained, doctors still need to evaluate a variety of parameters. This work is burdensome, and the process is complex. We applied image processing to HRM to predict esophageal contraction vigor and assist the evaluation of esophageal dynamic function. Firstly, we used Feature Extraction and Histogram of Gradients (FE-HOG) to analyse features of the proposal of swallow (PoS) and extract higher-order features. Then we classify esophageal contraction vigor as normal, weak, or failed using a linear SVM on these features. Our data set includes 3000 training samples, 500 validation samples and 411 test samples. After verification our accuracy reaches 86.83%, which is higher than other common machine learning methods.
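The classification pipeline described here, gradient-histogram features fed to a linear SVM, is standard and can be sketched with scikit-image and scikit-learn. The HOG parameters and the three vigor labels below are illustrative placeholders rather than the paper's actual FE-HOG configuration:

```python
import numpy as np
from skimage.feature import hog
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

def hog_features(image: np.ndarray) -> np.ndarray:
    # Gradient-histogram descriptor of one grayscale swallow image.
    return hog(image, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), feature_vector=True)

rng = np.random.default_rng(0)
# Stand-in data: real inputs would be cropped proposal-of-swallow HRM images.
images = rng.random((60, 64, 64))
labels = rng.integers(0, 3, size=60)        # 0 = normal, 1 = weak, 2 = failed

X = np.stack([hog_features(im) for im in images])
clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0))
clf.fit(X[:48], labels[:48])
print("held-out accuracy:", clf.score(X[48:], labels[48:]))
```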

2D Feature Distillation for Weakly- and Semi-Supervised 3D Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2311.15605
  • repo_url: None
  • paper_authors: Ozan Unal, Dengxin Dai, Lukas Hoyer, Yigit Baran Can, Luc Van Gool
  • for: Reducing the need for large-scale annotated datasets in LiDAR semantic segmentation; the new method lowers the annotation requirement via weakly-supervised training.
  • methods: RGB images provide a denser scene representation, and high-level feature information is distilled from a domain-adapted, synthetically trained 2D semantic segmentation network. A one-way contrastive learning scheme and a mixing strategy called FOVMix compensate for the horizontal field-of-view mismatch between the LiDAR and the RGB camera.
  • results: IGNet achieves state-of-the-art weakly-supervised LiDAR semantic segmentation on ScribbleKITTI, reaching 98% relative performance to fully supervised training with only 8% labeled points; the contributions are also shown to be state-of-the-art for semi-supervised training.
    Abstract As 3D perception problems grow in popularity and the need for large-scale labeled datasets for LiDAR semantic segmentation increase, new methods arise that aim to reduce the necessity for dense annotations by employing weakly-supervised training. However these methods continue to show weak boundary estimation and high false negative rates for small objects and distant sparse regions. We argue that such weaknesses can be compensated by using RGB images which provide a denser representation of the scene. We propose an image-guidance network (IGNet) which builds upon the idea of distilling high level feature information from a domain adapted synthetically trained 2D semantic segmentation network. We further utilize a one-way contrastive learning scheme alongside a novel mixing strategy called FOVMix, to combat the horizontal field-of-view mismatch between the two sensors and enhance the effects of image guidance. IGNet achieves state-of-the-art results for weakly-supervised LiDAR semantic segmentation on ScribbleKITTI, boasting up to 98% relative performance to fully supervised training with only 8% labeled points, while introducing no additional annotation burden or computational/memory cost during inference. Furthermore, we show that our contributions also prove effective for semi-supervised training, where IGNet claims state-of-the-art results on both ScribbleKITTI and SemanticKITTI.

Progressive Target-Styled Feature Augmentation for Unsupervised Domain Adaptation on Point Clouds

  • paper_url: http://arxiv.org/abs/2311.16474
  • repo_url: https://github.com/xiaoyao3302/ptsfa
  • paper_authors: Zicheng Wang, Zhen Zhao, Yiming Wu, Luping Zhou, Dong Xu
  • for: This work addresses unsupervised domain adaptation for point cloud analysis, where models often suffer from domain shift in new scenarios and perform poorly.
  • methods: A new method, progressive target-styled feature augmentation (PTSFA), is proposed. Unlike previous approaches that adapt the feature extractor, it adapts the classifier, so that the classifier learns to recognize target-styled source features.
  • results: The method is validated on benchmark datasets, where it achieves new state-of-the-art performance.
    Abstract Unsupervised domain adaptation is a critical challenge in the field of point cloud analysis, as models trained on one set of data often struggle to perform well in new scenarios due to domain shifts. Previous works tackle the problem by using adversarial training or self-supervised learning for feature extractor adaptation, but ensuring that features extracted from the target domain can be distinguished by the source-supervised classifier remains challenging. In this work, we propose a novel approach called progressive target-styled feature augmentation (PTSFA). Unlike previous works that focus on feature extractor adaptation, our PTSFA approach focuses on classifier adaptation. It aims to empower the classifier to recognize target-styled source features and progressively adapt to the target domain. To enhance the reliability of predictions within the PTSFA framework and encourage discriminative feature extraction, we further introduce a new intermediate domain approaching (IDA) strategy. We validate our method on the benchmark datasets, where our method achieves new state-of-the-art performance. Our code is available at https://github.com/xiaoyao3302/PTSFA.
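The abstract does not spell out how source features are "target-styled". One common recipe, used here purely as an illustrative stand-in for PTSFA's actual formulation, is AdaIN-style re-normalization toward target-domain statistics under a progressive schedule:

```python
import torch

def target_styled_augment(src_feats, tgt_feats, t, eps=1e-6):
    """Re-style source features toward target statistics.
    src_feats / tgt_feats: (N, C) batches of pooled features.
    t in [0, 1]: training progress; t=0 keeps the source style,
    t=1 applies the full target style (our assumed schedule)."""
    mu_s, sd_s = src_feats.mean(0), src_feats.std(0) + eps
    mu_t, sd_t = tgt_feats.mean(0), tgt_feats.std(0) + eps
    # Interpolate the style statistics progressively over training.
    mu = (1 - t) * mu_s + t * mu_t
    sd = (1 - t) * sd_s + t * sd_t
    return (src_feats - mu_s) / sd_s * sd + mu

src = torch.randn(32, 256) * 2.0 + 1.0
tgt = torch.randn(32, 256) * 0.5 - 1.0
aug = target_styled_augment(src, tgt, t=0.5)
print(aug.mean().item(), aug.std().item())  # drifts toward target statistics
```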

LFSRDiff: Light Field Image Super-Resolution via Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.16517
  • repo_url: https://github.com/chaowentao/lfsrdiff
  • paper_authors: Wentao Chao, Fuqing Duan, Xuechun Wang, Yingqian Wang, Guanghui Wang
  • for: This work targets light field (LF) image super-resolution (SR), which is challenging due to the inherently ill-posed nature of LF images.
  • methods: The proposed approach, LFSRDiff, incorporates a disentangled U-Net into diffusion models to effectively extract and fuse spatial and angular information within LF images.
  • results: The approach consistently produces diverse and realistic SR results, achieving the highest perceptual score in terms of LPIPS and demonstrating effective control of the trade-off between perception and distortion.
    Abstract Light field (LF) image super-resolution (SR) is a challenging problem due to its inherent ill-posed nature, where a single low-resolution (LR) input LF image can correspond to multiple potential super-resolved outcomes. Despite this complexity, mainstream LF image SR methods typically adopt a deterministic approach, generating only a single output supervised by pixel-wise loss functions. This tendency often results in blurry and unrealistic results. Although diffusion models can capture the distribution of potential SR results by iteratively predicting Gaussian noise during the denoising process, they are primarily designed for general images and struggle to effectively handle the unique characteristics and information present in LF images. To address these limitations, we introduce LFSRDiff, the first diffusion-based LF image SR model, by incorporating the LF disentanglement mechanism. Our novel contribution includes the introduction of a disentangled U-Net for diffusion models, enabling more effective extraction and fusion of both spatial and angular information within LF images. Through comprehensive experimental evaluations and comparisons with the state-of-the-art LF image SR methods, the proposed approach consistently produces diverse and realistic SR results. It achieves the highest perceptual metric in terms of LPIPS. It also demonstrates the ability to effectively control the trade-off between perception and distortion. The code is available at \url{https://github.com/chaowentao/LFSRDiff}.
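"Disentangled" light-field processing usually means alternating operations over the spatial (H, W) and angular (U, V) dimensions of the 4D light field. The PyTorch sketch below illustrates that idea on a 6D LF tensor; the actual reshapes and kernels inside LFSRDiff's disentangled U-Net may differ:

```python
import torch
import torch.nn as nn

class SpatialAngularBlock(nn.Module):
    """Apply a 2D conv over spatial dims (H, W), then over angular dims (U, V),
    on a light field tensor shaped (B, C, U, V, H, W)."""
    def __init__(self, channels: int):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, 3, padding=1)
        self.angular = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, lf: torch.Tensor) -> torch.Tensor:
        b, c, u, v, h, w = lf.shape
        # Spatial pass: fold the angular dims into the batch.
        x = lf.permute(0, 2, 3, 1, 4, 5).reshape(b * u * v, c, h, w)
        x = self.spatial(x).reshape(b, u, v, c, h, w)
        # Angular pass: fold the spatial dims into the batch.
        x = x.permute(0, 4, 5, 3, 1, 2).reshape(b * h * w, c, u, v)
        x = self.angular(x).reshape(b, h, w, c, u, v)
        return x.permute(0, 3, 4, 5, 1, 2)  # back to (B, C, U, V, H, W)

lf = torch.randn(1, 16, 5, 5, 32, 32)  # 5x5 angular views of 32x32 patches
print(SpatialAngularBlock(16)(lf).shape)  # torch.Size([1, 16, 5, 5, 32, 32])
```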

An Ensemble of 2.5D ResUnet Based Models for Segmentation for Kidney and Masses

  • paper_url: http://arxiv.org/abs/2311.15586
  • repo_url: None
  • paper_authors: Cancan Chen, RongguoZhang
  • for: This work proposes an efficient method for segmenting kidney, kidney tumor, and kidney cyst in computed tomography (CT) scans.
  • methods: An efficient coarse-to-fine semantic segmentation framework is built with 2.5D ResUnet. 489 CT scans are used for training and validation, and an independent, never-before-used set of CT scans for testing.
  • results: The method achieves high dice values (0.954, 0.792, 0.691) and surface dice values (0.897, 0.591, 0.541) with a low average inference time (20.65 s per scan) and maximum GPU memory footprint (3525 MB), indicating a good balance between model performance and efficiency.
    Abstract The automatic segmentation of kidney, kidney tumor and kidney cyst on Computed Tomography (CT) scans is a challenging task due to the indistinct lesion boundaries and fuzzy texture. Considering the large range and unbalanced distribution of CT scan thickness, a 2.5D ResUnet is adopted to build an efficient coarse-to-fine semantic segmentation framework in this work. A set of 489 CT scans is used for training and validation, and an independent, never-before-used set of CT scans for testing. Finally, we demonstrate the effectiveness of our proposed method. The dice values on the test set are 0.954, 0.792, 0.691 and the surface dice values are 0.897, 0.591, 0.541 for kidney, tumor and cyst, respectively. The average inference time per CT scan is 20.65 s and the maximum GPU memory is 3525 MB. The results suggest a good trade-off between model performance and efficiency.
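"2.5D" segmentation commonly means feeding each target slice together with a few neighboring slices as input channels, giving limited through-plane context at 2D cost. A minimal numpy sketch of that input construction (the slab width is our choice, not necessarily the paper's):

```python
import numpy as np

def make_25d_inputs(volume: np.ndarray, context: int = 2) -> np.ndarray:
    """Turn a CT volume (D, H, W) into per-slice 2.5D inputs of shape
    (D, 2*context+1, H, W): each slice plus `context` neighbors on each side,
    with edge slices padded by repetition."""
    padded = np.pad(volume, ((context, context), (0, 0), (0, 0)), mode="edge")
    slabs = [padded[i:i + 2 * context + 1] for i in range(volume.shape[0])]
    return np.stack(slabs)

ct = np.random.rand(40, 128, 128).astype(np.float32)
inputs = make_25d_inputs(ct, context=2)
print(inputs.shape)  # (40, 5, 128, 128): a 5-channel input per target slice
```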

A deep learning approach for marine snow synthesis and removal

  • paper_url: http://arxiv.org/abs/2311.15584
  • repo_url: https://github.com/fergaletto/mssr
  • paper_authors: Fernando Galetto, Guang Deng
  • for: Improving the visibility of underwater images and the performance of machine vision systems by removing marine snow contamination.
  • methods: Deep learning in two stages: realistic marine snow samples are first synthesized and combined with natural underwater images into a paired dataset; a U-Net model then performs marine snow removal as an image-to-image translation task.
  • results: Experiments show the U-Net removes both natural and synthetic marine snow with high accuracy, outperforming existing methods such as the median filter and its adaptive variant. Robustness is also demonstrated on the MSRB dataset, which contains synthetic artifacts the model never saw during training.
    Abstract Marine snow, the floating particles in underwater images, severely degrades the visibility and performance of human and machine vision systems. This paper proposes a novel method to reduce the marine snow interference using deep learning techniques. We first synthesize realistic marine snow samples by training a Generative Adversarial Network (GAN) model and combine them with natural underwater images to create a paired dataset. We then train a U-Net model to perform marine snow removal as an image to image translation task. Our experiments show that the U-Net model can effectively remove both synthetic and natural marine snow with high accuracy, outperforming state-of-the-art methods such as the Median filter and its adaptive variant. We also demonstrate the robustness of our method by testing it on the MSRB dataset, which contains synthetic artifacts that our model has not seen during training. Our method is a practical and efficient solution for enhancing underwater images affected by marine snow.
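Once the GAN has produced paired data, the removal network reduces to ordinary supervised image-to-image training. A minimal PyTorch training step, assuming additive snow compositing and an L1 objective (both assumptions; the tiny stand-in net replaces the real U-Net):

```python
import torch
import torch.nn as nn

# Stand-in "U-Net": any image-to-image net with matching in/out shape works here.
unet = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1),
)
optimizer = torch.optim.Adam(unet.parameters(), lr=1e-4)
l1 = nn.L1Loss()

def train_step(clean: torch.Tensor, snow_overlay: torch.Tensor) -> float:
    """One supervised step: input = clean image corrupted with (GAN-synthesized)
    marine snow, target = the original clean image."""
    corrupted = torch.clamp(clean + snow_overlay, 0.0, 1.0)
    restored = unet(corrupted)
    loss = l1(restored, clean)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

clean = torch.rand(4, 3, 64, 64)
snow = (torch.rand(4, 3, 64, 64) > 0.98).float()   # sparse bright particles
print(train_step(clean, snow))
```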

Real Time GAZED: Online Shot Selection and Editing of Virtual Cameras from Wide-Angle Monocular Video Recordings

  • paper_url: http://arxiv.org/abs/2311.15581
  • repo_url: None
  • paper_authors: Sudheer Achary, Rohit Girmaji, Adhiraj Anil Deshmukh, Vineet Gandhi
  • for: Real-time video editing and camera trajectory stabilization.
  • methods: Real-time video editing built on the GAZED framework combined with the CineFilter technique.
  • results: Higher-quality video output than baseline methods, with a user study confirming the aesthetic quality of the edits.
    Abstract Eliminating time-consuming post-production processes and delivering high-quality videos in today's fast-paced digital landscape are the key advantages of real-time approaches. To address these needs, we present Real Time GAZED: a real-time adaptation of the GAZED framework integrated with CineFilter, a novel real-time camera trajectory stabilization approach. It enables users to create professionally edited videos in real-time. Comparative evaluations against baseline methods, including the non-real-time GAZED, demonstrate that Real Time GAZED achieves similar editing results, ensuring high-quality video output. Furthermore, a user study confirms the aesthetic quality of the video edits produced by the Real Time GAZED approach. With these advancements in real-time camera trajectory optimization and video editing presented, the demand for immediate and dynamic content creation in industries such as live broadcasting, sports coverage, news reporting, and social media content creation can be met more efficiently.

EucliDreamer: Fast and High-Quality Texturing for 3D Models with Stable Diffusion Depth

  • paper_url: http://arxiv.org/abs/2311.15573
  • repo_url: None
  • paper_authors: Cindy Le, Congrui Hetang, Ang Cao, Yihui He
  • for: This paper presents a new method for generating textures for 3D models from text prompts and meshes.
  • methods: The method runs the Score Distillation Sampling (SDS) process while taking additional depth information into account.
  • results: The model generates more satisfactory results and can produce different art styles for the same object, while generating textures of comparable quality faster. Detailed ablation studies examine how sampling steps, guidance scale, negative prompts, data augmentation, elevation range, and alternatives to SDS affect generation quality.
    Abstract This paper presents a novel method to generate textures for 3D models given text prompts and 3D meshes. Additional depth information is taken into account to perform the Score Distillation Sampling (SDS) process [28] with depth conditional Stable Diffusion [34]. We ran our model over the open-source dataset Objaverse [7] and conducted a user study to compare the results with those of various 3D texturing methods. We have shown that our model can generate more satisfactory results and produce various art styles for the same object. In addition, we achieved faster time when generating textures of comparable quality. We also conduct thorough ablation studies of how different factors may affect generation quality, including sampling steps, guidance scale, negative prompts, data augmentation, elevation range, and alternatives to SDS.
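SDS itself is well documented (it originates in DreamFusion); the sketch below shows its gradient computation with a stubbed depth-conditioned denoiser standing in for Stable Diffusion's depth model. The noise schedule and weighting are simplified for illustration:

```python
import torch

def sds_grad(rendered, depth, denoiser, num_steps=1000):
    """Schematic Score Distillation Sampling gradient for one rendered view.
    `denoiser(x_t, t, depth)` stands in for a depth-conditioned diffusion
    model's noise prediction (classifier-free guidance omitted for brevity)."""
    t = torch.randint(20, num_steps, (1,))
    alpha_bar = torch.cos(0.5 * torch.pi * t / num_steps) ** 2   # toy schedule
    noise = torch.randn_like(rendered)
    x_t = alpha_bar.sqrt() * rendered + (1 - alpha_bar).sqrt() * noise
    with torch.no_grad():
        eps_pred = denoiser(x_t, t, depth)
    w = 1 - alpha_bar                                  # a common weighting
    return w * (eps_pred - noise)                      # d(loss)/d(rendered)

denoiser = lambda x, t, d: torch.randn_like(x)         # stub network
rendered = torch.rand(1, 3, 64, 64, requires_grad=True)
depth = torch.rand(1, 1, 64, 64)
rendered.backward(sds_grad(rendered, depth, denoiser)) # inject grad, as in SDS
print(rendered.grad.abs().mean().item())
```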

Video-based Visible-Infrared Person Re-Identification with Auxiliary Samples

  • paper_url: http://arxiv.org/abs/2311.15571
  • repo_url: https://github.com/dyhbupt/buptcampus
  • paper_authors: Yunhao Du, Cheng Lei, Zhicheng Zhao, Yuan Dong, Fei Su
  • for: This paper addresses person retrieval and tracking in 24-hour surveillance systems by matching people across visible and infrared cameras.
  • methods: A baseline two-stream pipeline with a curriculum learning strategy, plus a novel temporal k-reciprocal re-ranking method to refine the ranking results.
  • results: Experiments demonstrate the clear effectiveness of the proposed method, which also outperforms 9 reproduced state-of-the-art image-based and video-based VI-ReID methods.
    Abstract Visible-infrared person re-identification (VI-ReID) aims to match persons captured by visible and infrared cameras, allowing person retrieval and tracking in 24-hour surveillance systems. Previous methods focus on learning from cross-modality person images in different cameras. However, temporal information and single-camera samples tend to be neglected. To crack this nut, in this paper, we first contribute a large-scale VI-ReID dataset named BUPTCampus. Different from most existing VI-ReID datasets, it 1) collects tracklets instead of images to introduce rich temporal information, 2) contains pixel-aligned cross-modality sample pairs for better modality-invariant learning, 3) provides one auxiliary set to help enhance the optimization, in which each identity only appears in a single camera. Based on our constructed dataset, we present a two-stream framework as baseline and apply Generative Adversarial Network (GAN) to narrow the gap between the two modalities. To exploit the advantages introduced by the auxiliary set, we propose a curriculum learning based strategy to jointly learn from both primary and auxiliary sets. Moreover, we design a novel temporal k-reciprocal re-ranking method to refine the ranking list with fine-grained temporal correlation cues. Experimental results demonstrate the effectiveness of the proposed methods. We also reproduce 9 state-of-the-art image-based and video-based VI-ReID methods on BUPTCampus and our methods show substantial superiority to them. The codes and dataset are available at: https://github.com/dyhBUPT/BUPTCampus.
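The temporal k-reciprocal re-ranking builds on standard k-reciprocal re-ranking (Zhong et al., CVPR 2017). A minimal sketch of the core k-reciprocal neighbor computation on a distance matrix; the paper's temporal correlation cues are not reproduced here:

```python
import numpy as np

def k_reciprocal_neighbors(dist: np.ndarray, k: int = 20):
    """For each sample i, keep j in R(i, k) iff j is within i's k nearest
    neighbors AND i is within j's k nearest neighbors."""
    knn = np.argsort(dist, axis=1)[:, :k]                  # row-wise k-NN
    sets = [set(row) for row in knn]
    reciprocal = []
    for i in range(dist.shape[0]):
        reciprocal.append({j for j in sets[i] if i in sets[j]})
    return reciprocal

rng = np.random.default_rng(1)
feats = rng.random((10, 32))
dist = np.linalg.norm(feats[:, None] - feats[None, :], axis=-1)
for i, r in enumerate(k_reciprocal_neighbors(dist, k=4)):
    print(i, sorted(r))
```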

UFDA: Universal Federated Domain Adaptation with Practical Assumptions

  • paper_url: http://arxiv.org/abs/2311.15570
  • repo_url: None
  • paper_authors: Xinhui Liu, Zhenghao Chen, Luping Zhou, Dong Xu, Wei Xi, Gairui Bai, Yihan Zhao, Jizhong Zhao
  • for: This work aims to make Federated Domain Adaptation (FDA) practical for real-world settings.
  • methods: A more practical setting, Universal Federated Domain Adaptation (UFDA), is proposed: it requires only the black-box model and the label-set information of each source domain, and it can handle inconsistent label sets across source domains and a completely unknown target-domain label set.
  • results: Experiments on three benchmarks show that the method achieves comparable performance with far fewer assumptions than previous approaches, even in realistic deployments.
    Abstract Conventional Federated Domain Adaptation (FDA) approaches usually demand an abundance of assumptions, such as label set consistency, which makes them significantly less feasible for real-world situations and introduces security hazards. In this work, we propose a more practical scenario named Universal Federated Domain Adaptation (UFDA). It only requires the black-box model and the label set information of each source domain, while the label sets of different source domains could be inconsistent and the target-domain label set is totally blind. This relaxes the assumptions made by FDA, which are often challenging to meet in real-world cases and diminish model security. To address the UFDA scenario, we propose a corresponding framework called Hot-Learning with Contrastive Label Disambiguation (HCLD), which tackles UFDA's domain shifts and category gaps problem by using one-hot outputs from the black-box models of various source domains. Moreover, to better distinguish the shared and unknown classes, we further present a cluster-level strategy named Mutual-Voting Decision (MVD) to extract robust consensus knowledge across peer classes from both source and target domains. The extensive experiments on three benchmarks demonstrate that our HCLD achieves comparable performance for our UFDA scenario with much fewer assumptions, compared to the previous methodologies with many additional assumptions.
    摘要 To address the UFDA scenario, we propose a corresponding framework called Hot-Learning with Contrastive Label Disambiguation (HCLD). This framework tackles UFDA's domain shifts and category gaps problem by using one-hot outputs from the black-box models of various source domains. Additionally, to better distinguish shared and unknown classes, we propose a cluster-level strategy called Mutual-Voting Decision (MVD) to extract robust consensus knowledge across peer classes from both source and target domains.Our extensive experiments on three benchmarks demonstrate that our HCLD achieves comparable performance for the UFDA scenario with fewer assumptions, compared to previous methodologies with many additional assumptions.

Fully Authentic Visual Question Answering Dataset from Online Communities

  • paper_url: http://arxiv.org/abs/2311.15562
  • repo_url: None
  • paper_authors: Chongyan Chen, Mengchen Liu, Noel Codella, Yunsheng Li, Lu Yuan, Danna Gurari
  • for: This paper concerns visual question answering (VQA), which answers questions about images.
  • methods: Real use cases sourced from online question answering community forums form the dataset, named VQAonline.
  • results: Answers in VQAonline are found to be much longer on average (e.g., 173 words) and thus incompatible with standard VQA evaluation metrics, so the authors analyze which of six popular long-text evaluation metrics best align with human judgments.
    Abstract Visual Question Answering (VQA) entails answering questions about images. We introduce the first VQA dataset in which all contents originate from an authentic use case. Sourced from online question answering community forums, we call it VQAonline. We then characterize our dataset and how it relates to eight other VQA datasets. Observing that answers in our dataset tend to be much longer (e.g., with a mean of 173 words) and thus incompatible with standard VQA evaluation metrics, we next analyze which of the six popular metrics for longer text evaluation align best with human judgments. We then use the best-suited metrics to evaluate six state-of-the-art vision and language foundation models on VQAonline and reveal where they struggle most. We will release the dataset soon to facilitate future extensions.

ET3D: Efficient Text-to-3D Generation via Multi-View Distillation

  • paper_url: http://arxiv.org/abs/2311.15561
  • repo_url: None
  • paper_authors: Yiming Chen, Zhiqi Li, Peidong Liu
  • for: This work proposes an efficient text-to-3D generation method that can produce a 3D asset on a consumer graphics card in only about 8 ms.
  • methods: Images generated by a large pre-trained text-to-image diffusion model supervise the training of a text-conditioned 3D generative adversarial network. Once the network is trained, 3D assets are generated efficiently in a single forward pass.
  • results: The method reduces the computational burden and increases generation speed, providing an efficient approach to text-to-3D generation.
    Abstract Recent breakthroughs in text-to-image generation have shown encouraging results via large generative models. Due to the scarcity of 3D assets, it is hard to transfer the success of text-to-image generation to text-to-3D generation. Existing text-to-3D generation methods usually adopt the paradigm of DreamFusion, which conducts per-asset optimization by distilling a pretrained text-to-image diffusion model. The generation speed usually ranges from several minutes to tens of minutes per 3D asset, which degrades the user experience and also imposes a burden on service providers due to the high computational budget. In this work, we present an efficient text-to-3D generation method, which requires only around 8 $ms$ to generate a 3D asset given the text prompt on a consumer graphics card. The main insight is that we exploit the images generated by a large pre-trained text-to-image diffusion model to supervise the training of a text-conditioned 3D generative adversarial network. Once the network is trained, we are able to efficiently generate a 3D asset via a single forward pass. Our method requires no 3D training data and provides an alternative approach for efficient text-to-3D generation by distilling pre-trained image diffusion models.

PKU-I2IQA: An Image-to-Image Quality Assessment Database for AI Generated Images

  • paper_url: http://arxiv.org/abs/2311.15556
  • repo_url: https://github.com/jiquan123/i2iqa
  • paper_authors: Jiquan Yuan, Xinyan Cao, Changjin Li, Fanyi Yang, Jinlong Lin, Xixin Cao
  • for: This work aims to assess the quality of AI-generated images, providing a more comprehensive evaluation methodology.
  • methods: A well-organized subjective experiment with human observers collects high-quality image quality labels, and two benchmark models are proposed: NR-AIGCIQA (no-reference) and FR-AIGCIQA (full-reference).
  • results: The study finds that NR-AIGCIQA performs well across different image generation scenarios, while FR-AIGCIQA performs better in settings with closely matching references.
    Abstract As image generation technology advances, AI-based image generation has been applied in various fields and Artificial Intelligence Generated Content (AIGC) has garnered widespread attention. However, the development of AI-based image generative models also brings new problems and challenges. A significant challenge is that AI-generated images (AIGI) may exhibit unique distortions compared to natural images, and not all generated images meet the requirements of the real world. Therefore, it is of great significance to evaluate AIGIs more comprehensively. Although previous work has established several human perception-based AIGC image quality assessment (AIGCIQA) databases for text-generated images, the AI image generation technology includes scenarios like text-to-image and image-to-image, and assessing only the images generated by text-to-image models is insufficient. To address this issue, we establish a human perception-based image-to-image AIGCIQA database, named PKU-I2IQA. We conduct a well-organized subjective experiment to collect quality labels for AIGIs and then conduct a comprehensive analysis of the PKU-I2IQA database. Furthermore, we have proposed two benchmark models: NR-AIGCIQA based on the no-reference image quality assessment method and FR-AIGCIQA based on the full-reference image quality assessment method. Finally, leveraging this database, we conduct benchmark experiments and compare the performance of the proposed benchmark models. The PKU-I2IQA database and benchmarks will be released to facilitate future research on \url{https://github.com/jiquan123/I2IQA}.

Dataset Distillation in Latent Space

  • paper_url: http://arxiv.org/abs/2311.15547
  • repo_url: None
  • paper_authors: Yuxuan Duan, Jianfu Zhang, Liqing Zhang
  • for: Reducing the computational burden of training models while maintaining performance on downstream tasks.
  • methods: The dataset distillation (DD) process is moved from pixel space to latent space, using a pretrained generic autoencoder to encode the original images into compact latent codes.
  • results: Running DD algorithms on the compact latent codes greatly reduces time and space consumption while preserving fidelity to the original data, and enables targeting greater data ratios and higher-resolution datasets.
    Abstract Dataset distillation (DD) is a newly emerging research area aiming at alleviating the heavy computational load of training models on large datasets. It tries to distill a large dataset into a small and condensed one so that models trained on the distilled dataset can perform comparably with those trained on the full dataset when performing downstream tasks. Among the previous works in this area, there are three key problems that hinder the performance and availability of the existing DD methods: high time complexity, high space complexity, and low info-compactness. In this work, we simultaneously attempt to settle these three problems by moving the DD processes from the conventionally used pixel space to latent space. Encoded by a pretrained generic autoencoder, latent codes in the latent space are naturally info-compact representations of the original images at much smaller sizes. After transferring three mainstream DD algorithms to latent space, we significantly reduce time and space consumption while achieving similar performance, allowing us to distill high-resolution datasets or target greater data ratios at which previous methods have failed. Besides, within the same storage budget, we can also quantitatively deliver more latent codes than pixel-level images, which further boosts the performance of our methods.
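The enabling step is encoding images once with a pretrained autoencoder so that any DD algorithm can run on the much smaller latent codes. A sketch using diffusers' AutoencoderKL; the checkpoint name is just one public choice, not necessarily the autoencoder used in the paper:

```python
import torch
from diffusers import AutoencoderKL

# Pretrained generic autoencoder (illustrative public checkpoint).
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

@torch.no_grad()
def to_latents(images: torch.Tensor) -> torch.Tensor:
    """images: (N, 3, H, W) in [-1, 1]. Returns (N, 4, H/8, W/8) latent codes,
    the compact space in which the distillation algorithm is then run."""
    return vae.encode(images).latent_dist.sample() * vae.config.scaling_factor

@torch.no_grad()
def to_images(latents: torch.Tensor) -> torch.Tensor:
    """Decode distilled latents back to pixels for training downstream models."""
    return vae.decode(latents / vae.config.scaling_factor).sample

x = torch.rand(2, 3, 256, 256) * 2 - 1
z = to_latents(x)
print(z.shape, "vs", x.shape)   # 48x fewer elements per image
```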

Beyond Pixels: Exploring Human-Readable SVG Generation for Simple Images with Vision Language Models

  • paper_url: http://arxiv.org/abs/2311.15543
  • repo_url: None
  • paper_authors: Tong Zhang, Haoyang Liu, Peiyan Zhang, Yuxuan Cheng, Haohan Wang
  • for: This work proposes a method to quickly generate concise, human-readable Scalable Vector Graphics (SVG) images, meeting a need in computer graphics.
  • methods: A method named Simple-SVG-Generation (S\textsuperscript{2}VG\textsuperscript{2}) generates concise and readable SVG images while preserving the relational properties and context of the original image.
  • results: Evaluations on simple-image reasoning tasks with advanced language models show a clear improvement over previous SVG generation methods, and a human-evaluation survey confirms that the method yields more concise and readable SVGs.
    Abstract In the field of computer graphics, the use of vector graphics, particularly Scalable Vector Graphics (SVG), represents a notable development from traditional pixel-based imagery. SVGs, with their XML-based format, are distinct in their ability to directly and explicitly represent visual elements such as shape, color, and path. This direct representation facilitates a more accurate and logical depiction of graphical elements, enhancing reasoning and interpretability. Recognizing the potential of SVGs, the machine learning community has introduced multiple methods for image vectorization. However, transforming images into SVG format while retaining the relational properties and context of the original scene remains a key challenge. Most vectorization methods often yield SVGs that are overly complex and not easily interpretable. In response to this challenge, we introduce our method, Simple-SVG-Generation (S\textsuperscript{2}VG\textsuperscript{2}). Our method focuses on producing SVGs that are both accurate and simple, aligning with human readability and understanding. With simple images, we evaluate our method with reasoning tasks together with advanced language models, the results show a clear improvement over previous SVG generation methods. We also conducted surveys for human evaluation on the readability of our generated SVGs, the results also favor our methods.

EAFP-Med: An Efficient Adaptive Feature Processing Module Based on Prompts for Medical Image Detection

  • paper_url: http://arxiv.org/abs/2311.15540
  • repo_url: None
  • paper_authors: Xiang Li, Long Lan, Husam Lahza, Shaowu Yang, Shuihua Wang, Wenjing Yang, Hengzhu Liu, Yudong Zhang
  • for: This paper tackles the differences in lesion representations across medical imaging technologies, improving the efficiency and accuracy of medical image detection.
  • methods: Drawing on large language models, EAFP-Med is a prompt-based medical image detection module that can quickly and efficiently extract lesion features at different scales and integrate with different imaging techniques.
  • results: The EAFP-Med ST model achieves the best results on all three datasets (chest X-ray, cranial MRI, and skin images), outperforming the nine compared methods.
    Abstract In the face of rapid advances in medical imaging, cross-domain adaptive medical image detection is challenging due to the differences in lesion representations across various medical imaging technologies. To address this issue, we draw inspiration from large language models to propose EAFP-Med, an efficient adaptive feature processing module based on prompts for medical image detection. EAFP-Med can efficiently extract lesion features of different scales from a diverse range of medical images based on prompts while being flexible and not limited by specific imaging techniques. Furthermore, it serves as a feature preprocessing module that can be connected to any model front-end to enhance the lesion features in input images. Moreover, we propose a novel adaptive disease detection model named EAFP-Med ST, which utilizes the Swin Transformer V2 - Tiny (SwinV2-T) as its backbone and connects it to EAFP-Med. We have compared our method to nine state-of-the-art methods. Experimental results demonstrate that EAFP-Med ST achieves the best performance on all three datasets (chest X-ray images, cranial magnetic resonance imaging images, and skin images). EAFP-Med can efficiently extract lesion features from various medical images based on prompts, enhancing the model's performance. This holds significant potential for improving medical image analysis and diagnosis.

SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2311.15537
  • repo_url: https://github.com/xb534/sed
  • paper_authors: Bin Xie, Jiale Cao, Jin Xie, Fahad Shahbaz Khan, Yanwei Pang
  • for: This paper proposes a simple encoder-decoder model for open-vocabulary semantic segmentation.
  • methods: The model uses a hierarchical encoder to generate a pixel-level image-text cost map, and a gradual fusion decoder to combine the cost map with backbone feature maps from different levels.
  • results: Across several open-vocabulary semantic segmentation datasets the method performs strongly, reaching an mIoU of 31.6% on ADE20K at 82 ms per image on a single A6000.
    Abstract Open-vocabulary semantic segmentation strives to distinguish pixels into different semantic groups from an open set of categories. Most existing methods explore utilizing pre-trained vision-language models, in which the key is to adopt the image-level model for the pixel-level segmentation task. In this paper, we propose a simple encoder-decoder, named SED, for open-vocabulary semantic segmentation, which comprises a hierarchical encoder-based cost map generation and a gradual fusion decoder with category early rejection. The hierarchical encoder-based cost map generation employs a hierarchical backbone, instead of a plain transformer, to predict the pixel-level image-text cost map. Compared to a plain transformer, a hierarchical backbone better captures local spatial information and has linear computational complexity with respect to input size. Our gradual fusion decoder employs a top-down structure to combine the cost map and the feature maps of different backbone levels for segmentation. To accelerate inference speed, we introduce a category early rejection scheme in the decoder that rejects many non-existing categories at the early layers of the decoder, resulting in at most 4.7 times acceleration without accuracy degradation. Experiments are performed on multiple open-vocabulary semantic segmentation datasets, which demonstrates the efficacy of our SED method. When using ConvNeXt-B, our SED method achieves an mIoU score of 31.6% on ADE20K with 150 categories at 82 milliseconds ($ms$) per image on a single A6000. We will release it at \url{https://github.com/xb534/SED.git}.
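Category early rejection can be pictured as pruning the per-category cost volume after an early, cheap decoder layer so later layers see far fewer classes. A hedged PyTorch sketch of that idea (the max-response scoring rule is our simplification, not necessarily SED's exact criterion):

```python
import torch

def early_reject(cost_map: torch.Tensor, keep: int = 20):
    """cost_map: (B, K, H, W) image-text costs at an early decoder layer over
    K candidate categories. Keep only the `keep` most promising categories,
    so later (more expensive) decoder layers process far fewer classes."""
    scores = cost_map.amax(dim=(2, 3))               # best response per category
    idx = scores.topk(keep, dim=1).indices           # (B, keep)
    pruned = torch.gather(
        cost_map, 1, idx[:, :, None, None].expand(-1, -1, *cost_map.shape[2:]))
    return pruned, idx

cost = torch.randn(1, 150, 32, 32)                   # e.g., ADE20K's 150 classes
pruned, kept = early_reject(cost, keep=20)
print(pruned.shape, kept.shape)  # torch.Size([1, 20, 32, 32]) torch.Size([1, 20])
```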

SVRDA: A Web-based Dataset Annotation Tool for Slice-to-Volume Registration

  • paper_url: http://arxiv.org/abs/2311.15536
  • repo_url: https://github.com/roldbach/svrda
  • paper_authors: Weixun Luo, Alexandre Triay Bagur, Paul Aljabar, George Ralli, Sir Michael Brady
  • for: SVRDA is designed to facilitate the annotation of benchmark datasets for slice-to-volume registration.
  • methods: SVRDA is a web-based application that supports platform-agnostic collaboration and efficient transformation manipulation via keyboard shortcuts, with automatic saving, configuration-based data loading, and separation of concerns for flexibility and extensibility.
  • results: Its effectiveness was validated indirectly through post-registration segmentation quality on UK Biobank data, showing a significant improvement in Dice Similarity Coefficient and 95th percentile Hausdorff distance. SVRDA was also integrated into test-retest T1 quantification on in-house magnetic resonance images, leading to more consistent results after registration.
    Abstract Background and Objective: The lack of benchmark datasets has impeded the development of slice-to-volume registration algorithms. Such datasets are difficult to annotate, primarily due to the dimensional difference within data and the dearth of task-specific software. We aim to develop a user-friendly tool to streamline dataset annotation for slice-to-volume registration. Methods: The proposed tool, named SVRDA, is an installation-free web application for platform-agnostic collaborative dataset annotation. It enables efficient transformation manipulation via keyboard shortcuts and smooth case transitions with auto-saving. SVRDA supports configuration-based data loading and adheres to the separation of concerns, offering great flexibility and extensibility for future research. Various supplementary features have been implemented to facilitate slice-to-volume registration. Results: We validated the effectiveness of SVRDA by indirectly evaluating the post-registration segmentation quality on UK Biobank data, observing a dramatic overall improvement (24.02% in the Dice Similarity Coefficient and 48.93% in the 95th percentile Hausdorff distance, respectively) supported by highly statistically significant evidence ($p<0.001$).We further showcased the clinical usage of SVRDA by integrating it into test-retest T1 quantification on in-house magnetic resonance images, leading to more consistent results after registration. Conclusions: SVRDA can facilitate collaborative annotation of benchmark datasets while being potentially applicable to other pipelines incorporating slice-to-volume registration. Full source code and documentation are available at https://github.com/Roldbach/SVRDA

Efficient Dataset Distillation via Minimax Diffusion

  • paper_url: http://arxiv.org/abs/2311.15529
  • repo_url: https://github.com/vimar-gu/minimaxdiffusion
  • paper_authors: Jianyang Gu, Saeed Vahidian, Vyacheslav Kungurtsev, Haonan Wang, Wei Jiang, Yang You, Yiran Chen
  • for: Reducing the storage and computation of training neural networks by generating a small surrogate dataset that captures the rich information of the original large-scale one.
  • methods: Generative diffusion techniques are incorporated to compute the surrogate dataset, with minimax criteria added to the generative training to improve the representativeness and diversity of the generated images.
  • results: Achieves state-of-the-art validation performance on ImageWoof while requiring much less computation than previous methods.
    Abstract Dataset distillation reduces the storage and computational consumption of training a network by generating a small surrogate dataset that encapsulates rich information of the original large-scale one. However, previous distillation methods heavily rely on the sample-wise iterative optimization scheme. As the images-per-class (IPC) setting or image resolution grows larger, the necessary computation will demand overwhelming time and resources. In this work, we intend to incorporate generative diffusion techniques for computing the surrogate dataset. Observing that key factors for constructing an effective surrogate dataset are representativeness and diversity, we design additional minimax criteria in the generative training to enhance these facets for the generated images of diffusion models. We present a theoretical model of the process as hierarchical diffusion control demonstrating the flexibility of the diffusion process to target these criteria without jeopardizing the faithfulness of the sample to the desired distribution. The proposed method achieves state-of-the-art validation performance while demanding much less computational resources. Under the 100-IPC setting on ImageWoof, our method requires less than one-twentieth the distillation time of previous methods, yet yields even better performance. Source code available in https://github.com/vimar-gu/MinimaxDiffusion.
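The paper's minimax criteria attach to diffusion training; as a rough intuition pump only, the toy loss below expresses the two goals on embedded samples: generated points should cover the real distribution (representativeness) while staying spread apart (diversity). The names and formulation are ours, not the paper's:

```python
import torch

def minimax_criteria(gen: torch.Tensor, real: torch.Tensor) -> torch.Tensor:
    """Toy surrogate: gen (M, D) embeddings of generated images, real (N, D).
    Representativeness: every real point should lie near some generated point.
    Diversity: generated points should not collapse onto each other."""
    repr_loss = torch.cdist(real, gen).min(dim=1).values.mean()
    d_gg = torch.cdist(gen, gen).masked_fill(
        torch.eye(gen.size(0), dtype=torch.bool), float("inf"))
    diversity_loss = -d_gg.min(dim=1).values.mean()    # push nearest pairs apart
    return repr_loss + diversity_loss

gen = torch.randn(16, 128, requires_grad=True)
real = torch.randn(256, 128)
loss = minimax_criteria(gen, real)
loss.backward()                                        # gradients flow to `gen`
print(loss.item(), gen.grad.shape)
```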

Fine-grained Appearance Transfer with Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.16513
  • repo_url: https://github.com/babahui/fine-grained-appearance-transfer
  • paper_authors: Yuteng Ye, Guanwen Li, Hang Zhou, Cai Jiale, Junqing Yu, Yawei Luo, Zikai Song, Qilong Xing, Youjia Zhang, Wei Yang
  • for: This paper studies image-to-image translation (I2I) and appearance transfer, which alter the visual appearance between images while maintaining structural coherence.
  • methods: A new framework integrates semantic matching, appearance transfer, and latent deviation; a key element is the strategic use of the $x_0$ space predicted by diffusion models within the latent space, which helps preserve fine-grained structural details.
  • results: Extensive experiments across a wide range of categories and domains show that the method handles fine-grained appearance transfer effectively. Code is available at https://github.com/babahui/Fine-grained-Appearance-Transfer.
    Abstract Image-to-image translation (I2I), and particularly its subfield of appearance transfer, which seeks to alter the visual appearance between images while maintaining structural coherence, presents formidable challenges. Despite significant advancements brought by diffusion models, achieving fine-grained transfer remains complex, particularly in terms of retaining detailed structural elements and ensuring information fidelity. This paper proposes an innovative framework designed to surmount these challenges by integrating various aspects of semantic matching, appearance transfer, and latent deviation. A pivotal aspect of our approach is the strategic use of the predicted $x_0$ space by diffusion models within the latent space of diffusion processes. This is identified as a crucial element for the precise and natural transfer of fine-grained details. Our framework exploits this space to accomplish semantic alignment between source and target images, facilitating mask-wise appearance transfer for improved feature acquisition. A significant advancement of our method is the seamless integration of these features into the latent space, enabling more nuanced latent deviations without necessitating extensive model retraining or fine-tuning. The effectiveness of our approach is demonstrated through extensive experiments, which showcase its ability to adeptly handle fine-grained appearance transfers across a wide range of categories and domains. We provide our code at https://github.com/babahui/Fine-grained-Appearance-Transfer

Sparse Pedestrian Character Learning for Trajectory Prediction

  • paper_url: http://arxiv.org/abs/2311.15512
  • repo_url: None
  • paper_authors: Yonghao Dong, Le Wang, Sanpin Zhou, Gang Hua, Changyin Sun
  • for: Pedestrian trajectory prediction, an important task in autonomous driving.
  • methods: Pedestrian character information, including action and appearance, is used to improve the learned trajectory embedding and reach previously unattained performance.
  • results: A two-stream sparse-character-based network (TSNet) is proposed that removes negative character information from the sparse character representation to improve the trajectory embedding. Extensive experiments show that the method excels on two first-person view datasets, surpassing existing state-of-the-art methods.
    Abstract Pedestrian trajectory prediction in a first-person view has recently attracted much attention due to its importance in autonomous driving. Recent work utilizes pedestrian character information, \textit{i.e.}, action and appearance, to improve the learned trajectory embedding and achieves state-of-the-art performance. However, it neglects the invalid and negative pedestrian character information, which is harmful to trajectory representation and thus leads to performance degradation. To address this issue, we present a two-stream sparse-character-based network~(TSNet) for pedestrian trajectory prediction. Specifically, TSNet learns the negative-removed characters in the sparse character representation stream to improve the trajectory embedding obtained in the trajectory representation stream. Moreover, to model the negative-removed characters, we propose a novel sparse character graph, including the sparse category and sparse temporal character graphs, to learn the different effects of various characters in category and temporal dimensions, respectively. Extensive experiments on two first-person view datasets, PIE and JAAD, show that our method outperforms existing state-of-the-art methods. In addition, ablation studies demonstrate different effects of various characters and prove that TSNet outperforms approaches without eliminating negative characters.

CaesarNeRF: Calibrated Semantic Representation for Few-shot Generalizable Neural Rendering

  • paper_url: http://arxiv.org/abs/2311.15510
  • repo_url: https://github.com/haidongz-usc/CaesarNeRF
  • paper_authors: Haidong Zhu, Tianyu Ding, Tianyi Chen, Ilya Zharkov, Ram Nevatia, Luming Liang
  • for: Improving the generalizability and few-shot learning ability of NeRF models while preserving high-quality detail in rendering.
  • methods: Introduces a CAlibratEd SemAntic Representation used together with pixel-level representations to advance few-shot, generalizable neural rendering.
  • results: Extensive experiments on public datasets show that CaesarNeRF achieves state-of-the-art performance across varying numbers of reference views while capturing varying levels of detail.
    Abstract Generalizability and few-shot learning are key challenges in Neural Radiance Fields (NeRF), often due to the lack of a holistic understanding in pixel-level rendering. We introduce CaesarNeRF, an end-to-end approach that leverages scene-level CAlibratEd SemAntic Representation along with pixel-level representations to advance few-shot, generalizable neural rendering, facilitating a holistic understanding without compromising high-quality details. CaesarNeRF explicitly models pose differences of reference views to combine scene-level semantic representations, providing a calibrated holistic understanding. This calibration process aligns various viewpoints with precise location and is further enhanced by sequential refinement to capture varying details. Extensive experiments on public datasets, including LLFF, Shiny, mip-NeRF 360, and MVImgNet, show that CaesarNeRF delivers state-of-the-art performance across varying numbers of reference views, proving effective even with a single reference image. The project page of this work can be found at https://haidongz-usc.github.io/project/caesarnerf.

Class-Adaptive Sampling Policy for Efficient Continual Learning

  • paper_url: http://arxiv.org/abs/2311.16485
  • repo_url: https://github.com/hossein-rezaei624/casp
  • paper_authors: Hossein Rezaei, Mohammad Sabokrou
  • for: Improving the efficiency and reusability of continual learning (CL); addresses the inability of buffer-based methods to allocate storage dynamically.
  • methods: Proposes the Class-Adaptive Sampling Policy (CASP), which dynamically allocates buffer space according to class contribution and difficulty, improving knowledge retention and mitigating forgetting.
  • results: CASP substantially improves the efficiency and reusability of CL across diverse learning tasks and complex learning scenarios.
    Abstract Continual learning (CL) aims to acquire new knowledge while preserving information from previous experiences without forgetting. Though buffer-based methods (i.e., retaining samples from previous tasks) have achieved acceptable performance, determining how to allocate the buffer remains a critical challenge. Most recent research focuses on refining these methods but often fails to sufficiently consider the varying influence of samples on the learning process, and frequently overlooks the complexity of the classes/concepts being learned. Generally, these methods do not directly take into account the contribution of individual classes. However, our investigation indicates that more challenging classes necessitate preserving a larger number of samples compared to less challenging ones. To address this issue, we propose a novel method and policy named 'Class-Adaptive Sampling Policy' (CASP), which dynamically allocates storage space within the buffer. By utilizing concepts of class contribution and difficulty, CASP adaptively manages buffer space, allowing certain classes to occupy a larger portion of the buffer while reducing storage for others. This approach significantly improves the efficiency of knowledge retention and utilization. CASP provides a versatile solution to boost the performance and efficiency of CL. It meets the demand for dynamic buffer allocation, accommodating the varying contributions of different classes and their learning complexities over time.
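To make the allocation idea concrete, here is a toy sketch under the assumption that difficulty is summarized by a per-class scalar (e.g., per-class validation error); this is illustrative, not the paper's exact policy:

```python
def allocate_buffer(class_difficulty: dict, buffer_size: int) -> dict:
    """Give each class a share of the replay buffer proportional to its
    estimated difficulty score. Rounding may over- or under-shoot the
    total by a few slots."""
    total = sum(class_difficulty.values())
    return {c: max(1, round(buffer_size * d / total))
            for c, d in class_difficulty.items()}

# Harder classes receive more replay samples:
print(allocate_buffer({"cat": 0.4, "dog": 0.1, "truck": 0.5}, buffer_size=100))
# -> {'cat': 40, 'dog': 10, 'truck': 50}
```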

AerialBooth: Mutual Information Guidance for Text Controlled Aerial View Synthesis from a Single Image

  • paper_url: http://arxiv.org/abs/2311.15478
  • repo_url: None
  • paper_authors: Divya Kothandaraman, Tianyi Zhou, Ming Lin, Dinesh Manocha
  • for: Proposes AerialBooth, a novel method for synthesizing an aerial view from a single input image using its text description.
  • methods: Uses a pretrained text-to-2D-image stable diffusion model as prior knowledge of the 3D world, with two-step finetuning that optimizes the text embedding and the UNet to reconstruct the input image and its inverse perspective mapping.
  • results: Extensive experiments and ablations on natural scenes, indoor scenes, human actions, and more demonstrate AerialBooth's effectiveness and its generalizability to other text-controlled views. AerialBooth also achieves the best viewpoint-fidelity trade-off on 7 evaluation metrics. Code and data are available on GitHub.
    Abstract We present a novel method, AerialBooth, for synthesizing the aerial view from a single input image using its text description. We leverage the pretrained text-to-2D image stable diffusion model as prior knowledge of the 3D world. The model is finetuned in two steps to optimize for the text embedding and the UNet that reconstruct the input image and its inverse perspective mapping respectively. The inverse perspective mapping creates variance within the text-image space of the diffusion model, while providing weak guidance for aerial view synthesis. At inference, we steer the contents of the generated image towards the input image using novel mutual information guidance that maximizes the information content between the probability distributions of the two images. We evaluate our approach on a wide spectrum of real and synthetic data, including natural scenes, indoor scenes, human action, etc. Through extensive experiments and ablation studies, we demonstrate the effectiveness of AerialBooth and also its generalizability to other text-controlled views. We also show that AerialBooth achieves the best viewpoint-fidelity trade-off though quantitative evaluation on 7 metrics analyzing viewpoint and fidelity w.r.t. input image. Code and data is available at https://github.com/divyakraman/AerialBooth2023.
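For intuition, the quantity the guidance maximizes can be approximated by a simple, non-differentiable histogram estimate of mutual information between two images; the paper's guidance instead operates on probability distributions inside the diffusion process:

```python
import numpy as np

def mutual_information(img_a: np.ndarray, img_b: np.ndarray, bins: int = 32) -> float:
    """Histogram-based MI estimate between two grayscale images."""
    joint, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal over rows
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal over columns
    nz = p_xy > 0
    return float((p_xy[nz] * np.log(p_xy[nz] / (p_x @ p_y)[nz])).sum())
```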

DreamCreature: Crafting Photorealistic Virtual Creatures from Imagination

  • paper_url: http://arxiv.org/abs/2311.15477
  • repo_url: https://github.com/kamwoh/dreamcreature
  • paper_authors: Kam Woh Ng, Xiatian Zhu, Yi-Zhe Song, Tao Xiang
  • for: Develops a text-to-image generation model capable of creating new, fine-grained creature concepts (e.g., virtual dog or bird species), valuable for digital asset creation and biodiversity analysis.
  • methods: Proposes DreamCreature, which identifies and extracts the underlying sub-concepts (e.g., body parts of a specific species) from images in an unsupervised manner, then composes them to generate novel hybrid concepts.
  • results: Experiments show that DreamCreature outperforms prior methods in generating new, fine-grained creature concepts with faithful structures and photorealistic appearance across diverse backgrounds and contexts.
    Abstract Recent text-to-image (T2I) generative models allow for high-quality synthesis following either text instructions or visual examples. Despite their capabilities, these models face limitations in creating new, detailed creatures within specific categories (e.g., virtual dog or bird species), which are valuable in digital asset creation and biodiversity analysis. To bridge this gap, we introduce a novel task, Virtual Creatures Generation: Given a set of unlabeled images of the target concepts (e.g., 200 bird species), we aim to train a T2I model capable of creating new, hybrid concepts within diverse backgrounds and contexts. We propose a new method called DreamCreature, which identifies and extracts the underlying sub-concepts (e.g., body parts of a specific species) in an unsupervised manner. The T2I thus adapts to generate novel concepts (e.g., new bird species) with faithful structures and photorealistic appearance by seamlessly and flexibly composing learned sub-concepts. To enhance sub-concept fidelity and disentanglement, we extend the textual inversion technique by incorporating an additional projector and tailored attention loss regularization. Extensive experiments on two fine-grained image benchmarks demonstrate the superiority of DreamCreature over prior methods in both qualitative and quantitative evaluation. Ultimately, the learned sub-concepts facilitate diverse creative applications, including innovative consumer product designs and nuanced property modifications.

MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers

  • paper_url: http://arxiv.org/abs/2311.15475
  • repo_url: https://github.com/nihalsid/mesh-gpt
  • paper_authors: Yawar Siddiqui, Antonio Alliegro, Alexey Artemov, Tatiana Tommasi, Daniele Sirigatti, Vladislav Rosov, Angela Dai, Matthias Nießner
  • for: Develops a language-model-style approach to triangle mesh generation that produces compact, refined meshes.
  • methods: A sequence-based approach: graph convolutions learn a vocabulary of latent quantized embeddings informed by local mesh geometry and topology, and a transformer predicts the index of the next embedding.
  • results: Directly generates new triangle meshes, outperforming state-of-the-art mesh generation methods with a 9% increase in shape coverage and a 30-point FID improvement.
    Abstract We introduce MeshGPT, a new approach for generating triangle meshes that reflects the compactness typical of artist-created meshes, in contrast to dense triangle meshes extracted by iso-surfacing methods from neural fields. Inspired by recent advances in powerful large language models, we adopt a sequence-based approach to autoregressively generate triangle meshes as sequences of triangles. We first learn a vocabulary of latent quantized embeddings, using graph convolutions, which inform these embeddings of the local mesh geometry and topology. These embeddings are sequenced and decoded into triangles by a decoder, ensuring that they can effectively reconstruct the mesh. A transformer is then trained on this learned vocabulary to predict the index of the next embedding given previous embeddings. Once trained, our model can be autoregressively sampled to generate new triangle meshes, directly generating compact meshes with sharp edges, more closely imitating the efficient triangulation patterns of human-crafted meshes. MeshGPT demonstrates a notable improvement over state of the art mesh generation methods, with a 9% increase in shape coverage and a 30-point enhancement in FID scores across various categories.
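The decoding stage is plain autoregressive sampling over the learned vocabulary. A generic sketch, assuming a `transformer` callable that returns next-token logits and placeholder BOS/EOS token ids:

```python
import torch

@torch.no_grad()
def sample_token_sequence(transformer, max_len: int = 512,
                          bos: int = 0, eos: int = 1) -> torch.Tensor:
    """Autoregressive decoding: repeatedly sample the index of the next
    quantized embedding. `transformer(seq)` is assumed to return logits
    of shape (batch, seq_len, vocab_size)."""
    seq = torch.tensor([[bos]])
    for _ in range(max_len):
        logits = transformer(seq)[:, -1, :]             # next-token logits
        nxt = torch.multinomial(logits.softmax(dim=-1), num_samples=1)
        seq = torch.cat([seq, nxt], dim=1)
        if nxt.item() == eos:
            break
    return seq  # a separate decoder maps these indices back to triangles
```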

Where to Begin? From Random to Foundation Model Instructed Initialization in Federated Learning for Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2311.15463
  • repo_url: None
  • paper_authors: Ming Li, Guang Yang
  • for: Explores Federated Learning (FL) for medical image analysis and evaluates the impact of foundation-model-instructed initialization.
  • methods: Uses the Segment Anything Model (SAM) foundation model as an instructive teacher for initializing FL models in a medical image segmentation task.
  • results: Experiments on chest X-ray lung segmentation show that FL with foundation-model-instructed initialization achieves faster convergence and improved performance in non-IID (non-independently and identically distributed) data scenarios.
    Abstract In medical image analysis, Federated Learning (FL) stands out as a key technology that enables privacy-preserved, decentralized data processing, crucial for handling sensitive medical data. Currently, most FL models employ random initialization, which has been proven effective in various instances. However, given the unique challenges posed by non-IID (independently and identically distributed) data in FL, we propose a novel perspective: exploring the impact of using the foundation model with enormous pre-trained knowledge, such as the Segment Anything Model (SAM), as an instructive teacher for FL model initialization in medical image segmentation task. This work for the first time attempts to utilize the foundation model as an instructive teacher for initialization in FL, assessing its impact on the performance of FL models, especially in non-IID data scenarios. Our empirical evaluation on chest x-ray lung segmentation showcases that FL with foundation model instructed initialization not only achieves faster convergence but also improves performance in complex data contexts. These findings offer a new perspective for model initialization in FL.
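Conceptually, only the starting point of federated training changes: clients begin from foundation-model weights rather than a random init, while aggregation (e.g., FedAvg) is unchanged. A minimal sketch under these assumptions (dtype handling of non-float buffers is glossed over):

```python
import copy

def init_clients_from_foundation(foundation_model, num_clients: int):
    """Every client starts from the pretrained (e.g., SAM-derived) weights
    instead of a random initialization."""
    return [copy.deepcopy(foundation_model) for _ in range(num_clients)]

def fedavg(global_model, client_models, client_sizes):
    """Standard FedAvg aggregation, weighted by client dataset sizes; only
    the initialization above differs from vanilla federated training."""
    total = float(sum(client_sizes))
    avg = {k: sum((n / total) * m.state_dict()[k].float()
                  for m, n in zip(client_models, client_sizes))
           for k in global_model.state_dict()}
    global_model.load_state_dict(avg)
    return global_model
```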

cs.AI - 2023-11-27

Improving Denoising Diffusion Probabilistic Models via Exploiting Shared Representations

  • paper_url: http://arxiv.org/abs/2311.16353
  • repo_url: None
  • paper_authors: Delaram Pirhayatifard, Mohammad Taha Toghani, Guha Balakrishnan, César A. Uribe
  • for: Multi-task image generation with limited data.
  • methods: Leverages representation-learning techniques with a core meta architecture of shared parameters and task-specific layers.
  • results: Outperforms both unconditional and conditional DDPM on standard image datasets in terms of FID and SSIM metrics.
    Abstract In this work, we address the challenge of multi-task image generation with limited data for denoising diffusion probabilistic models (DDPM), a class of generative models that produce high-quality images by reversing a noisy diffusion process. We propose a novel method, SR-DDPM, that leverages representation-based techniques from few-shot learning to effectively learn from fewer samples across different tasks. Our method consists of a core meta architecture with shared parameters, i.e., task-specific layers with exclusive parameters. By exploiting the similarity between diverse data distributions, our method can scale to multiple tasks without compromising the image quality. We evaluate our method on standard image datasets and show that it outperforms both unconditional and conditional DDPM in terms of FID and SSIM metrics.

Compositional Chain-of-Thought Prompting for Large Multimodal Models

  • paper_url: http://arxiv.org/abs/2311.17076
  • repo_url: None
  • paper_authors: Chancharik Mitra, Brandon Huang, Trevor Darrell, Roei Herzig
  • for: Improving performance on multimodal tasks, particularly compositional reasoning in vision-and-language tasks.
  • methods: Uses a scene graph (SG) to extract compositional knowledge from a large multimodal model (LMM), then applies the zero-shot Compositional Chain-of-Thought (CCoT) prompting method to guide the LMM's response.
  • results: CCoT improves LMM performance on multimodal benchmarks without requiring annotated scene graphs or finetuning.
    Abstract The combination of strong visual backbones and Large Language Model (LLM) reasoning has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range of vision and language (VL) tasks. However, recent research has shown that even the most advanced LMMs still struggle to capture aspects of compositional visual reasoning, such as attributes and relationships between objects. One solution is to utilize scene graphs (SGs)--a formalization of objects and their relations and attributes that has been extensively used as a bridge between the visual and textual domains. Yet, scene graph data requires scene graph annotations, which are expensive to collect and thus not easily scalable. Moreover, finetuning an LMM based on SG data can lead to catastrophic forgetting of the pretraining objective. To overcome this, inspired by chain-of-thought methods, we propose Compositional Chain-of-Thought (CCoT), a novel zero-shot Chain-of-Thought prompting method that utilizes SG representations in order to extract compositional knowledge from an LMM. Specifically, we first generate an SG using the LMM, and then use that SG in the prompt to produce a response. Through extensive experiments, we find that the proposed CCoT approach not only improves LMM performance on several vision and language VL compositional benchmarks but also improves the performance of several popular LMMs on general multimodal benchmarks, without the need for fine-tuning or annotated ground-truth SGs.
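Because CCoT is training-free, the whole method reduces to two prompting steps. A sketch with an assumed `lmm_generate(image, prompt) -> text` interface; the paper's exact prompts differ:

```python
def compositional_cot(lmm_generate, image, question: str) -> str:
    """Two-step CCoT-style prompting: (1) ask the LMM for a scene graph,
    (2) include that scene graph in the answer prompt."""
    sg_prompt = ("For the provided image, generate a scene graph in JSON "
                 "listing objects, their attributes, and relationships.")
    scene_graph = lmm_generate(image, sg_prompt)
    answer_prompt = (f"Scene graph: {scene_graph}\n"
                     f"Using the image and the scene graph, answer: {question}")
    return lmm_generate(image, answer_prompt)
```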

Reward Shaping for Improved Learning in Real-time Strategy Game Play

  • paper_url: http://arxiv.org/abs/2311.16339
  • repo_url: None
  • paper_authors: John Kliem, Prithviraj Dasgupta
  • for: Investigates reward shaping for improving reinforcement learning performance in a real-time strategy, capture-the-flag game.
  • methods: Applies appropriately designed reward shaping functions to different game events.
  • results: Experiments show that reward shaping is an effective means to understand the importance of different sub-tasks toward winning the game, to encode secondary objective functions such as energy efficiency into a player's behavior, and to improve learned policies against opponents of different skill levels.
    Abstract We investigate the effect of reward shaping in improving the performance of reinforcement learning in the context of the real-time strategy, capture-the-flag game. The game is characterized by sparse rewards that are associated with infrequently occurring events such as grabbing or capturing the flag, or tagging the opposing player. We show that appropriately designed reward shaping functions applied to different game events can significantly improve the player's performance and training times of the player's learning algorithm. We have validated our reward shaping functions within a simulated environment for playing a marine capture-the-flag game between two players. Our experimental results demonstrate that reward shaping can be used as an effective means to understand the importance of different sub-tasks during game-play towards winning the game, to encode a secondary objective functions such as energy efficiency into a player's game-playing behavior, and, to improve learning generalizable policies that can perform well against different skill levels of the opponent.
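As a toy illustration of event-based shaping (event names and bonus weights are assumed, not taken from the paper), sparse game events simply add bonuses on top of the base reward:

```python
def shaped_reward(base_reward: float, event: str) -> float:
    """Add small bonuses for infrequent capture-the-flag events so the
    agent receives denser feedback than the sparse base reward alone."""
    bonuses = {"grab_flag": 0.5, "capture_flag": 1.0, "tag_opponent": 0.3}
    return base_reward + bonuses.get(event, 0.0)
```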

Releasing the CRaQAn (Coreference Resolution in Question-Answering): An open-source dataset and dataset creation methodology using instruction-following models

  • paper_url: http://arxiv.org/abs/2311.16338
  • repo_url: None
  • paper_authors: Rob Grzywinski, Joshua D’Arcy, Rob Naidoff, Ashish Shukla, Alex Browne, Ren Gibbons, Brinnae Bent
  • for: Improving information retrieval for question-answering applications, particularly with respect to coreference resolution.
  • methods: Uses an instruction-following model (GPT-4) and a Recursive Criticism and Improvement Loop to create a high-quality dataset.
  • results: Releases over 250 question-answer pairs containing coreferences, supporting further research on coreference resolution in question answering.
    Abstract Instruction-following language models demand robust methodologies for information retrieval to augment instructions for question-answering applications. A primary challenge is the resolution of coreferences in the context of chunking strategies for long documents. The critical barrier to experimentation of handling coreferences is a lack of open source datasets, specifically in question-answering tasks that require coreference resolution. In this work we present our Coreference Resolution in Question-Answering (CRaQAn) dataset, an open-source dataset that caters to the nuanced information retrieval requirements of coreference resolution in question-answering tasks by providing over 250 question-answer pairs containing coreferences. To develop this dataset, we developed a novel approach for creating high-quality datasets using an instruction-following model (GPT-4) and a Recursive Criticism and Improvement Loop.
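The Recursive Criticism and Improvement Loop can be sketched as a draft, critique, revise cycle, where `generate` and `critique` are assumed wrappers around an instruction-following model such as GPT-4:

```python
def recursive_criticism_loop(generate, critique, max_rounds: int = 3) -> str:
    """Draft a QA pair, ask the model to critique it, and revise until the
    critic passes it or the round budget is exhausted (a sketch, not the
    paper's exact prompts)."""
    draft = generate("Write a question-answer pair whose question requires "
                     "resolving a coreference in the passage.")
    for _ in range(max_rounds):
        feedback = critique(draft)
        if feedback.strip().lower() == "pass":
            break
        draft = generate(f"Revise this QA pair per the feedback.\n"
                         f"QA: {draft}\nFeedback: {feedback}")
    return draft
```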

Domain-Specific Deep Learning Feature Extractor for Diabetic Foot Ulcer Detection

  • paper_url: http://arxiv.org/abs/2311.16312
  • repo_url: None
  • paper_authors: Reza Basiri, Milos R. Popovic, Shehroz S. Khan
  • for: Develops a deep-learning network for automatic detection of diabetic foot ulcer (DFU) wounds and identifies the most accurate feature extractor.
  • methods: Compares 14 deep-learning networks, including UNet and EfficientNetb3, evaluated with mAP and F1-score on the publicly available DFU2020 dataset.
  • results: The combination of UNet and the EfficientNetb3 feature extractor achieved the best evaluation results; these two components can be used to build a DFU-domain-specific autonomous wound detection pipeline.
    Abstract Diabetic Foot Ulcer (DFU) is a condition requiring constant monitoring and evaluations for treatment. DFU patient population is on the rise and will soon outpace the available health resources. Autonomous monitoring and evaluation of DFU wounds is a much-needed area in health care. In this paper, we evaluate and identify the most accurate feature extractor that is the core basis for developing a deep-learning wound detection network. For the evaluation, we used mAP and F1-score on the publicly available DFU2020 dataset. A combination of UNet and EfficientNetb3 feature extractor resulted in the best evaluation among the 14 networks compared. UNet and Efficientnetb3 can be used as the classifier in the development of a comprehensive DFU domain-specific autonomous wound detection pipeline.

A Graph Neural Network-Based QUBO-Formulated Hamiltonian-Inspired Loss Function for Combinatorial Optimization using Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2311.16277
  • repo_url: None
  • paper_authors: Redwan Ahmed Rizvee, Raheeb Hasan, Md. Mosaddek Khan
  • for: Addresses the challenge of solving combinatorial optimization (CO) problems over graphs with quantum optimization algorithms, specifically by leveraging the Quadratic Unconstrained Binary Optimization (QUBO) formulation and the Ising Hamiltonian.
  • methods: Builds on PI-GNN, a generic framework combining a Graph Neural Network (GNN) architecture with a QUBO-formulated Hamiltonian-inspired loss function, and introduces a novel Monte Carlo Tree Search-based strategy with GNN that applies guided search through manual perturbation of node labels during training.
  • results: The proposed methods improve performance on CO problems over graphs, with up to 44% improvement in the number of constraint violations compared to PI-GNN, demonstrating the effectiveness of the approach.
    Abstract Quadratic Unconstrained Binary Optimization (QUBO) is a generic technique to model various NP-hard Combinatorial Optimization problems (CO) in the form of binary variables. The Ising Hamiltonian is used to model the energy function of a system. QUBO to Ising Hamiltonian is regarded as a technique to solve various canonical optimization problems through quantum optimization algorithms. Recently, PI-GNN, a generic framework, has been proposed to address CO problems over graphs based on a Graph Neural Network (GNN) architecture. It introduced a generic QUBO-formulated Hamiltonian-inspired loss function that was directly optimized using GNN. PI-GNN is highly scalable, but there is a noticeable decrease in the number of satisfied constraints when compared to problem-specific algorithms, which becomes more pronounced with increased graph densities. Here, we identify a behavioral pattern related to it and devise strategies to improve its performance. Another group of literature uses Reinforcement learning (RL) to solve the aforementioned NP-hard problems using problem-specific reward functions. In this work, we also focus on creating a bridge between the RL-based solutions and the QUBO-formulated Hamiltonian. We formulate and empirically evaluate the compatibility of the QUBO-formulated Hamiltonian as a generic reward function in the RL-based paradigm. Furthermore, we also introduce a novel Monte Carlo Tree Search-based strategy with GNN where we apply a guided search through manual perturbation of node labels during training. We empirically evaluated our methods and observed up to 44% improvement in the number of constraint violations compared to PI-GNN.
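A common way to write such a QUBO-formulated, Hamiltonian-inspired loss: relax the binary decision variables to the GNN's output probabilities and minimize the resulting quadratic form. This is a sketch; the RL reward in the paper is built from the same Hamiltonian:

```python
import torch

def qubo_hamiltonian_loss(probs: torch.Tensor, Q: torch.Tensor) -> torch.Tensor:
    """Relaxed QUBO objective: binary variables x in {0,1}^n are relaxed to
    node probabilities p in [0,1]^n, and the differentiable surrogate
    p^T Q p is minimized; Q encodes the CO problem (e.g., MIS or MaxCut
    penalties)."""
    return probs @ Q @ probs
```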

RelVAE: Generative Pretraining for few-shot Visual Relationship Detection

  • paper_url: http://arxiv.org/abs/2311.16261
  • repo_url: None
  • paper_authors: Sotiris Karapiperis, Markos Diomataris, Vassilis Pitsikalis
  • for: Targets few-shot Visual Relationship Detection (VRD), a setting so far neglected by the community due to the lack of high-quality, diverse, large-scale datasets.
  • methods: Introduces a generative model that captures the variation of semantic, visual, and spatial information of relations inside a latent space, and exploits its representations for efficient few-shot classification.
  • results: Outperforms baselines on few-shot training splits of the VG200 and VRD datasets, with qualitative experiments interpreting the model's decisions.
    Abstract Visual relations are complex, multimodal concepts that play an important role in the way humans perceive the world. As a result of their complexity, high-quality, diverse and large scale datasets for visual relations are still absent. In an attempt to overcome this data barrier, we choose to focus on the problem of few-shot Visual Relationship Detection (VRD), a setting that has been so far neglected by the community. In this work we present the first pretraining method for few-shot predicate classification that does not require any annotated relations. We achieve this by introducing a generative model that is able to capture the variation of semantic, visual and spatial information of relations inside a latent space and later exploiting its representations in order to achieve efficient few-shot classification. We construct few-shot training splits and show quantitative experiments on VG200 and VRD datasets where our model outperforms the baselines. Lastly we attempt to interpret the decisions of the model by conducting various qualitative experiments.

Removing NSFW Concepts from Vision-and-Language Models for Text-to-Image Retrieval and Generation

  • paper_url: http://arxiv.org/abs/2311.16254
  • repo_url: https://github.com/aimagelab/safe-clip
  • paper_authors: Samuele Poppi, Tobia Poppi, Federico Cocchi, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
  • for: Making vision-and-language models safer for use in sensitive and trustworthy contexts.
  • methods: Removes sensitivity to not-safe-for-work concepts by distilling from a large language model that converts between safe and unsafe sentences, fine-tuned starting from just 100 manually curated pairs.
  • results: Extensive experiments on the resulting embedding space, for both retrieval and text-to-image generation, demonstrate that the model can be properly employed with pre-trained image generators.
    Abstract Vision-and-Language models such as CLIP have demonstrated remarkable effectiveness across a wide range of tasks. However, these models are typically trained on web-scale data, which can introduce inappropriate content and lead to the development of unsafe and biased behavior. This, in turn, hampers their applicability in sensitive and trustworthy contexts and could raise significant concern in their adoption. To overcome these limitations, we introduce a methodology to make Vision-and-Language models safer by removing their sensitivity to not-safe-for-work concepts. We show how this can be done by distilling from a large language model which converts between safe and unsafe sentences and which is fine-tuned starting from just 100 manually-curated pairs. We conduct extensive experiments on the resulting embedding space for both retrieval and text-to-image generation, where we show that our model can also be properly employed with pre-trained image generators. Our source code and trained models are available at: https://github.com/aimagelab/safe-clip.

IG Captioner: Information Gain Captioners are Strong Zero-shot Classifiers

  • paper_url: http://arxiv.org/abs/2311.17072
  • repo_url: None
  • paper_authors: Chenglin Yang, Siyuan Qiao, Yuan Cao, Yu Zhang, Tao Zhu, Alan Yuille, Jiahui Yu
  • for: Narrowing the performance gap between generatively trained vision-language models and discriminative classifiers on classification tasks.
  • methods: Redesigns the captioner's scoring objective to alleviate the distributional bias inherited from text-only language-model training and to measure the gain of information brought by the visual inputs, together with a matching generative training objective.
  • results: On zero-shot ImageNet classification, the IG captioner achieves >18% improvement over the standard captioner, reaching performance comparable to the CLIP classifier, and also shows strong zero-shot image-text retrieval performance on MSCOCO and Flickr30K.
    Abstract Generative training has been demonstrated to be powerful for building visual-language models. However, on zero-shot discriminative benchmarks, there is still a performance gap between models trained with generative and discriminative objectives. In this paper, we aim to narrow this gap by improving the efficacy of generative training on classification tasks, without any finetuning processes or additional modules. Specifically, we focus on narrowing the gap between the generative captioner and the CLIP classifier. We begin by analysing the predictions made by the captioner and classifier and observe that the caption generation inherits the distribution bias from the language model trained with pure text modality, making it less grounded on the visual signal. To tackle this problem, we redesign the scoring objective for the captioner to alleviate the distributional bias and focus on measuring the gain of information brought by the visual inputs. We further design a generative training objective to match the evaluation objective. We name our model trained and evaluated from the novel procedures as Information Gain (IG) captioner. We pretrain the models on the public Laion-5B dataset and perform a series of discriminative evaluations. For the zero-shot classification on ImageNet, IG captioner achieves $> 18\%$ improvements over the standard captioner, achieving comparable performances with the CLIP classifier. IG captioner also demonstrated strong performance on zero-shot image-text retrieval tasks on MSCOCO and Flickr30K. We hope this paper inspires further research towards unifying generative and discriminative training procedures for visual-language models.
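One reading of the evaluation objective: score each candidate class text by how much the image raises its log-likelihood over a visual-signal-free baseline. A hedged sketch; the `caption_logprob` interface and the blank-image baseline are assumptions, not the paper's exact formulation:

```python
import torch

@torch.no_grad()
def ig_classify(caption_logprob, image: torch.Tensor, class_names):
    """Zero-shot classification by information gain: pick the class text
    whose likelihood benefits most from the visual input."""
    blank = torch.zeros_like(image)   # stand-in for "no visual signal"
    gains = {c: caption_logprob(image, c) - caption_logprob(blank, c)
             for c in class_names}
    return max(gains, key=gains.get)
```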

Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models

  • paper_url: http://arxiv.org/abs/2311.16103
  • repo_url: https://github.com/pku-yuangroup/video-bench
  • paper_authors: Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, Li Yuan
  • for: Proposes a comprehensive evaluation system to guide the development of video-based large language models (Video-LLMs) with human-like perception and decision-making.
  • methods: Uses 10 meticulously crafted tasks to evaluate Video-LLMs at three levels: video-exclusive understanding, prior-knowledge-based question answering, and comprehension and decision-making. An automatic toolkit processes model outputs, computes metrics, and generates final scores.
  • results: Current Video-LLMs still fall considerably short of human-like comprehension and analysis of real-world videos, offering valuable insights for future research directions.
    Abstract Video-based large language models (Video-LLMs) have been recently introduced, targeting both fundamental improvements in perception and comprehension, and a diverse range of user inquiries. In pursuit of the ultimate goal of achieving artificial general intelligence, a truly intelligent Video-LLM model should not only see and understand the surroundings, but also possess human-level commonsense, and make well-informed decisions for the users. To guide the development of such a model, the establishment of a robust and comprehensive evaluation system becomes crucial. To this end, this paper proposes \textit{Video-Bench}, a new comprehensive benchmark along with a toolkit specifically designed for evaluating Video-LLMs. The benchmark comprises 10 meticulously crafted tasks, evaluating the capabilities of Video-LLMs across three distinct levels: Video-exclusive Understanding, Prior Knowledge-based Question-Answering, and Comprehension and Decision-making. In addition, we introduce an automatic toolkit tailored to process model outputs for various tasks, facilitating the calculation of metrics and generating convenient final scores. We evaluate 8 representative Video-LLMs using \textit{Video-Bench}. The findings reveal that current Video-LLMs still fall considerably short of achieving human-like comprehension and analysis of real-world videos, offering valuable insights for future research directions. The benchmark and toolkit are available at: \url{https://github.com/PKU-YuanGroup/Video-Bench}.

Test-time Adaptation of Discriminative Models via Diffusion Generative Feedback

  • paper_url: http://arxiv.org/abs/2311.16102
  • repo_url: https://github.com/mihirp1998/Diffusion-TTA
  • paper_authors: Mihir Prabhudesai, Tsung-Wei Ke, Alexander C. Li, Deepak Pathak, Katerina Fragkiadaki
  • for: Explores how generative models can be used to improve the accuracy of discriminative models.
  • methods: Proposes Diffusion-TTA, a diffusion-based test-time adaptation method that adapts pre-trained discriminative models to each unlabelled test example using generative feedback: the diffusion model's conditioning is modulated by the discriminative model's output, and the image-likelihood objective is maximized by backpropagating gradients into the discriminative model's parameters.
  • results: Diffusion-TTA significantly improves the accuracy of large-scale pre-trained discriminative models, including ImageNet classifiers, CLIP models, image pixel labellers, and image depth predictors, and outperforms existing test-time adaptation methods such as TTT-MAE and TENT, particularly in online adaptation settings where the model is continually adapted to each test example. Code, results, and visualizations are available at https://diffusion-tta.github.io/.
    Abstract The advancements in generative modeling, particularly the advent of diffusion models, have sparked a fundamental question: how can these models be effectively used for discriminative tasks? In this work, we find that generative models can be great test-time adapters for discriminative models. Our method, Diffusion-TTA, adapts pre-trained discriminative models such as image classifiers, segmenters and depth predictors, to each unlabelled example in the test set using generative feedback from a diffusion model. We achieve this by modulating the conditioning of the diffusion model using the output of the discriminative model. We then maximize the image likelihood objective by backpropagating the gradients to discriminative model's parameters. We show Diffusion-TTA significantly enhances the accuracy of various large-scale pre-trained discriminative models, such as, ImageNet classifiers, CLIP models, image pixel labellers and image depth predictors. Diffusion-TTA outperforms existing test-time adaptation methods, including TTT-MAE and TENT, and particularly shines in online adaptation setups, where the discriminative model is continually adapted to each example in the test set. We provide access to code, results, and visualizations on our website: https://diffusion-tta.github.io/.
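The adaptation loop is short: condition the diffusion model on the classifier's output, evaluate the denoising loss as an image-likelihood surrogate, and backpropagate into the classifier. A sketch with an assumed `diffusion.loss(x, cond)` interface:

```python
def diffusion_tta_step(classifier, diffusion, x, optimizer) -> float:
    """One adaptation step: condition the diffusion model on the
    classifier's prediction and update the *classifier* parameters by
    the diffusion (noise-prediction) loss."""
    cond = classifier(x).softmax(dim=-1)   # class probabilities as conditioning
    loss = diffusion.loss(x, cond)         # assumed interface
    optimizer.zero_grad()
    loss.backward()                        # gradients flow into the classifier
    optimizer.step()
    return loss.item()
```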
    摘要 “生成模型的进步,尤其是扩散模型的出现,引起了一个基本问题:如何使用这些模型来进行推断性任务?在这个工作中,我们发现了生成模型可以作为推断模型的test-time adapter。我们的方法,Diffusion-TTA,使用生成反馈来调整预训练的推断模型,以便在测试集中对每个无标示示例进行适应。我们使用扩散模型的输出来修改生成模型的conditioning,然后通过推断模型的参数来最大化图像可能性目标。我们示出Diffusion-TTA可以显著提高各种大规模预训练的推断模型的准确率,包括图像分类器、分割器、深度预测器等。Diffusion-TTA超过了现有的test-time adaptation方法,包括TTT-MAE和TENT,特别在在线适应设置中,推断模型 continually adapts to each example in the test set。我们在我们的网站上提供了代码、结果和视觉化:https://diffusion-tta.github.io/。”Note: The translation is in Simplified Chinese, which is one of the two standard varieties of Chinese. The other variety is Traditional Chinese.

On Bringing Robots Home

  • paper_url: http://arxiv.org/abs/2311.16098
  • repo_url: https://github.com/notmahi/dobb-e
  • paper_authors: Nur Muhammad Mahi Shafiullah, Anant Rai, Haritheja Etukuru, Yiqian Liu, Ishan Misra, Soumith Chintala, Lerrel Pinto
  • for: Develops a reliable, cost-effective household robot system capable of handling diverse domestic tasks.
  • methods: Uses low-cost components and an iPhone-based demonstration collection tool (The Stick) to gather 13 hours of data across 22 homes, used to train Home Pretrained Representations (HPR); in a novel home environment, only 5 minutes of demonstrations and 15 minutes of adaptation are needed for Dobb-E to solve a task reliably.
  • results: Over roughly 30 days of experimentation in homes in and around New York City, Dobb-E achieved an 81% success rate across 109 tasks in 10 homes; the experiments also surfaced challenges absent or ignored in lab robotics, such as strong shadows and variable demonstration quality from non-expert users.
    Abstract Throughout history, we have successfully integrated various machines into our homes. Dishwashers, laundry machines, stand mixers, and robot vacuums are a few recent examples. However, these machines excel at performing only a single task effectively. The concept of a "generalist machine" in homes - a domestic assistant that can adapt and learn from our needs, all while remaining cost-effective - has long been a goal in robotics that has been steadily pursued for decades. In this work, we initiate a large-scale effort towards this goal by introducing Dobb-E, an affordable yet versatile general-purpose system for learning robotic manipulation within household settings. Dobb-E can learn a new task with only five minutes of a user showing it how to do it, thanks to a demonstration collection tool ("The Stick") we built out of cheap parts and iPhones. We use the Stick to collect 13 hours of data in 22 homes of New York City, and train Home Pretrained Representations (HPR). Then, in a novel home environment, with five minutes of demonstrations and fifteen minutes of adapting the HPR model, we show that Dobb-E can reliably solve the task on the Stretch, a mobile robot readily available on the market. Across roughly 30 days of experimentation in homes of New York City and surrounding areas, we test our system in 10 homes, with a total of 109 tasks in different environments, and finally achieve a success rate of 81%. Beyond success percentages, our experiments reveal a plethora of unique challenges absent or ignored in lab robotics. These range from effects of strong shadows, to variable demonstration quality by non-expert users. With the hope of accelerating research on home robots, and eventually seeing robot butlers in every home, we open-source Dobb-E software stack and models, our data, and our hardware designs at https://dobb-e.com

Interactive Autonomous Navigation with Internal State Inference and Interactivity Estimation

  • paper_url: http://arxiv.org/abs/2311.16091
  • repo_url: None
  • paper_authors: Jiachen Li, David Isele, Kanghoon Lee, Jinkyoo Park, Kikuo Fujimura, Mykel J. Kochenderfer
  • for: Improving the navigation of intelligent agents (e.g., autonomous vehicles) in complex scenarios while providing explainable intermediate indicators.
  • methods: Integrates three auxiliary tasks with spatio-temporal relational reasoning into the standard deep reinforcement learning framework; explicitly infers the internal states (traits and intentions) of surrounding agents and predicts their future trajectories with and without the ego agent via counterfactual reasoning; a spatio-temporal graph neural network encodes relations between dynamic entities; an interactivity estimation mechanism measures the ego agent's influence on other agents from the difference between the two predicted trajectories.
  • results: Tested in an intersection driving simulator based on the Intelligent Intersection Driver Model (IIDM) that simulates vehicles and pedestrians, the approach achieves robust, state-of-the-art performance on standard evaluation metrics and provides explainable intermediate indicators (internal states and interactivity scores) for decision making.
    Abstract Deep reinforcement learning (DRL) provides a promising way for intelligent agents (e.g., autonomous vehicles) to learn to navigate complex scenarios. However, DRL with neural networks as function approximators is typically considered a black box with little explainability and often suffers from suboptimal performance, especially for autonomous navigation in highly interactive multi-agent environments. To address these issues, we propose three auxiliary tasks with spatio-temporal relational reasoning and integrate them into the standard DRL framework, which improves the decision making performance and provides explainable intermediate indicators. We propose to explicitly infer the internal states (i.e., traits and intentions) of surrounding agents (e.g., human drivers) as well as to predict their future trajectories in the situations with and without the ego agent through counterfactual reasoning. These auxiliary tasks provide additional supervision signals to infer the behavior patterns of other interactive agents. Multiple variants of framework integration strategies are compared. We also employ a spatio-temporal graph neural network to encode relations between dynamic entities, which enhances both internal state inference and decision making of the ego agent. Moreover, we propose an interactivity estimation mechanism based on the difference between predicted trajectories in these two situations, which indicates the degree of influence of the ego agent on other agents. To validate the proposed method, we design an intersection driving simulator based on the Intelligent Intersection Driver Model (IIDM) that simulates vehicles and pedestrians. Our approach achieves robust and state-of-the-art performance in terms of standard evaluation metrics and provides explainable intermediate indicators (i.e., internal states, and interactivity scores) for decision making.
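The interactivity estimate compares an agent's predicted trajectory with and without the ego vehicle present. A minimal sketch, where mean displacement is an assumed choice of distance:

```python
import numpy as np

def interactivity_score(traj_with_ego: np.ndarray,
                        traj_without_ego: np.ndarray) -> float:
    """Mean displacement between an agent's predicted trajectories with and
    without the ego vehicle; larger values indicate stronger ego influence.
    Both arrays have shape (timesteps, 2)."""
    return float(np.linalg.norm(traj_with_ego - traj_without_ego, axis=-1).mean())
```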

MAST: Model-Agnostic Sparsified Training

  • paper_url: http://arxiv.org/abs/2311.16086
  • repo_url: https://github.com/konstmish/opt_methods
  • paper_authors: Yury Demidovich, Grigory Malinovsky, Egor Shulgin, Peter Richtárik
  • for: Improving the efficiency and stability of machine learning model training.
  • methods: Uses random sketch operators together with an initially pre-trained model, enabling sparsification of both the model and the gradient during training.
  • results: Proposes a new optimization problem formulation with insightful properties, and presents adapted SGD variants (general sampling, a distributed version, and variance-reduction techniques) with tighter convergence rates and relaxed assumptions, covering important techniques such as Dropout and sparse training.
    Abstract We introduce a novel optimization problem formulation that departs from the conventional way of minimizing machine learning model loss as a black-box function. Unlike traditional formulations, the proposed approach explicitly incorporates an initially pre-trained model and random sketch operators, allowing for sparsification of both the model and gradient during training. We establish insightful properties of the proposed objective function and highlight its connections to the standard formulation. Furthermore, we present several variants of the Stochastic Gradient Descent (SGD) method adapted to the new problem formulation, including SGD with general sampling, a distributed version, and SGD with variance reduction techniques. We achieve tighter convergence rates and relax assumptions, bridging the gap between theoretical principles and practical applications, covering several important techniques such as Dropout and Sparse training. This work presents promising opportunities to enhance the theoretical understanding of model training through a sparsification-aware optimization approach.
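Schematically, the objective evaluates the loss through a random sparsifying operator applied to the weights, so gradients inherit the sparsity pattern. A deliberately simplified sketch; the paper's sketch operators, and the role of the pretrained model, are more general than this Bernoulli mask:

```python
import torch

def sparsified_loss(loss_fn, params: torch.Tensor, keep_prob: float = 0.5):
    """Evaluate f(S(x)) for a random sparsifying sketch S (here a Bernoulli
    mask), reminiscent of Dropout-style training on the weights."""
    mask = (torch.rand_like(params) < keep_prob).float()
    return loss_fn(mask * params)
```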

Transformer-QEC: Quantum Error Correction Code Decoding with Transferable Transformers

  • paper_url: http://arxiv.org/abs/2311.16082
  • repo_url: None
  • paper_authors: Hanrui Wang, Pengyu Liu, Kevin Shao, Dantong Li, Jiaqi Gu, David Z. Pan, Yongshan Ding, Song Han
  • for: Develops a transformer-based quantum error correction (QEC) decoder to reduce logical error rates in quantum computing systems.
  • methods: Trains a transformer-based QEC decoder with a mixed loss that combines local physical-error and global parity-label losses.
  • results: Evaluated on six code distances and ten different error configurations, the model consistently outperforms non-ML decoders and other ML decoders, achieving the best logical error rates; transfer learning saves over 10x of the training cost.
    Abstract Quantum computing has the potential to solve problems that are intractable for classical systems, yet the high error rates in contemporary quantum devices often exceed tolerable limits for useful algorithm execution. Quantum Error Correction (QEC) mitigates this by employing redundancy, distributing quantum information across multiple data qubits and utilizing syndrome qubits to monitor their states for errors. The syndromes are subsequently interpreted by a decoding algorithm to identify and correct errors in the data qubits. This task is complex due to the multiplicity of error sources affecting both data and syndrome qubits as well as syndrome extraction operations. Additionally, identical syndromes can emanate from different error sources, necessitating a decoding algorithm that evaluates syndromes collectively. Although machine learning (ML) decoders such as multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs) have been proposed, they often focus on local syndrome regions and require retraining when adjusting for different code distances. We introduce a transformer-based QEC decoder which employs self-attention to achieve a global receptive field across all input syndromes. It incorporates a mixed loss training approach, combining both local physical error and global parity label losses. Moreover, the transformer architecture's inherent adaptability to variable-length inputs allows for efficient transfer learning, enabling the decoder to adapt to varying code distances without retraining. Evaluation on six code distances and ten different error configurations demonstrates that our model consistently outperforms non-ML decoders, such as Union Find (UF) and Minimum Weight Perfect Matching (MWPM), and other ML decoders, thereby achieving best logical error rates. Moreover, the transfer learning can save over 10x of training cost.
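The mixed loss pairs a local physical-error term with a global parity-label term. A sketch, with the weighting `lam` as an assumed hyperparameter:

```python
import torch.nn.functional as F

def mixed_qec_loss(local_logits, local_targets, parity_logits, parity_targets,
                   lam: float = 0.5):
    """Combine a per-qubit physical-error classification loss with a global
    logical-parity loss (parity targets are float labels in {0, 1})."""
    local_term = F.cross_entropy(local_logits, local_targets)
    global_term = F.binary_cross_entropy_with_logits(parity_logits, parity_targets)
    return local_term + lam * global_term
```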

ViT-Lens-2: Gateway to Omni-modal Intelligence

  • paper_url: http://arxiv.org/abs/2311.16081
  • repo_url: https://github.com/TencentARC/ViT-Lens
  • paper_authors: Weixian Lei, Yixiao Ge, Kun Yi, Jianfeng Zhang, Difei Gao, Dylan Sun, Yuying Ge, Ying Shan, Mike Zheng Shou
  • for: Advancing AI agents: large foundation models greatly improve reasoning and instruction execution, but the current focus on vision and language neglects the potential of perceiving diverse modalities in open-world environments.
  • methods: Proposes ViT-Lens-2, which efficiently integrates novel modalities with a pretrained ViT by projecting them into a pre-defined space for effective representation learning.
  • results: ViT-Lens-2 sets new state-of-the-art results on various understanding tasks, including zero-shot classification, and, integrated into multimodal foundation models, enables zero-shot any-modality to text and image generation.
    Abstract Aiming to advance AI agents, large foundation models significantly improve reasoning and instruction execution, yet the current focus on vision and language neglects the potential of perceiving diverse modalities in open-world environments. However, the success of data-driven vision and language models is costly or even infeasible to be reproduced for rare modalities. In this paper, we present ViT-Lens-2 that facilitates efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning them to a pre-defined space. Specifically, the modality-specific lens is tuned to project any-modal signals to an intermediate embedding space, which are then processed by a strong ViT with pre-trained visual knowledge. The encoded representations are optimized toward aligning with the modal-independent space, pre-defined by off-the-shelf foundation models. ViT-Lens-2 provides a unified solution for representation learning of increasing modalities with two appealing advantages: (i) Unlocking the great potential of pretrained ViTs to novel modalities effectively with efficient data regime; (ii) Enabling emergent downstream capabilities through modality alignment and shared ViT parameters. We tailor ViT-Lens-2 to learn representations for 3D point cloud, depth, audio, tactile and EEG, and set new state-of-the-art results across various understanding tasks, such as zero-shot classification. By seamlessly integrating ViT-Lens-2 into Multimodal Foundation Models, we enable Any-modality to Text and Image Generation in a zero-shot manner. Code and models are available at https://github.com/TencentARC/ViT-Lens.
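The lens can be sketched as a small trainable projector that maps a new modality's signal into the token space expected by a frozen, pretrained ViT; all dimensions below are illustrative assumptions:

```python
import torch.nn as nn

class ModalityLens(nn.Module):
    """Schematic lens: project raw signals of a new modality (here a flat
    feature vector) into ViT-compatible tokens; the ViT output is then
    aligned to a fixed anchor space defined by off-the-shelf models."""
    def __init__(self, in_dim: int, vit_dim: int = 768, n_tokens: int = 16):
        super().__init__()
        self.proj = nn.Linear(in_dim, n_tokens * vit_dim)
        self.n_tokens, self.vit_dim = n_tokens, vit_dim

    def forward(self, x):                       # x: (B, in_dim)
        return self.proj(x).view(-1, self.n_tokens, self.vit_dim)
```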

MEDITRON-70B: Scaling Medical Pretraining for Large Language Models

  • paper_url: http://arxiv.org/abs/2311.16079
  • repo_url: https://github.com/epfllm/meditron
  • paper_authors: Zeming Chen, Alejandro Hernández Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf, Amirkeivan Mohtashami, Alexandre Sallinen, Alireza Sakhaeirad, Vinitra Swamy, Igor Krawczuk, Deniz Bayazit, Axel Marmet, Syrielle Montariol, Mary-Anne Hartley, Martin Jaggi, Antoine Bosselut
  • for: MEDITRON aims to improve access to large-scale medical language models, with the goal of democratizing medical knowledge.
  • methods: MEDITRON is a suite of open-source language models with 7B and 70B parameters, adapted from Llama-2 and pretrained on a comprehensively curated medical corpus.
  • results: MEDITRON achieves significant performance gains over several state-of-the-art baselines on medical benchmarks, with a 6% absolute gain over the best public baseline in its parameter class and performance within 5% of GPT-4.
    Abstract Large language models (LLMs) can potentially democratize access to medical knowledge. While many efforts have been made to harness and improve LLMs' medical knowledge and reasoning capacities, the resulting models are either closed-source (e.g., PaLM, GPT-4) or limited in scale (<= 13B parameters), which restricts their abilities. In this work, we improve access to large-scale medical LLMs by releasing MEDITRON: a suite of open-source LLMs with 7B and 70B parameters adapted to the medical domain. MEDITRON builds on Llama-2 (through our adaptation of Nvidia's Megatron-LM distributed trainer), and extends pretraining on a comprehensively curated medical corpus, including selected PubMed articles, abstracts, and internationally-recognized medical guidelines. Evaluations using four major medical benchmarks show significant performance gains over several state-of-the-art baselines before and after task-specific finetuning. Overall, MEDITRON achieves a 6% absolute performance gain over the best public baseline in its parameter class and 3% over the strongest baseline we finetuned from Llama-2. Compared to closed-source LLMs, MEDITRON-70B outperforms GPT-3.5 and Med-PaLM and is within 5% of GPT-4 and 10% of Med-PaLM-2. We release our code for curating the medical pretraining corpus and the MEDITRON model weights to drive open-source development of more capable medical LLMs.

BioLORD-2023: Semantic Textual Representations Fusing LLM and Clinical Knowledge Graph Insights

  • paper_url: http://arxiv.org/abs/2311.16075
  • repo_url: None
  • paper_authors: François Remy, Kris Demuynck, Thomas Demeester
  • for: This work investigates how large language models can complement biomedical knowledge graphs to improve the training of semantic models for the biomedical and clinical domains.
  • methods: The proposed approach consists of three steps: an improved contrastive learning phase, a novel self-distillation phase, and a weight-averaging phase.
  • results: Rigorous evaluation on the BioLORD testing suite and diverse downstream tasks demonstrates consistent and substantial gains over the previous state of the art (+2 pts on MedSTS, +2.5 pts on MedNLI-S, +6.1 pts on EHR-Rel-B). The authors also distill and release a multilingual model compatible with 50+ languages and fine-tuned on 7 European languages, which can benefit many clinical pipelines.
    Abstract In this study, we investigate the potential of Large Language Models to complement biomedical knowledge graphs in the training of semantic models for the biomedical and clinical domains. Drawing on the wealth of the UMLS knowledge graph and harnessing cutting-edge Large Language Models, we propose a new state-of-the-art approach for obtaining high-fidelity representations of biomedical concepts and sentences, consisting of three steps: an improved contrastive learning phase, a novel self-distillation phase, and a weight averaging phase. Through rigorous evaluations via the extensive BioLORD testing suite and diverse downstream tasks, we demonstrate consistent and substantial performance improvements over the previous state of the art (e.g. +2pts on MedSTS, +2.5pts on MedNLI-S, +6.1pts on EHR-Rel-B). Besides our new state-of-the-art biomedical model for English, we also distill and release a multilingual model compatible with 50+ languages and finetuned on 7 European languages. Many clinical pipelines can benefit from our latest models. Our new multilingual model enables a range of languages to benefit from our advancements in biomedical semantic representation learning, opening a new avenue for bioinformatics researchers around the world. As a result, we hope to see BioLORD-2023 becoming a precious tool for future biomedical applications.
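The three training stages named in the abstract map naturally onto three small routines, sketched below. This assumes `encoder` is a callable mapping a batch of strings to a (B, D) embedding tensor; the specific losses and the uniform averaging are plausible stand-ins, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def contrastive_step(encoder, names, definitions, temperature=0.05):
    """Stage 1 (sketch): pull a concept name toward its own definition and
    push it away from the other definitions in the batch (InfoNCE)."""
    z_n = F.normalize(encoder(names), dim=-1)        # (B, D)
    z_d = F.normalize(encoder(definitions), dim=-1)  # (B, D)
    logits = z_n @ z_d.t() / temperature
    targets = torch.arange(logits.size(0))
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def self_distillation_step(student, teacher, texts):
    """Stage 2 (sketch): regress the student's embeddings onto a frozen
    teacher copy to stabilize what contrastive training has learned."""
    with torch.no_grad():
        t = F.normalize(teacher(texts), dim=-1)
    s = F.normalize(student(texts), dim=-1)
    return (1.0 - (s * t).sum(dim=-1)).mean()        # mean cosine distance

def average_weights(state_dicts):
    """Stage 3: uniform weight averaging ("model soup") of checkpoints."""
    return {k: sum(sd[k] for sd in state_dicts) / len(state_dicts)
            for k in state_dicts[0]}
```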

A Survey on Vulnerability of Federated Learning: A Learning Algorithm Perspective

  • paper_url: http://arxiv.org/abs/2311.16065
  • repo_url: https://github.com/rand2ai/awesome-vulnerability-of-federated-learning
  • paper_authors: Xianghua Xie, Chen Hu, Hanchi Ren, Jingjing Deng
  • for: This review surveys malicious attacks against federated learning (FL) systems, categorizing them from new perspectives on attack origins and targets.
  • methods: Based on the source and target of the attack, existing threat models are grouped into four types: Data to Model (D2M), Model to Data (M2D), Model to Model (M2M), and composite attacks. For each type, the paper discusses proposed defense strategies, which have evolved from single-metric checks and the exclusion of malicious clients to multifaceted approaches that examine client models at various phases.
  • results: The survey shows that the to-learn data, the learning gradients, and the learned model at different stages can all be manipulated to launch attacks that degrade model performance, reconstruct private local data, or insert backdoors. These threats are also becoming more insidious: while earlier attacks typically amplified malicious gradients, recent ones subtly alter the least significant weights in local models to bypass defenses.
    Abstract This review paper takes a comprehensive look at malicious attacks against FL, categorizing them from new perspectives on attack origins and targets, and providing insights into their methodology and impact. In this survey, we focus on threat models targeting the learning process of FL systems. Based on the source and target of the attack, we categorize existing threat models into four types, Data to Model (D2M), Model to Data (M2D), Model to Model (M2M) and composite attacks. For each attack type, we discuss the defense strategies proposed, highlighting their effectiveness, assumptions and potential areas for improvement. Defense strategies have evolved from using a singular metric to excluding malicious clients, to employing a multifaceted approach examining client models at various phases. In this survey paper, our research indicates that the to-learn data, the learning gradients, and the learned model at different stages all can be manipulated to initiate malicious attacks that range from undermining model performance, reconstructing private local data, and to inserting backdoors. We have also seen these threat are becoming more insidious. While earlier studies typically amplified malicious gradients, recent endeavors subtly alter the least significant weights in local models to bypass defense measures. This literature review provides a holistic understanding of the current FL threat landscape and highlights the importance of developing robust, efficient, and privacy-preserving defenses to ensure the safe and trusted adoption of FL in real-world applications.
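As a toy illustration of one attack/defense pair from the survey's taxonomy, the sketch below shows a FedAvg round in which a malicious client amplifies its update (the older, easily detected attack style the abstract mentions) and the server applies norm clipping as a defense. `local_update` is a stub and all numbers are arbitrary.

```python
import numpy as np

def local_update(weights, data):
    """Placeholder for one client's local training; returns a model delta."""
    rng = np.random.default_rng(0)
    return rng.normal(scale=0.01, size=weights.shape)

def fedavg_round(weights, clients, malicious_ids=(), boost=10.0, clip=0.05):
    """One FedAvg round: a malicious client scales up its update; the
    server clips update norms before averaging."""
    deltas = []
    for cid, data in clients.items():
        delta = local_update(weights, data)
        if cid in malicious_ids:
            delta = boost * delta                  # amplified malicious gradient
        norm = np.linalg.norm(delta)
        if norm > clip:                            # server-side norm clipping
            delta = delta * (clip / norm)
        deltas.append(delta)
    return weights + np.mean(deltas, axis=0)

weights = np.zeros(100)
clients = {f"client{i}": None for i in range(10)}
weights = fedavg_round(weights, clients, malicious_ids={"client3"})
```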

OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2311.16038
  • repo_url: https://github.com/wzzheng/occworld
  • paper_authors: Wenzhao Zheng, Weiliang Chen, Yuanhui Huang, Borui Zhang, Yueqi Duan, Jiwen Lu
  • for: The paper aims to improve the understanding of the 3D scene evolution in autonomous driving by proposing a new framework called OccWorld, which learns a world model in the 3D occupancy space.
  • methods: The proposed method uses a reconstruction-based scene tokenizer to obtain discrete scene tokens, and a GPT-like spatial-temporal generative transformer to generate subsequent scene and ego tokens.
  • results: The paper demonstrates the effectiveness of OccWorld in modeling the evolution of driving scenes through extensive experiments on the nuScenes benchmark, and shows competitive planning results without using instance and map supervision.
    Abstract Understanding how the 3D scene evolves is vital for making decisions in autonomous driving. Most existing methods achieve this by predicting the movements of object boxes, which cannot capture more fine-grained scene information. In this paper, we explore a new framework of learning a world model, OccWorld, in the 3D Occupancy space to simultaneously predict the movement of the ego car and the evolution of the surrounding scenes. We propose to learn a world model based on 3D occupancy rather than 3D bounding boxes and segmentation maps for three reasons: 1) expressiveness. 3D occupancy can describe the more fine-grained 3D structure of the scene; 2) efficiency. 3D occupancy is more economical to obtain (e.g., from sparse LiDAR points). 3) versatility. 3D occupancy can adapt to both vision and LiDAR. To facilitate the modeling of the world evolution, we learn a reconstruction-based scene tokenizer on the 3D occupancy to obtain discrete scene tokens to describe the surrounding scenes. We then adopt a GPT-like spatial-temporal generative transformer to generate subsequent scene and ego tokens to decode the future occupancy and ego trajectory. Extensive experiments on the widely used nuScenes benchmark demonstrate the ability of OccWorld to effectively model the evolution of the driving scenes. OccWorld also produces competitive planning results without using instance and map supervision. Code: https://github.com/wzzheng/OccWorld.
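A skeleton of the GPT-like stage, assuming the occupancy tokenizer already exists: discrete scene tokens from past frames are fed through a causal transformer that predicts the next frame's tokens. Vocabulary size, depth, and token counts are placeholder choices, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SceneWorldModel(nn.Module):
    """Sketch of a GPT-like world model over discrete scene tokens: given
    the tokens of past frames, predict the tokens of the next frame."""
    def __init__(self, vocab=512, dim=256, tokens_per_frame=64, frames=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.pos = nn.Parameter(torch.zeros(frames * tokens_per_frame, dim))
        layer = nn.TransformerEncoderLayer(dim, 8, batch_first=True)
        self.gpt = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):                      # (B, T*tokens_per_frame)
        x = self.embed(tokens) + self.pos[: tokens.size(1)]
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.gpt(x, mask=mask)                  # causal over the token stream
        return self.head(h)                         # next-token logits

model = SceneWorldModel()
past = torch.randint(0, 512, (2, 3 * 64))           # 3 past frames, batch of 2
logits = model(past)
# teacher forcing: predict token t+1 from tokens <= t
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, 512), past[:, 1:].reshape(-1))
```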

RobustState: Boosting Fidelity of Quantum State Preparation via Noise-Aware Variational Training

  • paper_url: http://arxiv.org/abs/2311.16035
  • repo_url: None
  • paper_authors: Hanrui Wang, Yilian Liu, Pengyu Liu, Jiaqi Gu, Zirui Li, Zhiding Liang, Jinglei Cheng, Yongshan Ding, Xuehai Qian, Yiyu Shi, David Z. Pan, Frederic T. Chong, Song Han
  • for: This paper proposes an efficient and robust quantum state preparation algorithm to improve the fidelity and reliability of quantum computation.
  • methods: The algorithm builds on variational quantum state preparation (VQSP), which iteratively tunes ansatz parameters to approximate the target state. It uses measurement outcomes from real machines to incorporate real quantum noise into gradient calculations, improving both training efficiency and robustness.
  • results: On state preparation tasks for 4 quantum algorithms, the algorithm achieves coherent error reduction of up to 7.1$\times$ and state fidelity improvements of up to 96% and 81% for 4-qubit and 5-qubit states, respectively; on average, it improves fidelity by 50% and 72% over baseline approaches.
    Abstract Quantum state preparation, a crucial subroutine in quantum computing, involves generating a target quantum state from initialized qubits. Arbitrary state preparation algorithms can be broadly categorized into arithmetic decomposition (AD) and variational quantum state preparation (VQSP). AD employs a predefined procedure to decompose the target state into a series of gates, whereas VQSP iteratively tunes ansatz parameters to approximate target state. VQSP is particularly apt for Noisy-Intermediate Scale Quantum (NISQ) machines due to its shorter circuits. However, achieving noise-robust parameter optimization still remains challenging. We present RobustState, a novel VQSP training methodology that combines high robustness with high training efficiency. The core idea involves utilizing measurement outcomes from real machines to perform back-propagation through classical simulators, thus incorporating real quantum noise into gradient calculations. RobustState serves as a versatile, plug-and-play technique applicable for training parameters from scratch or fine-tuning existing parameters to enhance fidelity on target machines. It is adaptable to various ansatzes at both gate and pulse levels and can even benefit other variational algorithms, such as variational unitary synthesis. Comprehensive evaluation of RobustState on state preparation tasks for 4 distinct quantum algorithms using 10 real quantum machines demonstrates a coherent error reduction of up to 7.1 $\times$ and state fidelity improvement of up to 96\% and 81\% for 4-Q and 5-Q states, respectively. On average, RobustState improves fidelity by 50\% and 72\% for 4-Q and 5-Q states compared to baseline approaches.

Machine Learning-Enhanced Aircraft Landing Scheduling under Uncertainties

  • paper_url: http://arxiv.org/abs/2311.16030
  • repo_url: None
  • paper_authors: Yutian Pang, Peng Zhao, Jueming Hu, Yongming Liu
  • for: This paper addresses aircraft delays and the resulting safety risks and financial losses by proposing an innovative machine learning (ML)-enhanced landing scheduling methodology that improves automation and safety.
  • methods: A multi-stage conditional ML predictor estimates separation times conditioned on flight events. These predictions are integrated as safety constraints into a time-constrained traveling salesman problem formulation solved with mixed-integer linear programming (MILP); historical flight recordings and model predictions account for uncertainties between successive flights.
  • results: Validated on real-world data from the Atlanta Air Route Traffic Control Center (ARTCC ZTL), case studies show an average 17.2% reduction in total landing time compared with the First-Come-First-Served (FCFS) rule. Unlike FCFS, the method accounts for uncertainty, instilling confidence in the resulting schedules.
    Abstract This paper addresses aircraft delays, emphasizing their impact on safety and financial losses. To mitigate these issues, an innovative machine learning (ML)-enhanced landing scheduling methodology is proposed, aiming to improve automation and safety. Analyzing flight arrival delay scenarios reveals strong multimodal distributions and clusters in arrival flight time durations. A multi-stage conditional ML predictor enhances separation time prediction based on flight events. ML predictions are then integrated as safety constraints in a time-constrained traveling salesman problem formulation, solved using mixed-integer linear programming (MILP). Historical flight recordings and model predictions address uncertainties between successive flights, ensuring reliability. The proposed method is validated using real-world data from the Atlanta Air Route Traffic Control Center (ARTCC ZTL). Case studies demonstrate an average 17.2% reduction in total landing time compared to the First-Come-First-Served (FCFS) rule. Unlike FCFS, the proposed methodology considers uncertainties, instilling confidence in scheduling. The study concludes with remarks and outlines future research directions.
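The time-constrained sequencing core can be sketched as a small MILP in PuLP: binary ordering variables plus big-M separation constraints. For brevity the ML-predicted separation is collapsed to a constant here, whereas the paper conditions it on flight events.

```python
# pip install pulp
import pulp

flights = ["F1", "F2", "F3"]
eta = {"F1": 0, "F2": 2, "F3": 3}            # earliest landing times (min)
sep = 2                                       # separation (min); ML-predicted in the paper
M = 1000                                      # big-M for the ordering disjunctions

prob = pulp.LpProblem("landing", pulp.LpMinimize)
t = {f: pulp.LpVariable(f"t_{f}", lowBound=eta[f]) for f in flights}
# y[i, j] = 1 if flight i lands before flight j
y = {(i, j): pulp.LpVariable(f"y_{i}_{j}", cat="Binary")
     for i in flights for j in flights if i != j}

for i in flights:
    for j in flights:
        if i < j:
            prob += y[i, j] + y[j, i] == 1                  # exactly one order holds
for (i, j), var in y.items():
    prob += t[j] >= t[i] + sep - M * (1 - var)              # separation if i lands first

prob += pulp.lpSum(t.values())                              # minimize total landing time
prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({f: t[f].value() for f in flights})                   # e.g. F1: 0, F2: 2, F3: 4
```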

An HCAI Methodological Framework: Putting It Into Action to Enable Human-Centered AI

  • paper_url: http://arxiv.org/abs/2311.16027
  • repo_url: None
  • paper_authors: Wei Xu, Zaifeng Gao, Marvin Dainoff
  • for: This paper aims to provide a comprehensive and interdisciplinary methodological framework for human-centered AI (HCAI) to guide its implementation and overcome the current challenges in the field.
  • methods: The proposed framework integrates seven components: design goals, design principles, implementation approaches, design paradigms, interdisciplinary teams, methods, and processes. It is designed to be systematic and executable and can be applied to develop, transfer, and implement HCAI-based intelligent systems.
  • results: The framework is expected to overcome the weaknesses of current frameworks and the challenges faced in implementing HCAI, enabling the design, development, and deployment of HCAI-based intelligent systems that maximize the benefits of AI technology to humans while minimizing its potential adverse effects.
    Abstract Human-centered AI (HCAI), as a design philosophy, advocates prioritizing humans in designing, developing, and deploying intelligent systems, aiming to maximize the benefits of AI technology to humans and avoid its potential adverse effects. While HCAI has gained momentum, the lack of guidance on methodology in its implementation makes its adoption challenging. After assessing the needs for a methodological framework for HCAI, this paper first proposes a comprehensive and interdisciplinary HCAI methodological framework integrated with seven components, including design goals, design principles, implementation approaches, design paradigms, interdisciplinary teams, methods, and processes. THe implications of the framework are also discussed. This paper also presents a "three-layer" approach to facilitate the implementation of the framework. We believe the proposed framework is systematic and executable, which can overcome the weaknesses in current frameworks and the challenges currently faced in implementing HCAI. Thus, the framework can help put it into action to develop, transfer, and implement HCAI in practice, eventually enabling the design, development, and deployment of HCAI-based intelligent systems.

Generative AI and US Intellectual Property Law

  • paper_url: http://arxiv.org/abs/2311.16023
  • repo_url: None
  • paper_authors: Cherie M Poland
  • for: Examines the legal and ethical questions raised by generative AI, including artists' rights, content production, data collection, privacy, accuracy of information, and intellectual property rights.
  • methods: Reviews recent administrative and case law challenges concerning whether generative AI software systems hold independent intellectual property rights in the content they generate.
  • results: The legal and ethical questions remain unsettled; early court rulings are mixed, and it is unclear whether and to what degree human content creators can retain their intellectual property rights against generative AI software, its developers, operators, and owners.
    Abstract The rapidity with which generative AI has been adopted and advanced has raised legal and ethical questions related to the impact on artists rights, content production, data collection, privacy, accuracy of information, and intellectual property rights. Recent administrative and case law challenges have shown that generative AI software systems do not have independent intellectual property rights in the content that they generate. It remains to be seen whether human content creators can retain their intellectual property rights against generative AI software, its developers, operators, and owners for the misappropriation of the work of human creatives, given the metes and bounds of existing law. Early signs from various courts are mixed as to whether and to what degree the results generated by AI models meet the legal standards of infringement under existing law.

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

  • paper_url: http://arxiv.org/abs/2311.16502
  • repo_url: https://github.com/MMMU-Benchmark/MMMU
  • paper_authors: Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, Wenhu Chen
  • for: To evaluate multimodal models on massive multi-discipline tasks that demand college-level subject knowledge and deliberate reasoning.
  • methods: The benchmark comprises 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. The questions span 30 subjects and 183 subfields and include 30 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures.
  • results: An evaluation of 14 open-source LMMs and the proprietary GPT-4V shows that MMMU poses substantial challenges: even the advanced GPT-4V achieves only 56% accuracy, indicating significant room for improvement. The authors expect MMMU to stimulate the community to build next-generation multimodal foundation models toward expert AGI.
    Abstract We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions span 30 subjects and 183 subfields, comprising 30 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures. Unlike existing benchmarks, MMMU focuses on advanced perception and reasoning with domain-specific knowledge, challenging models to perform tasks akin to those faced by experts. Our evaluation of 14 open-source LMMs and the proprietary GPT-4V(ision) highlights the substantial challenges posed by MMMU. Even the advanced GPT-4V only achieves a 56% accuracy, indicating significant room for improvement. We believe MMMU will stimulate the community to build next-generation multimodal foundation models towards expert artificial general intelligence.

RIDE: Real-time Intrusion Detection via Explainable Machine Learning Implemented in a Memristor Hardware Architecture

  • paper_url: http://arxiv.org/abs/2311.16018
  • repo_url: None
  • paper_authors: Jingdi Chen, Lei Zhang, Joseph Riem, Gina Adam, Nathaniel D. Bastian, Tian Lan
  • for: This paper proposes a deep learning-based network intrusion detection solution for real-time detection of malicious traffic behavior patterns in high-speed communication networks.
  • methods: A recurrent autoencoder integrates an arbitrary-length sequence of packets into a compact joint feature embedding, which is fed into a DNN-based classifier. A software-hardware co-design approach then converts the learned detection policies into decision trees and implements them on an emerging memristor-based architecture.
  • results: The approach retains high detection accuracy while sharply reducing computation time and resource consumption, enabling real-time detection. On real-world datasets (e.g., the UNSW and CIC-IDS datasets) it achieves nearly three-nines detection accuracy with a speedup of nearly four orders of magnitude.
    Abstract Deep Learning (DL) based methods have shown great promise in network intrusion detection by identifying malicious network traffic behavior patterns with high accuracy, but their applications to real-time, packet-level detections in high-speed communication networks are challenging due to the high computation time and resource requirements of Deep Neural Networks (DNNs), as well as lack of explainability. To this end, we propose a packet-level network intrusion detection solution that makes novel use of Recurrent Autoencoders to integrate an arbitrary-length sequence of packets into a more compact joint feature embedding, which is fed into a DNN-based classifier. To enable explainability and support real-time detections at micro-second speed, we further develop a Software-Hardware Co-Design approach to efficiently realize the proposed solution by converting the learned detection policies into decision trees and implementing them using an emerging architecture based on memristor devices. By jointly optimizing associated software and hardware constraints, we show that our approach leads to an extremely efficient, real-time solution with high detection accuracy at the packet level. Evaluation results on real-world datasets (e.g., UNSW and CIC-IDS datasets) demonstrate nearly three-nines detection accuracy with a substantial speedup of nearly four orders of magnitude.
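The software-hardware co-design step, distilling a learned detector into a decision tree, can be illustrated with scikit-learn. The random embeddings below stand in for the recurrent autoencoder's outputs, and the tree fit on the DNN's predictions is the policy that a memristor crossbar would then realize as parallel threshold comparisons.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Stand-in for autoencoder embeddings of packet sequences (label: benign/attack).
rng = np.random.default_rng(0)
X = rng.normal(size=(4000, 32))
y = (X[:, :4].sum(axis=1) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 1) Train the DNN-based detector on the embeddings.
dnn = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300, random_state=0)
dnn.fit(X_tr, y_tr)

# 2) Distill the learned policy into a decision tree by fitting the tree
#    on the DNN's *predictions* rather than the raw labels.
tree = DecisionTreeClassifier(max_depth=8, random_state=0)
tree.fit(X_tr, dnn.predict(X_tr))

print("DNN accuracy: ", accuracy_score(y_te, dnn.predict(X_te)))
print("tree fidelity:", accuracy_score(dnn.predict(X_te), tree.predict(X_te)))
```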

Decoding Logic Errors: A Comparative Study on Bug Detection by Students and Large Language Models

  • paper_url: http://arxiv.org/abs/2311.16017
  • repo_url: None
  • paper_authors: Stephen MacNeil, Paul Denny, Andrew Tran, Juho Leinonen, Seth Bernstein, Arto Hellas, Sami Sarsa, Joanne Kim
  • for: This study investigates whether large language models (LLMs) can automatically detect logic errors and provide novice-friendly explanations for them.
  • methods: Two popular LLMs (GPT-3 and GPT-4) are evaluated on a logic error detection task and compared with a large cohort of introductory computing students (n=964) solving the same task.
  • results: The current generation of LLMs improves significantly over the previous one, and both generations significantly outperform the students. The paper outlines how such models could be integrated into computing education tools to support students learning to program.
    Abstract Identifying and resolving logic errors can be one of the most frustrating challenges for novices programmers. Unlike syntax errors, for which a compiler or interpreter can issue a message, logic errors can be subtle. In certain conditions, buggy code may even exhibit correct behavior -- in other cases, the issue might be about how a problem statement has been interpreted. Such errors can be hard to spot when reading the code, and they can also at times be missed by automated tests. There is great educational potential in automatically detecting logic errors, especially when paired with suitable feedback for novices. Large language models (LLMs) have recently demonstrated surprising performance for a range of computing tasks, including generating and explaining code. These capabilities are closely linked to code syntax, which aligns with the next token prediction behavior of LLMs. On the other hand, logic errors relate to the runtime performance of code and thus may not be as well suited to analysis by LLMs. To explore this, we investigate the performance of two popular LLMs, GPT-3 and GPT-4, for detecting and providing a novice-friendly explanation of logic errors. We compare LLM performance with a large cohort of introductory computing students $(n=964)$ solving the same error detection task. Through a mixed-methods analysis of student and model responses, we observe significant improvement in logic error identification between the previous and current generation of LLMs, and find that both LLM generations significantly outperform students. We outline how such models could be integrated into computing education tools, and discuss their potential for supporting students when learning programming.

Forecasting Auxiliary Energy Consumption for Electric Heavy-Duty Vehicles

  • paper_url: http://arxiv.org/abs/2311.16003
  • repo_url: None
  • paper_authors: Yuantao Fan, Zhenkan Wang, Sepideh Pashami, Slawomir Nowaczyk, Henrik Ydreskog
  • for: Accurate energy consumption prediction is crucial for optimizing the operation of electric commercial heavy-duty vehicles, e.g., route planning for charging, and understanding why certain predictions are cast is paramount for user trust and practical deployment. Because commercial vehicles differ in transportation tasks, ambient conditions, and drivers, a heterogeneous population must be handled when building an AI system for forecasting energy consumption.
  • methods: Multiple regression models are trained on subsets of the data, addressing the misleading results that existing XAI methods such as LIME or SHAP can produce when faced with heterogeneous populations.
  • results: Experiments on both synthetic and real-world datasets show that splitting the complex problem into simpler ones yields better regression performance as well as more relevant and consistent explanations.
    Abstract Accurate energy consumption prediction is crucial for optimizing the operation of electric commercial heavy-duty vehicles, e.g., route planning for charging. Moreover, understanding why certain predictions are cast is paramount for such a predictive model to gain user trust and be deployed in practice. Since commercial vehicles operate differently as transportation tasks, ambient, and drivers vary, a heterogeneous population is expected when building an AI system for forecasting energy consumption. The dependencies between the input features and the target values are expected to also differ across sub-populations. One well-known example of such a statistical phenomenon is the Simpson paradox. In this paper, we illustrate that such a setting poses a challenge for existing XAI methods that produce global feature statistics, e.g. LIME or SHAP, causing them to yield misleading results. We demonstrate a potential solution by training multiple regression models on subsets of data. It not only leads to superior regression performance but also more relevant and consistent LIME explanations. Given that the employed groupings correspond to relevant sub-populations, the associations between the input features and the target values are consistent within each cluster but different across clusters. Experiments on both synthetic and real-world datasets show that such splitting of a complex problem into simpler ones yields better regression performance and interpretability.
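A toy reproduction of the Simpson-paradox failure mode and the proposed fix: two sub-populations with opposite feature-target slopes defeat a single global regressor, whereas clustering first and fitting one model per cluster recovers the consistent within-group relationships (explanations such as LIME would likewise be run per cluster). All data here is synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Two vehicle sub-populations with *opposite* usage->consumption slopes;
# `d` is a noisy task descriptor separating the groups.
x = rng.uniform(0, 1, 1000)
d = np.repeat([0.0, 1.0], 500) + rng.normal(scale=0.05, size=1000)
y = np.where(d < 0.5, 2.0 * x + 1.0, -2.0 * x + 4.0)

# A single global model averages the groups away ...
print("global slope:", LinearRegression().fit(x[:, None], y).coef_[0])  # near 0

# ... while per-cluster models recover the true within-group slopes.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(d[:, None])
for c in (0, 1):
    m = LinearRegression().fit(x[labels == c, None], y[labels == c])
    print(f"cluster {c} slope:", m.coef_[0])                 # near +2 and -2
```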

InstructMol: Multi-Modal Integration for Building a Versatile and Reliable Molecular Assistant in Drug Discovery

  • paper_url: http://arxiv.org/abs/2311.16208
  • repo_url: https://github.com/IDEA-XL/InstructMol
  • paper_authors: He Cao, Zijing Liu, Xingyu Lu, Yuan Yao, Yu Li
  • for: This paper addresses the challenges of generalization and extensive training that AI faces in drug discovery.
  • methods: InstructMol is a multi-modal large language model that aligns molecular structures with natural language via an instruction-tuning approach, using a two-stage training strategy that combines limited domain-specific data with molecular and textual information.
  • results: InstructMol achieves substantial performance improvements on drug discovery-related molecular tasks, surpassing leading LLMs and significantly narrowing the gap with specialized models, thereby establishing a foundation for a versatile and reliable drug discovery assistant.
    Abstract The rapid evolution of artificial intelligence in drug discovery encounters challenges with generalization and extensive training, yet Large Language Models (LLMs) offer promise in reshaping interactions with complex molecular data. Our novel contribution, InstructMol, a multi-modal LLM, effectively aligns molecular structures with natural language via an instruction-tuning approach, utilizing a two-stage training strategy that adeptly combines limited domain-specific data with molecular and textual information. InstructMol showcases substantial performance improvements in drug discovery-related molecular tasks, surpassing leading LLMs and significantly reducing the gap with specialized models, thereby establishing a robust foundation for a versatile and dependable drug discovery assistant.

Unified Batch Normalization: Identifying and Alleviating the Feature Condensation in Batch Normalization and a Unified Framework

  • paper_url: http://arxiv.org/abs/2311.15993
  • repo_url: None
  • paper_authors: Shaobo Wang, Xiangdong Zhang, Junchi Yan
  • for: Improving the training stability and performance of deep neural networks.
  • methods: A simple feature condensation threshold alleviates the feature condensation that arises when using batch normalization, and various normalization variants are unified to boost each component of BN.
  • results: The method significantly improves performance across different visual backbones and notably speeds up network training convergence, particularly in early training stages, improving top-1 accuracy on ImageNet classification by about 3% with large batch sizes.
    Abstract Batch Normalization (BN) has become an essential technique in contemporary neural network design, enhancing training stability. Specifically, BN employs centering and scaling operations to standardize features along the batch dimension and uses an affine transformation to recover features. Although standard BN has shown its capability to improve deep neural network training and convergence, it still exhibits inherent limitations in certain cases. Most existing techniques that enhance BN consider a single or a few aspects of BN. In this paper, we first identify problems with BN from a feature perspective and explore that feature condensation exists in the learning when employing BN, which negatively affects testing performance. To tackle this problem, we propose a two-stage unified framework called Unified Batch Normalization (UBN). In the first stage, we utilize a simple feature condensation threshold to alleviate the feature condensation, which hinders inappropriate statistic updates in normalization. In the second stage, we unify various normalization variants to boost each component of BN. Our experimental results reveal that UBN significantly enhances performance across different visual backbones and notably expedites network training convergence, particularly in early training stages. Notably, our method improved about 3% in top-1 accuracy on ImageNet classification with large batch sizes, showing the effectiveness of our approach in real-world scenarios.
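One way to read the "feature condensation threshold" is as a guard on batch-statistic updates: when the samples in a batch become nearly identical, the batch statistics are unreliable, so the layer falls back to its running statistics instead of updating them. The sketch below implements that reading as an illustration; it is an interpretation, not the paper's exact UBN.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CondensationAwareBN(nn.Module):
    """BN wrapper that skips batch-statistic updates when the batch's
    features are too condensed (mean pairwise cosine similarity above a
    threshold), falling back to the running statistics instead."""
    def __init__(self, num_features, threshold=0.95):
        super().__init__()
        self.bn = nn.BatchNorm1d(num_features)
        self.threshold = threshold

    def forward(self, x):                      # (B, C)
        if self.training:
            z = F.normalize(x, dim=1)
            sim = z @ z.t()                    # pairwise cosine similarities
            B = x.size(0)
            mean_sim = (sim.sum() - B) / (B * (B - 1))
            if mean_sim > self.threshold:      # condensed batch: skip the update
                self.bn.eval()
                out = self.bn(x)
                self.bn.train()
                return out
        return self.bn(x)

bn = CondensationAwareBN(16)
bn.train()
print(bn(torch.randn(8, 16)).shape)
```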

CoSeR: Bridging Image and Language for Cognitive Super-Resolution

  • paper_url: http://arxiv.org/abs/2311.16512
  • repo_url: https://github.com/CoSeR-main/CoSeR-main.github.io
  • paper_authors: Haoze Sun, Wenbo Li, Jianzhuang Liu, Haoyu Chen, Renjing Pei, Xueyi Zou, Youliang Yan, Yujiu Yang
  • for: To improve the semantic fidelity of super-resolution (SR) models by giving them the capacity to comprehend the global semantics of low-resolution images.
  • methods: The Cognitive Super-Resolution (CoSeR) framework marries image appearance and language understanding to generate a cognitive embedding, which activates prior information from large text-to-image diffusion models and facilitates the generation of high-quality reference images to guide the SR process. A novel "All-in-Attention" condition injection scheme consolidates all conditional information into a single module.
  • results: Experiments show that CoSeR restores semantically correct and photorealistic details, achieving state-of-the-art performance across multiple benchmarks.
    Abstract Existing super-resolution (SR) models primarily focus on restoring local texture details, often neglecting the global semantic information within the scene. This oversight can lead to the omission of crucial semantic details or the introduction of inaccurate textures during the recovery process. In our work, we introduce the Cognitive Super-Resolution (CoSeR) framework, empowering SR models with the capacity to comprehend low-resolution images. We achieve this by marrying image appearance and language understanding to generate a cognitive embedding, which not only activates prior information from large text-to-image diffusion models but also facilitates the generation of high-quality reference images to optimize the SR process. To further improve image fidelity, we propose a novel condition injection scheme called "All-in-Attention", consolidating all conditional information into a single module. Consequently, our method successfully restores semantically correct and photorealistic details, demonstrating state-of-the-art performance across multiple benchmarks.

Sparsify-then-Classify: From Internal Neurons of Large Language Models To Efficient Text Classifiers

  • paper_url: http://arxiv.org/abs/2311.15983
  • repo_url: https://github.com/difanj0713/sparsify-then-classify
  • paper_authors: Yilun Liu, Difan Jiao, Ashton Anderson
  • for: Improving the performance and interpretability of pretrained large language models (LLMs) on text classification tasks.
  • methods: Multiple pooling strategies aggregate activations and hidden states across all layers; the Sparsify-then-Classify (STC) strategy first sparsifies task-specific features layer by layer, then aggregates across layers for classification.
  • results: Experiments across a range of models and datasets show that STC consistently improves the classification performance of pretrained and fine-tuned models while being more efficient for training and inference and more intrinsically interpretable.
    Abstract Among the many tasks that Large Language Models (LLMs) have revolutionized is text classification. However, existing approaches for applying pretrained LLMs to text classification predominantly rely on using single token outputs from only the last layer of hidden states. As a result, they suffer from limitations in efficiency, task-specificity, and interpretability. In our work, we contribute an approach that uses all internal representations by employing multiple pooling strategies on all activation and hidden states. Our novel lightweight strategy, Sparsify-then-Classify (STC) first sparsifies task-specific features layer-by-layer, then aggregates across layers for text classification. STC can be applied as a seamless plug-and-play module on top of existing LLMs. Our experiments on a comprehensive set of models and datasets demonstrate that STC not only consistently improves the classification performance of pretrained and fine-tuned models, but is also more efficient for both training and inference, and is more intrinsically interpretable.
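A rough approximation of the pool-then-sparsify idea using Hugging Face Transformers and scikit-learn: every layer's hidden states are mean-pooled and concatenated, and an L1-penalized classifier plays the role of the sparsification step. The model name, pooling choice, and regularization strength are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

name = "distilbert-base-uncased"               # any encoder works; an assumption
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

def layerwise_features(texts):
    """Mean-pool every layer's hidden states and concatenate across layers,
    instead of using only the last layer."""
    feats = []
    for text in texts:
        inputs = tok(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hs = model(**inputs, output_hidden_states=True).hidden_states
        pooled = [h.mean(dim=1).squeeze(0) for h in hs]   # one vector per layer
        feats.append(torch.cat(pooled).numpy())
    return np.stack(feats)

texts = ["great movie", "terrible plot", "loved it", "waste of time"]
labels = [1, 0, 1, 0]
X = layerwise_features(texts)
# The L1 penalty zeroes out most layer/neuron features, leaving a small,
# interpretable subset for classification.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, labels)
print("non-zero features:", int((clf.coef_ != 0).sum()), "of", X.shape[1])
```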

  • paper_url: http://arxiv.org/abs/2311.15979
  • repo_url: None
  • paper_authors: Weiying Zhao, Natalia Efremova
  • for: 这个研究旨在精确估计土壤有机碳(SOC),以提供可持续的土地和农业管理。
  • methods: 本研究使用Graph Neural Networks (GNNs)和高分辨率卫星地图,并与位置编码器结合,以捕捉土壤和气候特征之间的复杂关系。
  • results: 研究结果显示,使用LUCAS数据库,PESAGE和PETransformer模型能够更好地估计SOC,表明这些模型能够 Capture复杂的SOC和气候特征之间关系。
    Abstract Soil organic carbon (SOC) plays a pivotal role in the global carbon cycle, impacting climate dynamics and necessitating accurate estimation for sustainable land and agricultural management. While traditional methods of SOC estimation face resolution and accuracy challenges, recent technological solutions harness remote sensing, machine learning, and high-resolution satellite mapping. Graph Neural Networks (GNNs), especially when integrated with positional encoders, can capture complex relationships between soil and climate. Using the LUCAS database, this study compared four GNN operators in the positional encoder framework. Results revealed that the PESAGE and PETransformer models outperformed others in SOC estimation, indicating their potential in capturing the complex relationship between SOC and climate features. Our findings confirm the feasibility of applications of GNN architectures in SOC prediction, establishing a framework for future explorations of this topic with more advanced GNN models.

Efficient Pre-training for Localized Instruction Generation of Videos

  • paper_url: http://arxiv.org/abs/2311.15964
  • repo_url: None
  • paper_authors: Anil Batra, Davide Moltisanti, Laura Sevilla-Lara, Marcus Rohrbach, Frank Keller
  • for: Improving the understanding of step-by-step instructional videos and the generation of textual instructions for them.
  • methods: A technique that automatically curates and improves the quality of video transcripts: irrelevant transcript content is filtered out (Sieve), and transcripts are automatically replaced with human-written instructions from a text-only recipe dataset (Swap). A Procedure Transformer (ProcX) then performs end-to-end step localization and instruction generation.
  • results: Pretraining on the curated dataset, three orders of magnitude smaller than current web-scale datasets, yields state-of-the-art performance in zero-shot and fine-tuning settings on YouCook2 and Tasty while using a fraction of the computational resources.
    Abstract Procedural videos show step-by-step demonstrations of tasks like recipe preparation. Understanding such videos is challenging, involving the precise localization of steps and the generation of textual instructions. Manually annotating steps and writing instructions is costly, which limits the size of current datasets and hinders effective learning. Leveraging large but noisy video-transcript datasets for pre-training can boost performance, but demands significant computational resources. Furthermore, transcripts contain irrelevant content and exhibit style variation compared to instructions written by human annotators. To mitigate both issues, we propose a technique, Sieve-&-Swap, to automatically curate a smaller dataset: (i) Sieve filters irrelevant transcripts and (ii) Swap enhances the quality of the text instruction by automatically replacing the transcripts with human-written instructions from a text-only recipe dataset. The curated dataset, three orders of magnitude smaller than current web-scale datasets, enables efficient training of large-scale models with competitive performance. We complement our Sieve-\&-Swap approach with a Procedure Transformer (ProcX) for end-to-end step localization and instruction generation for procedural videos. When this model is pre-trained on our curated dataset, it achieves state-of-the-art performance in zero-shot and finetuning settings on YouCook2 and Tasty, while using a fraction of the computational resources.
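A bare-bones TF-IDF version of the curation idea (the paper's actual matching is presumably stronger): transcript lines with no counterpart in a text-only recipe corpus are sieved out, and the rest are swapped for their nearest human-written instruction. The 0.2 threshold and the tiny corpora are arbitrary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Noisy ASR transcript lines from a cooking video, and a text-only recipe
# corpus of human-written instructions.
transcripts = [
    "okay so um hit that subscribe button",
    "now we chop the onions real fine",
    "pour the mixture into the pan",
]
recipe_corpus = [
    "Chop the onions finely.",
    "Pour the batter into a greased pan.",
    "Whisk the eggs with sugar.",
]

vec = TfidfVectorizer().fit(transcripts + recipe_corpus)
sim = cosine_similarity(vec.transform(transcripts), vec.transform(recipe_corpus))

curated = []
for i, sent in enumerate(transcripts):
    j = sim[i].argmax()
    if sim[i, j] < 0.2:                  # Sieve: drop lines unrelated to any step
        continue
    curated.append(recipe_corpus[j])     # Swap: keep the human-written instruction
print(curated)
```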

Addressing Long-Horizon Tasks by Integrating Program Synthesis and State Machines

  • paper_url: http://arxiv.org/abs/2311.15960
  • repo_url: None
  • paper_authors: Yu-An Lin, Chen-Tao Lee, Guan-Ting Liu, Pu-Jen Cheng, Shao-Hua Sun
  • for: Addressing the limited generalizability and interpretability of deep reinforcement learning, particularly on long-horizon tasks.
  • methods: Program Machine Policies (POMPs) bridge programmatic RL and state machine policies so that complex behaviors and long-horizon tasks can be represented. A retrieval method finds a set of effective, diverse, and compatible programs, which serve as the modes of a state machine; a learned transition function moves between mode programs to capture long-horizon repetitive behaviors.
  • results: The framework outperforms programmatic RL and deep RL baselines on various tasks and inductively generalizes to longer horizons without any fine-tuning; ablation studies confirm the effectiveness of the program retrieval algorithm.
    Abstract Deep reinforcement learning excels in various domains but lacks generalizability and interoperability. Programmatic RL methods (Trivedi et al., 2021; Liu et al., 2023) reformulate solving RL tasks as synthesizing interpretable programs that can be executed in the environments. Despite encouraging results, these methods are limited to short-horizon tasks. On the other hand, representing RL policies using state machines (Inala et al., 2020) can inductively generalize to long-horizon tasks; however, it struggles to scale up to acquire diverse and complex behaviors. This work proposes Program Machine Policies (POMPs), which bridge the advantages of programmatic RL and state machine policies, allowing for the representation of complex behaviors and the address of long-term tasks. Specifically, we introduce a method that can retrieve a set of effective, diverse, compatible programs. Then, we use these programs as modes of a state machine and learn a transition function to transition among mode programs, allowing for capturing long-horizon repetitive behaviors. Our proposed framework outperforms programmatic RL and deep RL baselines on various tasks and demonstrates the ability to generalize to even longer horizons without any fine-tuning inductively. Ablation studies justify the effectiveness of our proposed search algorithm for retrieving a set of programs as modes.

CheapNET: Improving Light-weight speech enhancement network by projected loss function

  • paper_url: http://arxiv.org/abs/2311.15959
  • repo_url: None
  • paper_authors: Kaijun Tan, Benzhe Dai, Jiakui Li, Wenyu Mao
  • for: Improving speech quality via noise suppression and echo cancellation.
  • methods: A projection loss function, diverging from MSE, isolates key audio components from noise; for echo cancellation, the model makes direct predictions on LAEC pre-processed outputs.
  • results: The noise suppression model achieves near state-of-the-art results with only 3.1M parameters and a 0.4 GFLOPs/s computational load, and the echo cancellation model outperforms replicated industry-leading models.
    Abstract Noise suppression and echo cancellation are critical in speech enhancement and essential for smart devices and real-time communication. Deployed in voice processing front-ends and edge devices, these algorithms must ensure efficient real-time inference with low computational demands. Traditional edge-based noise suppression often uses MSE-based amplitude spectrum mask training, but this approach has limitations. We introduce a novel projection loss function, diverging from MSE, to enhance noise suppression. This method uses projection techniques to isolate key audio components from noise, significantly improving model performance. For echo cancellation, the function enables direct predictions on LAEC pre-processed outputs, substantially enhancing performance. Our noise suppression model achieves near state-of-the-art results with only 3.1M parameters and 0.4GFlops/s computational load. Moreover, our echo cancellation model outperforms replicated industry-leading models, introducing a new perspective in speech enhancement.
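The projection idea, splitting an estimate into a component along the clean reference and an orthogonal residual whose energy is penalized, can be written in a few lines of PyTorch. Note that this formulation matches the widely used SI-SNR objective and is offered in the spirit of the abstract, not as the paper's exact loss.

```python
import torch

def projection_loss(estimate, clean, eps=1e-8):
    """Project the estimate onto the clean reference; penalize the energy
    of the orthogonal residual relative to the projected component."""
    clean = clean - clean.mean(dim=-1, keepdim=True)
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    dot = (estimate * clean).sum(dim=-1, keepdim=True)
    s_target = dot * clean / (clean.pow(2).sum(dim=-1, keepdim=True) + eps)
    e_noise = estimate - s_target                       # orthogonal residual
    ratio = s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps)
    return -10.0 * torch.log10(ratio + eps).mean()

clean = torch.randn(4, 16000)                  # 1 s of clean speech at 16 kHz
noisy_est = clean + 0.1 * torch.randn(4, 16000)
print(projection_loss(noisy_est, clean))       # better estimates give lower loss
```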

Replay across Experiments: A Natural Extension of Off-Policy RL

  • paper_url: http://arxiv.org/abs/2311.15951
  • repo_url: None
  • paper_authors: Dhruva Tirumala, Thomas Lampe, Jose Enrique Chen, Tuomas Haarnoja, Sandy Huang, Guy Lever, Ben Moran, Tim Hertweck, Leonard Hasenclever, Martin Riedmiller, Nicolas Heess, Markus Wulfmeier
  • for: Improving controller performance and shortening research iteration times.
  • methods: Experience from previous experiments is reused to improve exploration and bootstrap learning, with minimal changes to the existing off-policy RL workflow.
  • results: Benefits are shown across multiple RL algorithms and challenging control domains spanning locomotion and manipulation, including hard-exploration tasks from egocentric vision; ablations demonstrate robustness to the quality and amount of available data and to hyperparameter choices.
    Abstract Replaying data is a principal mechanism underlying the stability and data efficiency of off-policy reinforcement learning (RL). We present an effective yet simple framework to extend the use of replays across multiple experiments, minimally adapting the RL workflow for sizeable improvements in controller performance and research iteration times. At its core, Replay Across Experiments (RaE) involves reusing experience from previous experiments to improve exploration and bootstrap learning while reducing required changes to a minimum in comparison to prior work. We empirically show benefits across a number of RL algorithms and challenging control domains spanning both locomotion and manipulation, including hard exploration tasks from egocentric vision. Through comprehensive ablations, we demonstrate robustness to the quality and amount of data available and various hyperparameter choices. Finally, we discuss how our approach can be applied more broadly across research life cycles and can increase resilience by reloading data across random seeds or hyperparameter variations.
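Because RaE is deliberately a minimal change to the off-policy workflow, it fits in a short buffer class: each sampled batch mixes transitions loaded from previous experiments with fresh ones from the current run. The 50/50 mixing fraction below is an illustrative choice, not the paper's.

```python
import random
from collections import deque

class MixedReplayBuffer:
    """Replay buffer that samples a fixed fraction of each batch from data
    saved by *previous* experiments, the rest from the current run."""
    def __init__(self, prior_transitions, capacity=100_000, prior_frac=0.5):
        self.prior = list(prior_transitions)   # loaded from earlier experiments
        self.fresh = deque(maxlen=capacity)    # filled by the current run
        self.prior_frac = prior_frac

    def add(self, transition):
        self.fresh.append(transition)

    def sample(self, batch_size):
        n_prior = int(batch_size * self.prior_frac) if self.prior else 0
        n_prior = min(n_prior, len(self.prior))
        batch = random.sample(self.prior, n_prior)
        batch += random.sample(list(self.fresh),
                               min(batch_size - n_prior, len(self.fresh)))
        return batch

buf = MixedReplayBuffer(prior_transitions=[("s", "a", 1.0, "s2")] * 1000)
for _ in range(64):
    buf.add(("s", "a", 0.0, "s2"))
print(len(buf.sample(32)))
```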

Auto-CsiNet: Scenario-customized Automatic Neural Network Architecture Generation for Massive MIMO CSI Feedback

  • paper_url: http://arxiv.org/abs/2311.15950
  • repo_url: None
  • paper_authors: Xiangyi Li, Jiajia Guo, Chao-Kai Wen, Shi Jin
  • for: Automatically generating scenario-customized neural network architectures for channel state information (CSI) feedback in wireless communications, to get the best performance out of deep learning in each specific environment.
  • methods: Neural architecture search (NAS), implemented with automated machine learning and gradient descent, automates the architecture design process and integrates implicit scene knowledge in a data-driven manner; early stopping and elastic selection mechanisms curb excessive search and improve efficiency.
  • results: The automatically generated architecture, Auto-CsiNet, outperforms manually designed models in both reconstruction performance (about a 14% improvement) and complexity (about a 50% reduction).
    Abstract Deep learning has revolutionized the design of the channel state information (CSI) feedback module in wireless communications. However, designing the optimal neural network (NN) architecture for CSI feedback can be a laborious and time-consuming process. Manual design can be prohibitively expensive for customizing NNs to different scenarios. This paper proposes using neural architecture search (NAS) to automate the generation of scenario-customized CSI feedback NN architectures, thereby maximizing the potential of deep learning in exclusive environments. By employing automated machine learning and gradient-descent-based NAS, an efficient and cost-effective architecture design process is achieved. The proposed approach leverages implicit scene knowledge, integrating it into the scenario customization process in a data-driven manner, and fully exploits the potential of deep learning for each specific scenario. To address the issue of excessive search, early stopping and elastic selection mechanisms are employed, enhancing the efficiency of the proposed scheme. The experimental results demonstrate that the automatically generated architecture, known as Auto-CsiNet, outperforms manually-designed models in both reconstruction performance (achieving approximately a 14% improvement) and complexity (reducing it by approximately 50%). Furthermore, the paper analyzes the impact of the scenario on the NN architecture and its capacity.

A new fuzzy multi-attribute group decision-making method based on TOPSIS and optimization models

  • paper_url: http://arxiv.org/abs/2311.15933
  • repo_url: None
  • paper_authors: Qixiao Hu, Shiquan Zhang, Chaolang Hu, Yuetong Liu
  • for: Solving multi-attribute group decision-making problems in the environment of interval-valued intuitionistic fuzzy sets, based on TOPSIS and optimization models.
  • methods: Expert weights are determined by an optimization model that minimizes the sum of differences between individual evaluations and the overall consistent evaluations of all experts; an improved closeness index for evaluating each alternative is obtained based on the TOPSIS method; and attribute weights are determined by an optimization model that maximizes the closeness of each alternative, after which the alternatives are ranked by the resulting closeness index.
  • results: The method combines the advantages of subjective and objective weighting; a real case study verifies its feasibility and effectiveness.
    Abstract In this paper, a new method based on TOPSIS and optimization models is proposed for multi-attribute group decision-making in the environment of interval-valued intuitionistic fuzzy sets.Firstly, by minimizing the sum of differences between individual evaluations and the overallconsistent evaluations of all experts, a new optimization model is established for determining expert weights. Secondly, based on TOPSIS method, the improved closeness index for evaluating each alternative is obtained. Finally, the attribute weight is determined by establishing an optimization model with the goal of maximizing the closeness of each alternative, and it is brought into the closeness index so that the alternatives can be ranked. Combining all these together, the complete fuzzy multi-attribute group decision-making algorithm is formulated, which can give full play to the advantages of subjective and objective weighting methods. In the end, the feasibility and effectiveness of the provided method are verified by a real case study.
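For readers unfamiliar with the closeness index at the heart of the method, here is classical crisp TOPSIS on a toy 3-alternative, 3-attribute problem. The paper works in the interval-valued intuitionistic fuzzy setting and additionally optimizes the expert and attribute weights, which this sketch takes as given.

```python
import numpy as np

# Decision matrix: 3 alternatives x 3 benefit attributes, with given weights.
X = np.array([[7.0, 9.0, 8.0],
              [8.0, 7.0, 6.0],
              [9.0, 6.0, 9.0]])
w = np.array([0.5, 0.3, 0.2])

R = X / np.linalg.norm(X, axis=0)            # vector-normalize each column
V = R * w                                     # weighted normalized matrix
ideal, anti = V.max(axis=0), V.min(axis=0)    # positive / negative ideal solutions
d_pos = np.linalg.norm(V - ideal, axis=1)
d_neg = np.linalg.norm(V - anti, axis=1)
closeness = d_neg / (d_pos + d_neg)           # relative closeness index
print("closeness:", closeness.round(3))
print("ranking (best first):", np.argsort(-closeness) + 1)
```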

WorldSense: A Synthetic Benchmark for Grounded Reasoning in Large Language Models

  • paper_url: http://arxiv.org/abs/2311.15930
  • repo_url: https://github.com/facebookresearch/worldsense
  • paper_authors: Youssef Benchekroun, Megi Dervishi, Mark Ibrahim, Jean-Baptiste Gaya, Xavier Martinet, Grégoire Mialon, Thomas Scialom, Emmanuel Dupoux, Dieuwke Hupkes, Pascal Vincent
  • for: To assess the extent to which LLMs can sustain tacit world models, by testing whether they can draw simple inferences from descriptions of simple arrangements of entities.
  • methods: WorldSense is a synthetic benchmark with three problem types, each with its own trivial control, that avoids bias by decorrelating the abstract structure of problems from vocabulary and expressions and by decorrelating all problem subparts from the correct response.
  • results: Three state-of-the-art chat LLMs (GPT-3.5, GPT-4, and Llama2-chat) make errors even with as few as three objects and exhibit heavy response biases regardless of the question; errors persist with chain-of-thought prompting and in-context learning, and while fine-tuning on similar problems yields substantial in- and out-of-distribution improvements, the fine-tuned models do not generalize beyond a constrained problem space.
    Abstract We propose WorldSense, a benchmark designed to assess the extent to which LLMs are consistently able to sustain tacit world models, by testing how they draw simple inferences from descriptions of simple arrangements of entities. Worldsense is a synthetic benchmark with three problem types, each with their own trivial control, which explicitly avoids bias by decorrelating the abstract structure of problems from the vocabulary and expressions, and by decorrelating all problem subparts with the correct response. We run our benchmark on three state-of-the-art chat-LLMs (GPT3.5, GPT4 and Llama2-chat) and show that these models make errors even with as few as three objects. Furthermore, they have quite heavy response biases, preferring certain responses irrespective of the question. Errors persist even with chain-of-thought prompting and in-context learning. Lastly, we show that while finetuning on similar problems does result in substantial improvements -- within- and out-of-distribution -- the finetuned models do not generalise beyond a constraint problem space.

Reinforcement Learning for Wildfire Mitigation in Simulated Disaster Environments

  • paper_url: http://arxiv.org/abs/2311.15925
  • repo_url: https://github.com/mitrefireline/simfire
  • paper_authors: Alexander Tapley, Marissa Dotter, Michael Doyle, Aidan Fennelly, Dhanuj Gandikota, Savanna Smith, Michael Threet, Tim Welsh
  • for: Providing a realistic wildfire simulator and an agent-based machine learning framework to help researchers and practitioners prepare for and react to the increasing threat of wildfires.
  • methods: SimFire is a versatile wildland fire projection simulator that generates realistic wildfire scenarios; SimHarness is a modular agent-based machine learning wrapper that automatically generates land management strategies within SimFire to reduce the overall damage to an area.
  • results: Together, this publicly available system lets researchers and practitioners emulate and assess the effectiveness of firefighter interventions and formulate strategic plans that prioritize value preservation and resource allocation optimization.
    Abstract Climate change has resulted in a year over year increase in adverse weather and weather conditions which contribute to increasingly severe fire seasons. Without effective mitigation, these fires pose a threat to life, property, ecology, cultural heritage, and critical infrastructure. To better prepare for and react to the increasing threat of wildfires, more accurate fire modelers and mitigation responses are necessary. In this paper, we introduce SimFire, a versatile wildland fire projection simulator designed to generate realistic wildfire scenarios, and SimHarness, a modular agent-based machine learning wrapper capable of automatically generating land management strategies within SimFire to reduce the overall damage to the area. Together, this publicly available system allows researchers and practitioners the ability to emulate and assess the effectiveness of firefighter interventions and formulate strategic plans that prioritize value preservation and resource allocation optimization. The repositories are available for download at https://github.com/mitrefireline.

Diagnosis driven Anomaly Detection for CPS

  • paper_url: http://arxiv.org/abs/2311.15924
  • repo_url: None
  • paper_authors: Henrik S. Steude, Lukas Moddemann, Alexander Diedrich, Jonas Ehrhardt, Oliver Niggemann
  • for: This paper proposes a deep learning-based anomaly detection method for generating suitable inputs for diagnosis.
  • methods: The paper combines deep learning-based anomaly detection with classical consistency-based diagnosis to provide a holistic diagnosis solution.
  • results: On both simulated and real-world CPS datasets, the proposed model performs strongly, improving on prior state-of-the-art anomaly detection methods.
    Abstract In Cyber-Physical Systems (CPS) research, anomaly detection (detecting abnormal behavior) and diagnosis (identifying the underlying root cause) are often treated as distinct, isolated tasks. However, diagnosis algorithms require symptoms, i.e. temporally and spatially isolated anomalies, as input. Thus, anomaly detection and diagnosis must be developed together to provide a holistic solution for diagnosis in CPS. We therefore propose a method for utilizing deep learning-based anomaly detection to generate inputs for Consistency-Based Diagnosis (CBD). We evaluate our approach on a simulated and a real-world CPS dataset, where our model demonstrates strong performance relative to other state-of-the-art models.
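As an illustration of the coupling the abstract describes, the sketch below (PyTorch, with hypothetical layer sizes and thresholds) shows how per-signal reconstruction errors from an autoencoder could be thresholded into the binary symptoms a consistency-based diagnoser consumes. It is a minimal reading of the idea, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SignalAE(nn.Module):
    """Toy autoencoder over fixed-length windows of multivariate sensor data."""
    def __init__(self, n_signals: int, window: int, latent: int = 8):
        super().__init__()
        d = n_signals * window
        self.enc = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, latent))
        self.dec = nn.Sequential(nn.Linear(latent, 64), nn.ReLU(), nn.Linear(64, d))

    def forward(self, x):  # x: (batch, n_signals * window), signals-major layout
        return self.dec(self.enc(x))

@torch.no_grad()
def symptoms_from_residuals(model, windows, thresholds):
    """Turn per-signal reconstruction errors into the binary, temporally and
    spatially isolated symptoms that a consistency-based diagnoser expects."""
    err = (windows - model(windows)).abs()
    err = err.reshape(err.shape[0], thresholds.shape[0], -1).mean(dim=2)
    return err > thresholds  # boolean symptom matrix: (batch, n_signals)
```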

A Fully Data-Driven Approach for Realistic Traffic Signal Control Using Offline Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2311.15920
  • repo_url: None
  • paper_authors: Jianxiong Li, Shichao Lin, Tianyu Shi, Chujie Tian, Yu Mei, Jian Song, Xianyuan Zhan, Ruimin Li
  • for: This work proposes a fully data-driven, simulator-free framework for realistic traffic signal control (D2TSC) that enables efficient traffic flow control in real-world systems.
  • methods: It combines well-established traffic flow theory with machine learning to build a reward inference model that infers reward signals from coarse-grained traffic data, and proposes a sample-efficient offline RL method that learns signal control policies directly from historical datasets of real-world intersections.
  • results: Extensive experiments on a real-world intersection show that the method outperforms conventional and offline RL baselines and has much better real-world applicability.
    Abstract The optimization of traffic signal control (TSC) is critical for an efficient transportation system. In recent years, reinforcement learning (RL) techniques have emerged as a popular approach for TSC and show promising results for highly adaptive control. However, existing RL-based methods suffer from notably poor real-world applicability and hardly have any successful deployments. The reasons for such failures are mostly due to the reliance on over-idealized traffic simulators for policy optimization, as well as using unrealistic fine-grained state observations and reward signals that are not directly obtainable from real-world sensors. In this paper, we propose a fully Data-Driven and simulator-free framework for realistic Traffic Signal Control (D2TSC). Specifically, we combine well-established traffic flow theory with machine learning to construct a reward inference model to infer the reward signals from coarse-grained traffic data. With the inferred rewards, we further propose a sample-efficient offline RL method to enable direct signal control policy learning from historical offline datasets of real-world intersections. To evaluate our approach, we collect historical traffic data from a real-world intersection, and develop a highly customized simulation environment that strictly follows real data characteristics. We demonstrate through extensive experiments that our approach achieves superior performance over conventional and offline RL baselines, and also enjoys much better real-world applicability.
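To make the reward-inference idea concrete, here is a toy sketch (plain NumPy) of one way coarse per-cycle detector counts could be turned into a reward via flow conservation. The queue-based reward and its weighting are assumptions for illustration, not the paper's actual model.

```python
import numpy as np

def infer_rewards(inflow, outflow, w=1.0, q0=0.0):
    """Hypothetical reward inference from coarse per-cycle detector counts,
    using flow conservation: the queue grows by (arrivals - departures) and
    the reward is the negative inferred queue, so shorter queues score higher."""
    inflow, outflow = np.asarray(inflow, float), np.asarray(outflow, float)
    rewards, q = np.empty(len(inflow)), q0
    for t in range(len(inflow)):
        q = max(0.0, q + inflow[t] - outflow[t])  # queue cannot go negative
        rewards[t] = -w * q                       # one scalar reward per cycle
    return rewards

# Rewards fall as arrivals outpace departures:
print(infer_rewards(inflow=[10, 12, 15, 8], outflow=[10, 10, 10, 12]))
```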

Continual Instruction Tuning for Large Multimodal Models

  • paper_url: http://arxiv.org/abs/2311.16206
  • repo_url: None
  • paper_authors: Jinghan He, Haiyun Guo, Ming Tang, Jinqiao Wang
  • for: This work studies whether large multimodal models (LMMs) suffer from catastrophic forgetting during continual instruction tuning, and whether the three existing classes of continual learning methods still apply in this setting.
  • methods: It establishes the first benchmark for continual instruction tuning of LMMs, examines multi-task joint instruction tuning, and integrates data replay and model expansion strategies to mitigate forgetting, including task-similarity-informed regularization and model expansion.
  • results: Experiments show that multi-task joint instruction tuning improves the model's continual learning ability, and that data replay and model expansion strategies yield significant gains across diverse scenarios, while regularization-based methods only work well on jointly instruction-tuned models.
    Abstract Instruction tuning is now a widely adopted approach to aligning large multimodal models (LMMs) to follow human intent. It unifies the data format of vision-language tasks, enabling multi-task joint training. However, vision-language tasks are constantly being created in practice. Instead of always re-training LMMs when new tasks arrive, continual learning offers flexibility for models to continually and efficiently exploit the evolving data. This work aims to explore the following two questions: 1) Do LMMs still suffer from catastrophic forgetting in continual instruction tuning? 2) Are the existing three classes of continual learning methods still applicable to the continual instruction tuning of LMMs? An extensive study is conducted to address the above questions. First, we establish the first benchmark in this setting and reveal that catastrophic forgetting is still observed when continually instruction-tuning LMMs. However, the multi-task joint instruction tuning can facilitate the model's continual learning ability and mitigate forgetting. Second, we integrate and adapt classic continual learning methods to our context, demonstrating the efficacy of data replay and model expansion strategies across diverse scenarios. In contrast, regularization-based methods only perform well on models that have been jointly instruction-tuned on multiple tasks. Third, we delve into the correlation and forgetting dynamics between vision-language task pairs and propose task-similarity-informed regularization and model expansion methods for continual instruction tuning of LMMs. Experimental results show that our approach consistently boosts the model's performance.
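A minimal sketch of the data-replay strategy evaluated in the paper: mix a stored fraction of earlier tasks' instruction samples into the current task's fine-tuning set. The replay_ratio knob and reservoir construction are illustrative choices, not the paper's exact protocol.

```python
import random

def build_replay_mixture(new_task_data, replay_buffers, replay_ratio=0.2, seed=0):
    """Mix a fraction of stored samples from previously seen tasks into the
    current task's instruction-tuning set (classic experience replay)."""
    rng = random.Random(seed)
    old = [ex for buf in replay_buffers for ex in buf]
    n_replay = min(int(len(new_task_data) * replay_ratio), len(old))
    mixture = list(new_task_data) + rng.sample(old, n_replay)
    rng.shuffle(mixture)
    return mixture

# After finishing each task, keep a small reservoir of its samples, e.g.:
# replay_buffers.append(random.sample(task_data, k=buffer_size))
```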

Towards Adaptive RF Fingerprint-based Authentication of IIoT devices

  • paper_url: http://arxiv.org/abs/2311.15888
  • repo_url: None
  • paper_authors: Emmanuel Lomba, Ricardo Severino, Ana Fernández Vilas
  • for: This paper addresses the safety and cyber-security challenges that arise as IoT technologies move into sensitive domains such as Medical and Industrial IoT.
  • methods: It uses AI-adaptive Radio Frequency Fingerprinting technique selection and tuning at the PHY layer for highly accurate device authentication.
  • results: The study achieves powerful and flexible IIoT device authentication with high accuracy over challenging RF environments.
    Abstract As IoT technologies mature, they are increasingly finding their way into more sensitive domains, such as Medical and Industrial IoT, in which safety and cyber-security are of great importance. While the number of deployed IoT devices continues to increase exponentially, they still present severe cyber-security vulnerabilities. Effective authentication is paramount to support trustworthy IIoT communications, however, current solutions focus on upper-layer identity verification or key-based cryptography which are often inadequate to the heterogeneous IIoT environment. In this work, we present a first step towards achieving powerful and flexible IIoT device authentication, by leveraging AI adaptive Radio Frequency Fingerprinting technique selection and tuning, at the PHY layer for highly accurate device authentication over challenging RF environments.

RO-LLaMA: Generalist LLM for Radiation Oncology via Noise Augmentation and Consistency Regularization

  • paper_url: http://arxiv.org/abs/2311.15876
  • repo_url: None
  • paper_authors: Kwanyoung Kim, Yujin Oh, Sangjoon Park, Hwa Kyung Byun, Jin Sung Kim, Yong Bae Kim, Jong Chul Ye
  • for: This work develops a generalist large language model (LLM) tailored to the workflow of radiation oncologists.
  • methods: The model uses a novel Consistency Embedding Fine-Tuning (CEFTune) technique that boosts robustness to errors in intermediate inputs, and extends the idea into an LLM-driven segmentation framework (CESEG).
  • results: Experiments on multi-center cohorts show that the proposed RO-LLaMA performs well across diverse tasks with good generalization.
    Abstract Recent advancements in Artificial Intelligence (AI) have profoundly influenced medical fields, by providing tools to reduce clinical workloads. However, most AI models are constrained to execute uni-modal tasks, in stark contrast to the comprehensive approaches utilized by medical professionals. To address this, here we present RO-LLaMA, a versatile generalist large language model (LLM) tailored for the field of radiation oncology. This model seamlessly covers a wide range of the workflow of radiation oncologists, adept at various tasks such as clinical report summarization, radiation therapy plan suggestion, and plan-guided therapy target volume segmentation. In particular, to maximize the end-to-end performance, we further present a novel Consistency Embedding Fine-Tuning (CEFTune) technique, which boosts LLM's robustness to additional errors at the intermediates while preserving the capability of handling clean inputs, and creatively transform this concept into LLM-driven segmentation framework as Consistency Embedding Segmentation (CESEG). Experimental results on multi-centre cohort sets demonstrate our proposed RO-LLaMA's promising performance for diverse tasks with generalization capabilities.
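The abstract does not spell out the CEFTune objective, but a consistency-style loss along these lines captures the stated idea (robustness to noisy intermediates while preserving clean-input behavior). The sketch assumes a HuggingFace-style causal-LM interface; the mean pooling, the MSE penalty, and the weight lam are illustrative assumptions.

```python
import torch.nn.functional as F

def ceftune_loss(model, clean_ids, noisy_ids, labels, lam=0.5):
    """Consistency-style objective: the usual task loss on noisy (perturbed)
    inputs, plus a penalty keeping the pooled hidden embedding of the noisy
    input close to that of the clean input."""
    out_clean = model(clean_ids, output_hidden_states=True)
    out_noisy = model(noisy_ids, labels=labels, output_hidden_states=True)
    h_clean = out_clean.hidden_states[-1].mean(dim=1)  # mean-pooled embedding
    h_noisy = out_noisy.hidden_states[-1].mean(dim=1)
    consistency = F.mse_loss(h_noisy, h_clean.detach())
    return out_noisy.loss + lam * consistency
```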

Utilizing Explainability Techniques for Reinforcement Learning Model Assurance

  • paper_url: http://arxiv.org/abs/2311.15838
  • repo_url: https://github.com/mitre/arlin
  • paper_authors: Alexander Tapley, Kyle Gatesman, Luis Robaina, Brett Bissey, Joseph Weissman
  • for: Increase the transparency of deep reinforcement learning (DRL) models to raise user trust and adoption in real-world use cases.
  • methods: Use explainable reinforcement learning (XRL) techniques to identify potential vulnerabilities and critical points within trained DRL models prior to deployment.
  • results: The ARLIN (Assured RL Model Interrogation) Toolkit produces detailed, human-interpretable explainability outputs that help detect and fix potential vulnerabilities, improving the reliability and safety of DRL models.
    Abstract Explainable Reinforcement Learning (XRL) can provide transparency into the decision-making process of a Deep Reinforcement Learning (DRL) model and increase user trust and adoption in real-world use cases. By utilizing XRL techniques, researchers can identify potential vulnerabilities within a trained DRL model prior to deployment, therefore limiting the potential for mission failure or mistakes by the system. This paper introduces the ARLIN (Assured RL Model Interrogation) Toolkit, an open-source Python library that identifies potential vulnerabilities and critical points within trained DRL models through detailed, human-interpretable explainability outputs. To illustrate ARLIN's effectiveness, we provide explainability visualizations and vulnerability analysis for a publicly available DRL model. The open-source code repository is available for download at https://github.com/mitre/arlin.

Scale-Dropout: Estimating Uncertainty in Deep Neural Networks Using Stochastic Scale

  • paper_url: http://arxiv.org/abs/2311.15816
  • repo_url: None
  • paper_authors: Soyed Tuhin Ahmed, Kamal Danouchi, Michael Hefenbrock, Guillaume Prenat, Lorena Anghel, Mehdi B. Tahoori
  • for: Improve the reliability and trustworthiness of neural network (NN) predictions, especially in safety-critical applications.
  • methods: Bayesian Neural Networks (BayNNs) with Dropout quantify uncertainty systematically but carry high hardware overhead; this work proposes Scale Dropout, a novel regularization technique for Binary Neural Networks (BNNs), and MC-Scale-Dropout-based BayNNs for efficient uncertainty estimation.
  • results: The approach needs only one stochastic unit regardless of model size, yielding a highly scalable Bayesian NN. A spintronic-memory-based computation-in-memory (CIM) architecture achieves more than 100x energy savings over the state of the art, with up to 1% better predictive performance and superior uncertainty estimates compared to related work.
    Abstract Uncertainty estimation in Neural Networks (NNs) is vital in improving reliability and confidence in predictions, particularly in safety-critical applications. Bayesian Neural Networks (BayNNs) with Dropout as an approximation offer a systematic approach to quantifying uncertainty, but they inherently suffer from high hardware overhead in terms of power, memory, and computation. Thus, the applicability of BayNNs to edge devices with limited resources or to high-performance applications is challenging. Some of the inherent costs of BayNNs can be reduced by accelerating them in hardware on a Computation-In-Memory (CIM) architecture with spintronic memories and binarizing their parameters. However, numerous stochastic units are required to implement conventional dropout-based BayNN. In this paper, we propose the Scale Dropout, a novel regularization technique for Binary Neural Networks (BNNs), and Monte Carlo-Scale Dropout (MC-Scale Dropout)-based BayNNs for efficient uncertainty estimation. Our approach requires only one stochastic unit for the entire model, irrespective of the model size, leading to a highly scalable Bayesian NN. Furthermore, we introduce a novel Spintronic memory-based CIM architecture for the proposed BayNN that achieves more than $100\times$ energy savings compared to the state-of-the-art. We validated our method to show up to a $1\%$ improvement in predictive performance and superior uncertainty estimates compared to related works.
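A rough sketch of the two stated ingredients: a single stochastic scale applied to a whole activation tensor (so one random unit serves the entire model) and Monte Carlo sampling at inference for uncertainty. The uniform scale distribution and its range are assumptions; the paper targets binary networks and spintronic hardware, which this PyTorch toy does not model.

```python
import torch
import torch.nn as nn

class ScaleDropout(nn.Module):
    """One stochastic scale multiplies the whole activation tensor, so a
    single random unit can serve the entire model (unlike unit-wise dropout)."""
    def __init__(self, scale_range=(0.5, 1.5)):  # illustrative range
        super().__init__()
        self.lo, self.hi = scale_range

    def forward(self, x):
        if not self.training:
            return x
        s = torch.empty(1, device=x.device).uniform_(self.lo, self.hi)
        return x * s

@torch.no_grad()
def mc_predict(model, x, n_samples=32):
    """Monte Carlo uncertainty: keep the stochastic scale active at inference
    and aggregate repeated forward passes into a mean and variance."""
    model.train()  # keeps ScaleDropout sampling enabled
    preds = torch.stack([model(x).softmax(dim=-1) for _ in range(n_samples)])
    return preds.mean(0), preds.var(0)
```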

FlowZero: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax

  • paper_url: http://arxiv.org/abs/2311.15813
  • repo_url: https://github.com/aniki-ly/FlowZero
  • paper_authors: Yu Lu, Linchao Zhu, Hehe Fan, Yi Yang
  • for: This paper proposes FlowZero, a new framework for translating text into temporally coherent videos.
  • methods: FlowZero uses large language models (LLMs) to understand complex spatio-temporal dynamics and generate a dynamic scene syntax (scene descriptions, object layouts, and background motion patterns) that guides an image diffusion model for video generation, with an iterative self-refinement step aligning the layouts with the text.
  • results: FlowZero synthesizes coherent videos with smooth object motion and frame-to-frame consistency in a zero-shot setting.
    Abstract Text-to-video (T2V) generation is a rapidly growing research area that aims to translate the scenes, objects, and actions within complex video text into a sequence of coherent visual frames. We present FlowZero, a novel framework that combines Large Language Models (LLMs) with image diffusion models to generate temporally-coherent videos. FlowZero uses LLMs to understand complex spatio-temporal dynamics from text, where LLMs can generate a comprehensive dynamic scene syntax (DSS) containing scene descriptions, object layouts, and background motion patterns. These elements in DSS are then used to guide the image diffusion model for video generation with smooth object motions and frame-to-frame coherence. Moreover, FlowZero incorporates an iterative self-refinement process, enhancing the alignment between the spatio-temporal layouts and the textual prompts for the videos. To enhance global coherence, we propose enriching the initial noise of each frame with motion dynamics to control the background movement and camera motion adaptively. By using spatio-temporal syntaxes to guide the diffusion process, FlowZero achieves improvement in zero-shot video synthesis, generating coherent videos with vivid motion.

Video Anomaly Detection via Spatio-Temporal Pseudo-Anomaly Generation : A Unified Approach

  • paper_url: http://arxiv.org/abs/2311.16514
  • repo_url: None
  • paper_authors: Ayush K. Rai, Tarun Krishna, Feiyan Hu, Alexandru Drimbarean, Kevin McGuinness, Alan F. Smeaton, Noel E. O’Connor
  • for: Propose a new way to generate pseudo-anomalies (PAs) for open-set video anomaly detection (VAD), enabling anomaly detection with autoencoder (AE) based models under the one-class classification (OCC) setting.
  • methods: A pre-trained Latent Diffusion Model inpaints a masked-out image region, and the optical flow is perturbed with mixup to emulate spatio-temporal distortions; a simple unified framework then detects real-world anomalies by learning three anomaly indicators: reconstruction quality, temporal irregularity, and semantic inconsistency.
  • results: Experiments on four VAD benchmarks (Ped2, Avenue, ShanghaiTech, and UBnormal) show performance on par with existing state-of-the-art PA-generation and reconstruction-based methods, along with an analysis of the transferability and generalization of PAs across datasets.
    Abstract Video Anomaly Detection (VAD) is an open-set recognition task, which is usually formulated as a one-class classification (OCC) problem, where training data is comprised of videos with normal instances while test data contains both normal and anomalous instances. Recent works have investigated the creation of pseudo-anomalies (PAs) using only the normal data and making strong assumptions about real-world anomalies with regards to abnormality of objects and speed of motion to inject prior information about anomalies in an autoencoder (AE) based reconstruction model during training. This work proposes a novel method for generating generic spatio-temporal PAs by inpainting a masked out region of an image using a pre-trained Latent Diffusion Model and further perturbing the optical flow using mixup to emulate spatio-temporal distortions in the data. In addition, we present a simple unified framework to detect real-world anomalies under the OCC setting by learning three types of anomaly indicators, namely reconstruction quality, temporal irregularity and semantic inconsistency. Extensive experiments on four VAD benchmark datasets namely Ped2, Avenue, ShanghaiTech and UBnormal demonstrate that our method performs on par with other existing state-of-the-art PAs generation and reconstruction based methods under the OCC setting. Our analysis also examines the transferability and generalisation of PAs across these datasets, offering valuable insights by identifying real-world anomalies through PAs.
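The motion half of the pseudo-anomaly recipe, perturbing optical flow with mixup, can be sketched in a few lines; the Beta(alpha, alpha) coefficient follows standard mixup, and alpha here is an illustrative value.

```python
import numpy as np

def flow_mixup(flow_a, flow_b, alpha=0.4, rng=None):
    """Perturb optical flow with mixup: convexly combine the flow field of a
    normal clip with that of another clip to fake a motion distortion.
    flow_a, flow_b: arrays of shape (H, W, 2)."""
    rng = rng if rng is not None else np.random.default_rng()
    lam = rng.beta(alpha, alpha)  # standard mixup coefficient
    return lam * flow_a + (1.0 - lam) * flow_b
```

A pseudo-anomalous training pair could then combine an inpainted frame (appearance PA) with a mixed flow field (motion PA).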

Planning for the Efficient Updating of Mutual Fund Portfolios

  • paper_url: http://arxiv.org/abs/2311.16204
  • repo_url: None
  • paper_authors: Tomás de la Rosa
  • for: Updating or rebalancing a portfolio of mutual funds.
  • methods: Linear programming and heuristic search approaches.
  • results: Cost improvements over the compared baseline strategy.
    Abstract Once there is a decision of rebalancing or updating a portfolio of funds, the process of changing the current portfolio to the target one involves a set of transactions that are susceptible to optimization. This is particularly relevant when managers have to handle the implications of different types of instruments. In this work we present linear programming and heuristic search approaches that produce plans for executing the update. The evaluation of our proposals shows cost improvements over the compared baseline strategy. The models can be easily extended to other realistic scenarios in which holistic portfolio management is required.
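A minimal linear-programming sketch of the transaction-planning idea, using scipy.optimize.linprog: per-asset buy/sell variables move current holdings to the target while minimizing linear transaction costs. Fractional trades, per-unit costs, and the absence of instrument-specific constraints are simplifying assumptions, not the paper's model.

```python
import numpy as np
from scipy.optimize import linprog

def plan_rebalance(current, target, cost_buy, cost_sell):
    """Choose per-asset buy/sell amounts b_i, s_i >= 0 so holdings reach the
    target while minimizing linear transaction costs:
        min  cost_buy . b + cost_sell . s
        s.t. current + b - s == target,  b, s >= 0
    """
    n = len(current)
    c = np.concatenate([cost_buy, cost_sell])   # objective over [b; s]
    A_eq = np.hstack([np.eye(n), -np.eye(n)])   # current + b - s = target
    b_eq = np.asarray(target, float) - np.asarray(current, float)
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * (2 * n))
    return res.x[:n], res.x[n:]                 # (buys, sells)

buys, sells = plan_rebalance(current=[100, 50], target=[80, 70],
                             cost_buy=[0.01, 0.02], cost_sell=[0.015, 0.01])
```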

A Social-aware Gaussian Pre-trained Model for Effective Cold-start Recommendation

  • paper_url: http://arxiv.org/abs/2311.15790
  • repo_url: None
  • paper_authors: Siwei Liu, Xi Wang, Craig Macdonald, Iadh Ounis
  • for: Improve recommender system performance, particularly for cold-start users.
  • methods: Pre-train on social relations among users together with interaction data using a Graph Neural Network (GNN), then fine-tune with a Gaussian Mixture Model (GMM) that factorises the pre-trained embeddings.
  • results: Against 16 competitive baselines on three public datasets, the SGP model significantly outperforms the best baseline by up to 7.7% in terms of NDCG@10 and effectively alleviates the cold-start problem.
    Abstract The use of pre-training is an emerging technique to enhance a neural model's performance, which has been shown to be effective for many neural language models such as BERT. This technique has also been used to enhance the performance of recommender systems. In such recommender systems, pre-training models are used to learn a better initialisation for both users and items. However, recent existing pre-trained recommender systems tend to only incorporate the user interaction data at the pre-training stage, making it difficult to deliver good recommendations, especially when the interaction data is sparse. To alleviate this common data sparsity issue, we propose to pre-train the recommendation model not only with the interaction data but also with other available information such as the social relations among users, thereby providing the recommender system with a better initialisation compared with solely relying on the user interaction data. We propose a novel recommendation model, the Social-aware Gaussian Pre-trained model (SGP), which encodes the user social relations and interaction data at the pre-training stage in a Graph Neural Network (GNN). Afterwards, in the subsequent fine-tuning stage, our SGP model adopts a Gaussian Mixture Model (GMM) to factorise these pre-trained embeddings for further training, thereby benefiting the cold-start users from these pre-built social relations. Our extensive experiments on three public datasets show that, in comparison to 16 competitive baselines, our SGP model significantly outperforms the best baseline by upto 7.7% in terms of NDCG@10. In addition, we show that SGP permits to effectively alleviate the cold-start problem, especially when users newly register to the system through their friends' suggestions.

YUAN 2.0: A Large Language Model with Localized Filtering-based Attention

  • paper_url: http://arxiv.org/abs/2311.15786
  • repo_url: https://github.com/ieit-yuan/yuan-2.0
  • paper_authors: Shaohua Wu, Xudong Zhao, Shenling Wang, Jiangang Luo, Lingjun Li, Xi Chen, Bing Zhao, Wei Wang, Tong Yu, Rongguo Zhang, Jiahua Zhang, Chao Wang
  • for: This work introduces Localized Filtering-based Attention (LFA), an attention mechanism that incorporates prior knowledge of local dependencies in natural language, and builds Yuan 2.0, a family of large language models, on top of it.
  • methods: The work presents a data filtering and generation method for building high-quality pretraining and fine-tuning datasets, and a distributed training method with non-uniform pipeline, data, and optimizer parallelism that greatly reduces intra-node communication bandwidth requirements and performs well in large-scale distributed training.
  • results: Yuan 2.0 models, built on LFA, show impressive ability in code generation, math problem solving, and chat compared with existing models.
    Abstract In this work, the Localized Filtering-based Attention (LFA) is introduced to incorporate prior knowledge of local dependencies of natural language into Attention. Based on LFA, we develop and release Yuan 2.0, a large language model with parameters ranging from 2.1 billion to 102.6 billion. A data filtering and generation method is presented to build pretraining and fine-tuning dataset in high quality. A distributed training method with non-uniform pipeline parallel, data parallel, and optimizer parallel is proposed, which greatly reduces the bandwidth requirements of intra-node communication, and achieves good performance in large-scale distributed training. Yuan 2.0 models display impressive ability in code generation, math problem-solving, and chat compared with existing models. The latest version of YUAN 2.0, including model weights and source code, is accessible at Github.
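The abstract does not detail the LFA operator, but one common way to bake local dependencies into attention is a small causal depthwise convolution over token representations, sketched below in PyTorch. The kernel size, residual placement, and use of a plain convolution are assumptions for illustration, not Yuan 2.0's exact design.

```python
import torch.nn as nn
import torch.nn.functional as F

class LocalizedFilter(nn.Module):
    """A small causal depthwise convolution that injects local n-gram
    dependencies into token representations before self-attention."""
    def __init__(self, d_model: int, kernel_size: int = 3):
        super().__init__()
        self.pad = kernel_size - 1  # left-pad only, to stay causal
        self.conv = nn.Conv1d(d_model, d_model, kernel_size, groups=d_model)

    def forward(self, x):  # x: (batch, seq, d_model)
        y = F.pad(x.transpose(1, 2), (self.pad, 0))  # (batch, d_model, seq+pad)
        return x + self.conv(y).transpose(1, 2)      # residual local filtering
```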

TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.16503
  • repo_url: None
  • paper_authors: Yushi Huang, Ruihao Gong, Jing Liu, Tianlong Chen, Xianglong Liu
  • for: Address the practical obstacles of diffusion models, namely long inference times and large memory requirements, through efficient post-training quantization.
  • methods: A Temporal Feature Maintenance Quantization (TFMQ) framework built on a Temporal Information Block (TIB) that depends only on the time-step t and not on the sampling data, with temporal information aware reconstruction (TIAR) and finite set calibration (FSC) to align full-precision temporal features.
  • results: TFMQ maintains most temporal information and preserves end-to-end generation quality; under 4-bit weight quantization the model performs nearly on par with the full-precision model, with almost no extra computational cost and a 2.0x speed-up in quantization time on LSUN-Bedrooms 256x256 over previous work.
    Abstract The Diffusion model, a prevalent framework for image generation, encounters significant challenges in terms of broad applicability due to its extended inference times and substantial memory requirements. Efficient Post-training Quantization (PTQ) is pivotal for addressing these issues in traditional models. Different from traditional models, diffusion models heavily depend on the time-step $t$ to achieve satisfactory multi-round denoising. Usually, $t$ from the finite set $\{1, \ldots, T\}$ is encoded to a temporal feature by a few modules totally irrespective of the sampling data. However, existing PTQ methods do not optimize these modules separately. They adopt inappropriate reconstruction targets and complex calibration methods, resulting in a severe disturbance of the temporal feature and denoising trajectory, as well as a low compression efficiency. To solve these, we propose a Temporal Feature Maintenance Quantization (TFMQ) framework building upon a Temporal Information Block which is just related to the time-step $t$ and unrelated to the sampling data. Powered by the pioneering block design, we devise temporal information aware reconstruction (TIAR) and finite set calibration (FSC) to align the full-precision temporal features in a limited time. Equipped with the framework, we can maintain the most temporal information and ensure the end-to-end generation quality. Extensive experiments on various datasets and diffusion models prove our state-of-the-art results. Remarkably, our quantization approach, for the first time, achieves model performance nearly on par with the full-precision model under 4-bit weight quantization. Additionally, our method incurs almost no extra computational cost and accelerates quantization time by $2.0 \times$ on LSUN-Bedrooms $256 \times 256$ compared to previous works.

Increasing Coverage and Precision of Textual Information in Multilingual Knowledge Graphs

  • paper_url: http://arxiv.org/abs/2311.15781
  • repo_url: https://github.com/apple/ml-kge
  • paper_authors: Simone Conia, Min Li, Daniel Lee, Umar Farooq Minhas, Ihab Ilyas, Yunyao Li
  • for: Address the scarcity of high-quality textual information (entity names and descriptions) in non-English languages by introducing the novel task of automatic Knowledge Graph Enhancement (KGE), bridging the gap in both quantity and quality between English and non-English languages.
  • methods: M-NTA, a novel unsupervised approach that combines Machine Translation (MT), Web Search (WS), and Large Language Models (LLMs) to generate high-quality textual information in non-English languages.
  • results: Evaluated on WikiKGE-10, the first human-curated benchmark for KGE covering 10 languages across 7 language families; the approach significantly improves the quantity and quality of non-English textual information and benefits downstream tasks such as Entity Linking, Knowledge Graph Completion, and Question Answering.
    Abstract Recent work in Natural Language Processing and Computer Vision has been using textual information -- e.g., entity names and descriptions -- available in knowledge graphs to ground neural models to high-quality structured data. However, when it comes to non-English languages, the quantity and quality of textual information are comparatively scarce. To address this issue, we introduce the novel task of automatic Knowledge Graph Enhancement (KGE) and perform a thorough investigation on bridging the gap in both the quantity and quality of textual information between English and non-English languages. More specifically, we: i) bring to light the problem of increasing multilingual coverage and precision of entity names and descriptions in Wikidata; ii) demonstrate that state-of-the-art methods, namely, Machine Translation (MT), Web Search (WS), and Large Language Models (LLMs), struggle with this task; iii) present M-NTA, a novel unsupervised approach that combines MT, WS, and LLMs to generate high-quality textual information; and, iv) study the impact of increasing multilingual coverage and precision of non-English textual information in Entity Linking, Knowledge Graph Completion, and Question Answering. As part of our effort towards better multilingual knowledge graphs, we also introduce WikiKGE-10, the first human-curated benchmark to evaluate KGE approaches in 10 languages across 7 language families.
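While the paper's M-NTA scoring is not specified here, its spirit, ranking candidate names and descriptions by cross-source agreement among MT, WS, and LLM outputs, can be sketched as a toy voting scheme; the normalization and the voting rule are assumptions, not the paper's method.

```python
from collections import Counter

def agreement_rank(candidates_by_source):
    """Rank candidate entity names/descriptions by how many independent
    sources (MT, WS, LLM) produced a normalized form of each candidate."""
    votes = Counter()
    for cands in candidates_by_source.values():
        for c in {c.strip().lower() for c in cands}:
            votes[c] += 1
    return votes.most_common()

ranked = agreement_rank({
    "mt":  ["Torre Eiffel"],
    "ws":  ["Torre Eiffel", "Tour Eiffel"],
    "llm": ["Torre Eiffel"],
})  # -> [('torre eiffel', 3), ('tour eiffel', 1)]
```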

Towards Vision Enhancing LLMs: Empowering Multimodal Knowledge Storage and Sharing in LLMs

  • paper_url: http://arxiv.org/abs/2311.15759
  • repo_url: None
  • paper_authors: Yunxin Li, Baotian Hu, Wei Wang, Xiaochun Cao, Min Zhang
  • for: Enhance the multimodal generation capabilities of large language models (LLMs) and exploit multimodal knowledge for language generation.
  • methods: MKS2 empowers multimodal knowledge storage and sharing in LLMs via a Modular Visual Memory integrated into the LLMs' internal blocks to store open-world visual information efficiently, plus a soft Mixtures-of-Multimodal Experts architecture that invokes multimodal knowledge collaboration during generation.
  • results: Experiments show that MKS2 substantially augments LLM reasoning in contexts requiring physical or commonsense knowledge and delivers competitive results on multimodal benchmarks.
    Abstract Recent advancements in multimodal large language models (MLLMs) have achieved significant multimodal generation capabilities, akin to GPT-4. These models predominantly map visual information into language representation space, leveraging the vast knowledge and powerful text generation abilities of LLMs to produce multimodal instruction-following responses. We could term this method LLMs for Vision because it employs LLMs for visual-language understanding, yet we observe that these MLLMs neglect the potential of harnessing visual knowledge to enhance the overall capabilities of LLMs, which could be regarded as Vision Enhancing LLMs. In this paper, we propose an approach called MKS2, aimed at enhancing LLMs through empowering Multimodal Knowledge Storage and Sharing in LLMs. Specifically, we introduce the Modular Visual Memory, a component integrated into the internal blocks of LLMs, designed to store open-world visual information efficiently. Additionally, we present a soft Mixtures-of-Multimodal Experts architecture in LLMs to invoke multimodal knowledge collaboration during generation. Our comprehensive experiments demonstrate that MKS2 substantially augments the reasoning capabilities of LLMs in contexts necessitating physical or commonsense knowledge. It also delivers competitive results on multimodal benchmarks.

SceneDM: Scene-level Multi-agent Trajectory Generation with Consistent Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.15736
  • repo_url: None
  • paper_authors: Zhiming Guo, Xing Gao, Jianlan Zhou, Xinyu Cai, Botian Shi
  • for: This paper proposes a diffusion-model framework that generates joint, consistent future trajectories for all agents in a scene (vehicles, bicycles, pedestrians, etc.) to support the development and evaluation of self-driving algorithms.
  • methods: The framework uses a Transformer-based network to handle agent-agent interactions in the reverse diffusion process, and a simple yet effective consistent-diffusion approach that exploits short-term temporal dependencies for smooth agent trajectories.
  • results: A scene-level scoring function evaluates the safety and road adherence of generated motions, and SceneDM achieves state-of-the-art results on the Waymo Sim Agents Benchmark.
    Abstract Realistic scene-level multi-agent motion simulations are crucial for developing and evaluating self-driving algorithms. However, most existing works focus on generating trajectories for a certain single agent type, and typically ignore the consistency of generated trajectories. In this paper, we propose a novel framework based on diffusion models, called SceneDM, to generate joint and consistent future motions of all the agents, including vehicles, bicycles, pedestrians, etc., in a scene. To enhance the consistency of the generated trajectories, we resort to a new Transformer-based network to effectively handle agent-agent interactions in the inverse process of motion diffusion. In consideration of the smoothness of agent trajectories, we further design a simple yet effective consistent diffusion approach, to improve the model in exploiting short-term temporal dependencies. Furthermore, a scene-level scoring function is attached to evaluate the safety and road-adherence of the generated agent's motions and help filter out unrealistic simulations. Finally, SceneDM achieves state-of-the-art results on the Waymo Sim Agents Benchmark. Project webpage is available at https://alperen-hub.github.io/SceneDM.

Adinkra Symbol Recognition using Classical Machine Learning and Deep Learning

  • paper_url: http://arxiv.org/abs/2311.15728
  • repo_url: None
  • paper_authors: Michael Adjeisah, Kwame Omono Asamoah, Martha Asamoah Yeboah, Raji Rafiu King, Godwin Ferguson Achaab, Kingsley Adjei
  • for: This work aims to boost awareness of and engagement with artificial intelligence (AI) in Black communities and African countries.
  • methods: It applies classical machine learning and deep learning models to a newly constructed ADINKRA dataset, using pre-trained models such as VGG and ResNet for feature extraction and classification.
  • results: A simple CNN model trained with dropout regularization is proposed; its accuracy and convergence rate are evaluated and the regions that most influence its predictions are visualized, providing a foundational benchmark for future assessments of the ADINKRA dataset.
    Abstract Artificial intelligence (AI) has emerged as a transformative influence, engendering paradigm shifts in global societies, spanning academia and industry. However, in light of these rapid advances, addressing the underrepresentation of black communities and African countries in AI is crucial. Boosting enthusiasm for AI can be effectively accomplished by showcasing straightforward applications around tasks like identifying and categorizing traditional symbols, such as Adinkra symbols, or familiar objects within the community. In this research endeavor, we dived into classical machine learning and harnessed the power of deep learning models to tackle the intricate task of classifying and recognizing Adinkra symbols. The idea led to a newly constructed ADINKRA dataset comprising 174,338 images meticulously organized into 62 distinct classes, each representing a singular and emblematic symbol. We constructed a CNN model for classification and recognition using six convolutional layers, three fully connected (FC) layers, and optional dropout regularization. The model is a simpler and smaller version of VGG, with fewer layers, smaller channel sizes, and a fixed kernel size. Additionally, we tap into the transfer learning capabilities provided by pre-trained models like VGG and ResNet. These models assist us in both classifying images and extracting features that can be used with classical machine learning models. We assess the model's performance by measuring its accuracy and convergence rate and visualizing the areas that significantly influence its predictions. These evaluations serve as a foundational benchmark for future assessments of the ADINKRA dataset. We hope this application exemplar inspires ideas on the various uses of AI in organizing our traditional and modern lives.
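The abstract describes the classifier concretely (six convolutional layers, three fully connected layers, optional dropout, a fixed kernel size, smaller than VGG), so a faithful-in-shape PyTorch sketch is possible; the channel widths, pooling schedule, and 224x224 input are assumptions.

```python
import torch.nn as nn

class AdinkraCNN(nn.Module):
    """Sketch matching the described architecture: six 3x3 conv layers,
    three fully connected layers, optional dropout; a smaller VGG-like net."""
    def __init__(self, n_classes: int = 62, dropout: float = 0.5):
        super().__init__()
        chans = [3, 32, 32, 64, 64, 128, 128]  # narrower than VGG's widths
        convs = []
        for i in range(6):
            convs += [nn.Conv2d(chans[i], chans[i + 1], 3, padding=1), nn.ReLU()]
            if i % 2 == 1:
                convs.append(nn.MaxPool2d(2))  # halve resolution every 2 convs
        self.features = nn.Sequential(*convs)
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 28 * 28, 512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, 256), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(256, n_classes),
        )

    def forward(self, x):  # x: (batch, 3, 224, 224)
        return self.classifier(self.features(x))
```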

Italian Crossword Generator: Enhancing Education through Interactive Word Puzzles

  • paper_url: http://arxiv.org/abs/2311.15723
  • repo_url: None
  • paper_authors: Kamyar Zeinalipour, Tommaso laquinta, Asya Zanollo, Giovanni Angelini, Leonardo Rigutini, Marco Maggini, Marco Gori
  • for: Increase student engagement, understanding, critical thinking, and memory retention.
  • methods: Recent advances in natural language processing and machine learning, including GPT3-DaVinci, GPT3-Curie, GPT3-Babbage, GPT3-Ada, and BERT-uncased, are used to generate and verify high-quality educational crossword clues.
  • results: The system creates high-standard educational crosswords that offer students engaging and rewarding learning experiences.
    Abstract Educational crosswords offer numerous benefits for students, including increased engagement, improved understanding, critical thinking, and memory retention. Creating high-quality educational crosswords can be challenging, but recent advances in natural language processing and machine learning have made it possible to use language models to generate nice wordplays. The exploitation of cutting-edge language models like GPT3-DaVinci, GPT3-Curie, GPT3-Babbage, GPT3-Ada, and BERT-uncased has led to the development of a comprehensive system for generating and verifying crossword clues. A large dataset of clue-answer pairs was compiled to fine-tune the models in a supervised manner to generate original and challenging clues from a given keyword. On the other hand, for generating crossword clues from a given text, Zero/Few-shot learning techniques were used to extract clues from the input text, adding variety and creativity to the puzzles. We employed the fine-tuned model to generate data and labeled the acceptability of clue-answer parts with human supervision. To ensure quality, we developed a classifier by fine-tuning existing language models on the labeled dataset. Conversely, to assess the quality of clues generated from the given text using zero/few-shot learning, we employed a zero-shot learning approach to check the quality of generated clues. The results of the evaluation have been very promising, demonstrating the effectiveness of the approach in creating high-standard educational crosswords that offer students engaging and rewarding learning experiences.

GLIME: General, Stable and Local LIME Explanation

  • paper_url: http://arxiv.org/abs/2311.15722
  • repo_url: https://github.com/thutzr/glime-general-stable-and-local-lime-explanation
  • paper_authors: Zeren Tan, Yang Tian, Jian Li
  • for: This work explains the predictions of black-box machine learning models, improving their interpretability.
  • methods: It proposes GLIME, an enhanced LIME framework with significantly faster convergence and improved stability, using a local and unbiased sampling distribution that users can choose to fit their specific scenarios.
  • results: GLIME achieves higher local fidelity than LIME and produces explanations that are independent of the choice of reference.
    Abstract As black-box machine learning models grow in complexity and find applications in high-stakes scenarios, it is imperative to provide explanations for their predictions. Although Local Interpretable Model-agnostic Explanations (LIME) [22] is a widely adopted method for understanding model behaviors, it is unstable with respect to random seeds [35,24,3] and exhibits low local fidelity (i.e., how well the explanation approximates the model's local behaviors) [21,16]. Our study shows that this instability problem stems from small sample weights, leading to the dominance of regularization and slow convergence. Additionally, LIME's sampling neighborhood is non-local and biased towards the reference, resulting in poor local fidelity and sensitivity to reference choice. To tackle these challenges, we introduce GLIME, an enhanced framework extending LIME and unifying several prior methods. Within the GLIME framework, we derive an equivalent formulation of LIME that achieves significantly faster convergence and improved stability. By employing a local and unbiased sampling distribution, GLIME generates explanations with higher local fidelity compared to LIME. GLIME explanations are independent of reference choice. Moreover, GLIME offers users the flexibility to choose a sampling distribution based on their specific scenarios.
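A compact sketch of the core fix GLIME makes to LIME: sample perturbations from a local, unbiased distribution centered at the instance itself, then fit a weighted linear surrogate. The Gaussian sampler, ridge surrogate, and bandwidth sigma are illustrative choices consistent with the abstract, not the paper's exact estimator.

```python
import numpy as np
from sklearn.linear_model import Ridge

def glime_style_explain(predict_fn, x, sigma=0.5, n_samples=2000, seed=0):
    """Fit a weighted linear surrogate on perturbations drawn from a local,
    unbiased distribution centered at the instance x itself (an isotropic
    Gaussian here), instead of LIME's reference-biased neighborhood."""
    rng = np.random.default_rng(seed)
    Z = x + sigma * rng.standard_normal((n_samples, x.shape[0]))
    y = predict_fn(Z)                                         # black-box outputs
    w = np.exp(-((Z - x) ** 2).sum(axis=1) / (2 * sigma**2))  # locality weights
    surrogate = Ridge(alpha=1e-3).fit(Z - x, y, sample_weight=w)
    return surrogate.coef_                                    # local attributions

coefs = glime_style_explain(lambda Z: Z[:, 0] ** 2 + 3 * Z[:, 1],
                            x=np.array([1.0, 2.0]))  # toy black box
```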

Variational Autoencoders for Feature Exploration and Malignancy Prediction of Lung Lesions

  • paper_url: http://arxiv.org/abs/2311.15719
  • repo_url: https://github.com/benkeel/vae_lung_lesion_bmvc
  • paper_authors: Benjamin Keel, Aaron Quyn, David Jayne, Samuel D. Relton
  • for: This study aims to develop an accurate and interpretable AI model for lung cancer diagnosis from routine CT scans.
  • methods: The proposed model uses Variational Autoencoders (VAEs) to learn latent vector representations of lung cancer lesions, which are then used in a multi-layer perceptron (MLP) classifier for diagnosis. The study compares two VAE variations: the Gaussian VAE (GVAE) and the Dirichlet VAE (DirVAE).
  • results: The best model achieved state-of-the-art metrics of AUC 0.98 and 93.1% accuracy. Cluster analysis shows the VAE latent space separates malignant and benign lesions based on meaningful feature components, and latent space traversals correspond to clinically meaningful feature changes.
    Abstract Lung cancer is responsible for 21% of cancer deaths in the UK and five-year survival rates are heavily influenced by the stage the cancer was identified at. Recent studies have demonstrated the capability of AI methods for accurate and early diagnosis of lung cancer from routine scans. However, this evidence has not translated into clinical practice with one barrier being a lack of interpretable models. This study investigates the application Variational Autoencoders (VAEs), a type of generative AI model, to lung cancer lesions. Proposed models were trained on lesions extracted from 3D CT scans in the LIDC-IDRI public dataset. Latent vector representations of 2D slices produced by the VAEs were explored through clustering to justify their quality and used in an MLP classifier model for lung cancer diagnosis, the best model achieved state-of-the-art metrics of AUC 0.98 and 93.1% accuracy. Cluster analysis shows the VAE latent space separates the dataset of malignant and benign lesions based on meaningful feature components including tumour size, shape, patient and malignancy class. We also include a comparative analysis of the standard Gaussian VAE (GVAE) and the more recent Dirichlet VAE (DirVAE), which replaces the prior with a Dirichlet distribution to encourage a more explainable latent space with disentangled feature representation. Finally, we demonstrate the potential for latent space traversals corresponding to clinically meaningful feature changes.
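A minimal sketch of the described pipeline: a Gaussian VAE encodes 2D lesion slices, and an MLP classifies the latent vectors. The 64x64 input, layer widths, and 32-dimensional latent are assumptions (the paper also studies a Dirichlet-prior variant not shown here).

```python
import torch
import torch.nn as nn

class LesionVAE(nn.Module):
    """Gaussian VAE over flattened 2D lesion slices; the latent mean vector
    is later fed to a small MLP for malignancy classification."""
    def __init__(self, latent: int = 32):
        super().__init__()
        self.enc = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 256), nn.ReLU())
        self.mu, self.logvar = nn.Linear(256, latent), nn.Linear(256, latent)
        self.dec = nn.Sequential(nn.Linear(latent, 256), nn.ReLU(),
                                 nn.Linear(256, 64 * 64), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparam trick
        return self.dec(z), mu, logvar

malignancy_head = nn.Sequential(  # MLP classifier over the latent mean
    nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))
```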

Cerbero-7B: A Leap Forward in Language-Specific LLMs Through Enhanced Chat Corpus Generation and Evaluation

  • paper_url: http://arxiv.org/abs/2311.15698
  • repo_url: None
  • paper_authors: Federico A. Galatolo, Mario G. C. A. Cimino
  • for: Generate high-quality, language-specific chat corpora using a self-chat mechanism, with a focus on underrepresented languages like Italian.
  • methods: A generator LLM creates new samples while an embedder LLM ensures diversity; a new MLM-model-based quality assessment metric is proposed for evaluating and filtering the corpora.
  • results: The refined Italian chat corpus and the fine-tuned model (cerbero-7b) demonstrate significantly enhanced language comprehension and question-answering skills, establishing a new state of the art for Italian LLMs.
    Abstract This study introduces a novel approach for generating high-quality, language-specific chat corpora using a self-chat mechanism. We combine a generator LLM for creating new samples and an embedder LLM to ensure diversity. A new Masked Language Modelling (MLM) model-based quality assessment metric is proposed for evaluating and filtering the corpora. Utilizing the llama2-70b as the generator and a multilingual sentence transformer as embedder, we generate an Italian chat corpus and refine the Fauno corpus, which is based on translated English ChatGPT self-chat data. The refinement uses structural assertions and Natural Language Processing techniques. Both corpora undergo a comprehensive quality evaluation using the proposed MLM model-based quality metric. The Italian LLM fine-tuned with these corpora demonstrates significantly enhanced language comprehension and question-answering skills. The resultant model, cerbero-7b, establishes a new state-of-the-art for Italian LLMs. This approach marks a substantial advancement in the development of language-specific LLMs, with a special emphasis on augmenting corpora for underrepresented languages like Italian.
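One plausible reading of the embedder's role in self-chat corpus building is a cosine-similarity diversity filter, sketched here with sentence-transformers; the model name and threshold are illustrative, and the paper's actual generator/embedder pairing and MLM quality metric are not reproduced.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def diversity_filter(samples, threshold=0.9):
    """Drop a newly generated dialogue if it is too similar (cosine) to
    anything already kept, so the corpus stays diverse."""
    embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    kept, kept_vecs = [], []
    for s in samples:
        v = embedder.encode(s, normalize_embeddings=True)
        if all(float(np.dot(v, u)) < threshold for u in kept_vecs):
            kept.append(s)
            kept_vecs.append(v)
    return kept
```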

Peptide Binding Classification on Quantum Computers

  • paper_url: http://arxiv.org/abs/2311.15696
  • repo_url: https://github.com/cqcl/peptide-binding-classification-on-quantum-computers
  • paper_authors: Charles London, Douglas Brown, Wenduan Xu, Sezen Vatansever, Christopher James Langmead, Dimitri Kartsaklis, Stephen Clark, Konstantinos Meichanetzidis
  • for: This study applies near-term quantum computers to a computational biology task relevant to the design of therapeutic proteins, finding competitive performance with classical baselines of similar scale.
  • methods: Quantum models built from parameterised quantum circuits perform sequence classification under modest resource requirements. To study the effect of noise, some of the best-performing quantum models are run on emulators of state-of-the-art noisy quantum processors with error mitigation, and then executed on the Quantinuum H1-1 trapped-ion quantum processor, showing very close agreement with noiseless exact simulation.
  • results: The study shows that near-term quantum computers can be used for a task critical to therapeutic protein design; feature attribution indicates the quantum models identify sensible relationships at least as well as the classical baselines.
    Abstract We conduct an extensive study on using near-term quantum computers for a task in the domain of computational biology. By constructing quantum models based on parameterised quantum circuits we perform sequence classification on a task relevant to the design of therapeutic proteins, and find competitive performance with classical baselines of similar scale. To study the effect of noise, we run some of the best-performing quantum models with favourable resource requirements on emulators of state-of-the-art noisy quantum processors. We then apply error mitigation methods to improve the signal. We further execute these quantum models on the Quantinuum H1-1 trapped-ion quantum processor and observe very close agreement with noiseless exact simulation. Finally, we perform feature attribution methods and find that the quantum models indeed identify sensible relationships, at least as well as the classical baselines. This work constitutes the first proof-of-concept application of near-term quantum computing to a task critical to the design of therapeutic proteins, opening the route toward larger-scale applications in this and related fields, in line with the hardware development roadmaps of near-term quantum technologies.
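A toy parameterised-quantum-circuit classifier in PennyLane illustrates the kind of model the abstract describes; the angle encoding, the StronglyEntanglingLayers ansatz, four qubits, and the single-observable readout are all assumptions, not the paper's circuit.

```python
import numpy as np
import pennylane as qml

n_qubits, n_layers = 4, 2
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def circuit(weights, features):
    """Angle-encode (pre-featurised) peptide descriptors, apply trainable
    entangling layers, and read out one expectation value in [-1, 1]."""
    qml.AngleEmbedding(features, wires=range(n_qubits))
    qml.StronglyEntanglingLayers(weights, wires=range(n_qubits))
    return qml.expval(qml.PauliZ(0))  # sign of the output gives the class

shape = qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=n_qubits)
weights = np.random.default_rng(0).uniform(0, np.pi, size=shape)
pred = circuit(weights, np.array([0.1, 0.7, 0.3, 0.5]))  # toy binding score
```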

Regularization by Texts for Latent Diffusion Inverse Solvers

  • paper_url: http://arxiv.org/abs/2311.15658
  • repo_url: None
  • paper_authors: Jeongsol Kim, Geon Yeong Park, Hyungjin Chung, Jong Chul Ye
  • for: Solve inverse problems using diffusion models as effective generative priors.
  • methods: Incorporate regularization by texts (TReg): a textual description of the preconception of the solution is applied during the reverse sampling phase and dynamically reinforced through null-text optimization for adaptive negation.
  • results: TReg successfully mitigates ambiguity in latent diffusion inverse solvers, enhancing their effectiveness and accuracy.
    Abstract The recent advent of diffusion models has led to significant progress in solving inverse problems, leveraging these models as effective generative priors. Nonetheless, challenges related to the ill-posed nature of such problems remain, often due to inherent ambiguities in measurements. Drawing inspiration from the human ability to resolve visual ambiguities through perceptual biases, here we introduce a novel latent diffusion inverse solver by incorporating regularization by texts (TReg). Specifically, TReg applies the textual description of the preconception of the solution during the reverse sampling phase, and dynamically reinforces this description through null-text optimization for adaptive negation. Our comprehensive experimental results demonstrate that TReg successfully mitigates ambiguity in latent diffusion inverse solvers, enhancing their effectiveness and accuracy.

RoboGPT: an intelligent agent of making embodied long-term decisions for daily instruction tasks

  • paper_url: http://arxiv.org/abs/2311.15649
  • repo_url: None
  • paper_authors: Yaran Chen, Wenbo Cui, Yuanwen Chen, Mining Tan, Xinyao Zhang, Dongbin Zhao, He Wang
  • for: The paper aims to develop a RoboGPT agent that can make embodied long-term decisions for daily tasks through natural language instruction, addressing the feasibility and correctness issues of LLM-generated task plans.
  • methods: The proposed RoboGPT agent consists of two modules: 1) LLMs-based planning with re-plan, which breaks a task into multiple sub-goals, and 2) RoboSkill, individually designed per sub-goal to learn better navigation and manipulation skills. The LLMs-based planning is enhanced with a new robotic dataset and the re-plan module.
  • results: The proposed RoboGPT agent outperforms SOTA methods on the ALFRED daily tasks, and the LLMs-based planner exceeds SOTA LLM-based planners such as ChatGPT in task-planning rationality on hundreds of unseen daily tasks and tasks from other domains, while keeping the large model's original broad applicability and generality.
    Abstract Robotic agents must master common sense and long-term sequential decisions to solve daily tasks through natural language instruction. The developments in Large Language Models (LLMs) in natural language processing have inspired efforts to use LLMs in complex robot planning. Despite LLMs' great generalization and comprehension of instruction tasks, LLMs-generated task plans sometimes lack feasibility and correctness. To address the problem, we propose a RoboGPT agent (our code and dataset will be released soon) for making embodied long-term decisions for daily tasks, with two modules: 1) LLMs-based planning with re-plan to break the task into multiple sub-goals; 2) RoboSkill individually designed for sub-goals to learn better navigation and manipulation skills. The LLMs-based planning is enhanced with a new robotic dataset and re-plan, called RoboGPT. The new robotic dataset of 67k daily instruction tasks is gathered for fine-tuning the Llama model and obtaining RoboGPT. RoboGPT planner with strong generalization can plan hundreds of daily instruction tasks. Additionally, a low-computational Re-Plan module is designed to allow plans to flexibly adapt to the environment, thereby addressing the nomenclature diversity challenge. The proposed RoboGPT agent outperforms SOTA methods on the ALFRED daily tasks. Moreover, RoboGPT planner exceeds SOTA LLM-based planners like ChatGPT in task-planning rationality for hundreds of unseen daily tasks, and even other domain tasks, while keeping the large model's original broad application and generality.

Reinforcement Learning from Diffusion Feedback: Q* for Image Search

  • paper_url: http://arxiv.org/abs/2311.15648
  • repo_url: None
  • paper_authors: Aboli Marathe
  • for: The paper is written for image generation using model-agnostic learning, with a focus on aligning semantic priors with generative capabilities.
  • methods: The paper proposes two methods for image generation: Reinforcement Learning from Diffusion Feedback (RLDF) and Noisy Diffusion Gradient. Both methods use a special Continuous Feature Grammar (CFG) encoding for continual semantic guidance.
  • results: The paper reports that RLDF generates high-quality images over varied domains, including retail, sports, and agriculture, with class-consistency and strong visual diversity. The results are demonstrated using only a single input image and no text input.
    Abstract Large vision-language models are steadily gaining personalization capabilities at the cost of fine-tuning or data augmentation. We present two models for image generation using model-agnostic learning that align semantic priors with generative capabilities. RLDF, or Reinforcement Learning from Diffusion Feedback, is a singular approach for visual imitation through prior-preserving reward function guidance. This employs Q-learning (with standard Q*) for generation and follows a semantic-rewarded trajectory for image search through finite encoding-tailored actions. The second proposed method, noisy diffusion gradient, is optimization driven. At the root of both methods is a special CFG encoding that we propose for continual semantic guidance. Using only a single input image and no text input, RLDF generates high-quality images over varied domains including retail, sports and agriculture showcasing class-consistency and strong visual diversity. Project website is available at https://infernolia.github.io/RLDF.
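
The abstract frames generation as Q-learning over a finite, encoding-tailored action space with semantic rewards. As a reference point, here is a generic tabular Q-learning loop on a toy chain environment; the environment, reward, and hyperparameters are ours, not RLDF's.

```python
import numpy as np

n_states, n_actions = 8, 2          # chain of states; actions: 0 = left, 1 = right
q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.5, 0.9, 0.1
rng = np.random.default_rng(0)

def step(s, a):
    """Move along the chain; reward 1 only on reaching the right end."""
    s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    done = s2 == n_states - 1
    return s2, (1.0 if done else 0.0), done

for episode in range(300):
    s = 0
    for t in range(50):
        # epsilon-greedy exploration
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(q[s]))
        s2, r, done = step(s, a)
        # Q-learning update with the standard max-over-actions bootstrap
        q[s, a] += alpha * (r + gamma * (0.0 if done else q[s2].max()) - q[s, a])
        s = s2
        if done:
            break

print("greedy policy (1 = right):", np.argmax(q, axis=1))
```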

ChatTraffic: Text-to-Traffic Generation via Diffusion Model

  • paper_url: http://arxiv.org/abs/2311.16203
  • repo_url: https://github.com/ChyaZhang/ChatTraffic
  • paper_authors: Chengyang Zhang, Yong Zhang, Qitan Shao, Bo Li, Yisheng Lv, Xinglin Piao, Baocai Yin
  • for: Proposes predicting traffic from textual descriptions of the traffic system to address two main challenges of traditional traffic prediction: 1) insensitivity to unusual events, and 2) poor long-term prediction performance.
  • methods: Formulates the Text-to-Traffic Generation (TTG) task and proposes ChatTraffic, the first diffusion model for it, which associates text with the spatial structure of the road network and traffic data (augmenting the diffusion model with a Graph Convolutional Network) to generate realistic traffic situations.
  • results: Experiments show ChatTraffic can generate realistic traffic situations from text. Code and dataset are available at https://github.com/ChyaZhang/ChatTraffic.
    Abstract Traffic prediction is one of the most significant foundations in Intelligent Transportation Systems (ITS). Traditional traffic prediction methods rely only on historical traffic data to predict traffic trends and face two main challenges. 1) insensitivity to unusual events. 2) poor performance in long-term prediction. In this work, we explore how generative models combined with text describing the traffic system can be applied for traffic generation and name the task Text-to-Traffic Generation (TTG). The key challenge of the TTG task is how to associate text with the spatial structure of the road network and traffic data for generating traffic situations. To this end, we propose ChatTraffic, the first diffusion model for text-to-traffic generation. To guarantee the consistency between synthetic and real data, we augment a diffusion model with the Graph Convolutional Network (GCN) to extract spatial correlations of traffic data. In addition, we construct a large dataset containing text-traffic pairs for the TTG task. We benchmarked our model qualitatively and quantitatively on the released dataset. The experimental results indicate that ChatTraffic can generate realistic traffic situations from the text. Our code and dataset are available at https://github.com/ChyaZhang/ChatTraffic.
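
ChatTraffic augments the diffusion model with a GCN to extract spatial correlations of traffic over the road network. The standard GCN propagation rule it builds on is H' = sigma(D^{-1/2} (A + I) D^{-1/2} H W); below is a minimal NumPy sketch on a toy 4-node road graph (graph, feature sizes, and initialization are ours).

```python
import numpy as np

# Toy road network: 4 segments with an undirected adjacency matrix.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
X = np.random.default_rng(0).normal(size=(4, 3))   # per-node traffic features
W = np.random.default_rng(1).normal(size=(3, 8))   # learnable weight matrix

A_hat = A + np.eye(4)                              # add self-loops
d = A_hat.sum(axis=1)
D_inv_sqrt = np.diag(d ** -0.5)
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt           # symmetric normalization

H = np.maximum(A_norm @ X @ W, 0.0)                # one GCN layer with ReLU
print(H.shape)                                     # (4, 8): new node embeddings
```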

Phonetic-aware speaker embedding for far-field speaker verification

  • paper_url: http://arxiv.org/abs/2311.15627
  • repo_url: None
  • paper_authors: Zezhong Jin, Youzhi Tu, Man-Wai Mak
  • for: Improving the performance of far-field speaker verification systems.
  • methods: Proposes a joint-training speech recognition and speaker recognition (JTSS) framework that exploits phonetic content by matching the frame-based feature maps of the speaker embedding network with wav2vec's vectors.
  • results: Outperforms the standard speaker embedding on the VOiCES Challenge 2019 evaluation set and the VoxCeleb1 test set.
    Abstract When a speaker verification (SV) system operates far from the sound source, significant challenges arise due to the interference of noise and reverberation. Studies have shown that incorporating phonetic information into speaker embeddings can improve the performance of text-independent SV. Inspired by this observation, we propose a joint-training speech recognition and speaker recognition (JTSS) framework to exploit phonetic content for far-field SV. The framework encourages speaker embeddings to preserve phonetic information by matching the frame-based feature maps of a speaker embedding network with wav2vec's vectors. The intuition is that phonetic information can preserve low-level acoustic dynamics with speaker information and thus partly compensate for the degradation due to noise and reverberation. Results show that the proposed framework outperforms the standard speaker embedding on the VOiCES Challenge 2019 evaluation set and the VoxCeleb1 test set. This indicates that leveraging phonetic information under far-field conditions is effective for learning robust speaker representations.
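
The key mechanism is a matching term that pulls frame-level features of the speaker network toward the corresponding wav2vec vectors so the embedding retains phonetic content. A minimal sketch of such a matching loss follows; the dimensions, the learned projection, and the loss weighting are our assumptions, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100                                  # number of frames in one utterance
spk_feats = rng.normal(size=(T, 256))    # frame features from the speaker network
w2v_feats = rng.normal(size=(T, 512))    # frozen wav2vec vectors for the same frames
P = rng.normal(size=(256, 512)) * 0.05   # learnable projection into wav2vec space

def phonetic_matching_loss(spk, w2v, proj):
    """Mean squared error between projected speaker frames and wav2vec targets.
    The wav2vec side is treated as a fixed target (no gradient flows into it)."""
    return float(np.mean((spk @ proj - w2v) ** 2))

# In joint training this term would be added to the usual speaker loss, e.g.:
# total = speaker_classification_loss + lam * phonetic_matching_loss(...)
print(phonetic_matching_loss(spk_feats, w2v_feats, P))
```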

Injecting linguistic knowledge into BERT for Dialogue State Tracking

  • paper_url: http://arxiv.org/abs/2311.15623
  • repo_url: None
  • paper_authors: Xiaohan Feng, Xixin Wu, Helen Meng
  • for: Improving the performance and interpretability of Dialogue State Tracking (DST) models through an unsupervised framework that requires no additional annotations or training data.
  • methods: Extracts linguistic knowledge via an unsupervised framework and injects it into BERT through simple added neural modules; the knowledge extraction procedure is computationally economical and needs no annotations or extra training data.
  • results: Using the Convex Polytopic Model (CPM) as the feature extraction tool, the acquired features correlate strongly with the syntactic and semantic patterns in the dialogues, clarifying which linguistic features influence the DST model's decision-making; benchmarks across DST tasks show a notable accuracy improvement.
    Abstract Dialogue State Tracking (DST) models often employ intricate neural network architectures, necessitating substantial training data, and their inference processes lack transparency. This paper proposes a method that extracts linguistic knowledge via an unsupervised framework and subsequently utilizes this knowledge to augment BERT's performance and interpretability in DST tasks. The knowledge extraction procedure is computationally economical and does not necessitate annotations or additional training data. The injection of the extracted knowledge necessitates the addition of only simple neural modules. We employ the Convex Polytopic Model (CPM) as a feature extraction tool for DST tasks and illustrate that the acquired features correlate with the syntactic and semantic patterns in the dialogues. This correlation facilitates a comprehensive understanding of the linguistic features influencing the DST model's decision-making process. We benchmark this framework on various DST tasks and observe a notable improvement in accuracy.

Align before Adapt: Leveraging Entity-to-Region Alignments for Generalizable Video Action Recognition

  • paper_url: http://arxiv.org/abs/2311.15619
  • repo_url: None
  • paper_authors: Yifei Chen, Dapeng Chen, Ruijin Liu, Sai Zhou, Wenyuan Xue, Wei Peng
  • for: Improving the representation power and generalizability of video action recognition, especially when facing unfamiliar or unseen action categories.
  • methods: Proposes a novel "Align before Adapt" (ALT) paradigm: entity-to-region alignments are computed for each frame by matching region-aware image embeddings to an offline-constructed text corpus, and the aligned entities' text embeddings are then fed as queries to a transformer-based video adapter that extracts the semantics of the most important entities in a video.
  • results: In the fully supervised setting, ALT achieves 88.1% top-1 accuracy on Kinetics-400 with only 4947 GFLOPs; in 2-shot experiments it outperforms the previous state of the art by 7.1% on HMDB-51 and 9.2% on UCF-101.
    Abstract Large-scale visual-language pre-trained models have achieved significant success in various video tasks. However, most existing methods follow an "adapt then align" paradigm, which adapts pre-trained image encoders to model video-level representations and utilizes one-hot or text embedding of the action labels for supervision. This paradigm overlooks the challenge of mapping from static images to complicated activity concepts. In this paper, we propose a novel "Align before Adapt" (ALT) paradigm. Prior to adapting to video representation learning, we exploit the entity-to-region alignments for each frame. The alignments are fulfilled by matching the region-aware image embeddings to an offline-constructed text corpus. With the aligned entities, we feed their text embeddings to a transformer-based video adapter as the queries, which can help extract the semantics of the most important entities from a video to a vector. This paradigm reuses the visual-language alignment of VLP during adaptation and tries to explain an action by the underlying entities. This helps understand actions by bridging the gap with complex activity semantics, particularly when facing unfamiliar or unseen categories. ALT achieves competitive performance and superior generalizability while requiring significantly low computational costs. In fully supervised scenarios, it achieves 88.1% top-1 accuracy on Kinetics-400 with only 4947 GFLOPs. In 2-shot experiments, ALT outperforms the previous state-of-the-art by 7.1% and 9.2% on HMDB-51 and UCF-101, respectively.
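
The "align" step matches region-aware image embeddings against an offline text corpus by similarity, and the matched entities' text embeddings become the adapter's queries. A minimal cosine-similarity version of that matching is sketched below; the shapes, the random corpus, and top-1 selection are our simplifications.

```python
import numpy as np

rng = np.random.default_rng(0)
region_embs = rng.normal(size=(16, 512))    # 16 region embeddings for one frame
corpus_embs = rng.normal(size=(1000, 512))  # offline-constructed entity text corpus

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

sim = normalize(region_embs) @ normalize(corpus_embs).T   # (16, 1000) cosine sims
best = sim.argmax(axis=1)                                  # top-1 entity per region
queries = corpus_embs[best]                                # (16, 512) adapter queries
print(best[:5], queries.shape)
```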

Spatially Covariant Image Registration with Text Prompts

  • paper_url: http://arxiv.org/abs/2311.15607
  • repo_url: None
  • paper_authors: Hang Zhang, Xiang Chen, Rongguang Wang, Renjiu Hu, Dongdong Liu, Gaolei Li
  • for: Exploiting anatomical priors to improve deformable registration of medical images, which are characterized by structured anatomical representations and spatially inhomogeneous contrasts.
  • methods: Proposes textSCF, which combines spatially covariant filters with textual anatomical prompts encoded by visual-language models, optimizing an implicit function that relates text embeddings of anatomical regions to filter weights and thereby relaxing the usual translation-invariance constraint of convolution.
  • results: textSCF improves computational efficiency while retaining or improving registration accuracy; it captures the contextual interplay between anatomical regions, offers strong inter-regional transferability, and preserves structural discontinuities during registration. It outperformed state-of-the-art models in the MICCAI Learn2Reg 2021 challenge; on abdominal registration its larger variant improved the Dice score by 11.3% over the second-best model, while its smaller variant kept similar accuracy with 89.13% fewer network parameters and 98.34% fewer computational operations.
    Abstract Medical images are often characterized by their structured anatomical representations and spatially inhomogeneous contrasts. Leveraging anatomical priors in neural networks can greatly enhance their utility in resource-constrained clinical settings. Prior research has harnessed such information for image segmentation, yet progress in deformable image registration has been modest. Our work introduces textSCF, a novel method that integrates spatially covariant filters and textual anatomical prompts encoded by visual-language models, to fill this gap. This approach optimizes an implicit function that correlates text embeddings of anatomical regions to filter weights, relaxing the typical translation-invariance constraint of convolutional operations. TextSCF not only boosts computational efficiency but can also retain or improve registration accuracy. By capturing the contextual interplay between anatomical regions, it offers impressive inter-regional transferability and the ability to preserve structural discontinuities during registration. TextSCF's performance has been rigorously tested on inter-subject brain MRI and abdominal CT registration tasks, outperforming existing state-of-the-art models in the MICCAI Learn2Reg 2021 challenge and leading the leaderboard. In abdominal registrations, textSCF's larger model variant improved the Dice score by 11.3% over the second-best model, while its smaller variant maintained similar accuracy but with an 89.13% reduction in network parameters and a 98.34% decrease in computational operations.

QuickDrop: Efficient Federated Unlearning by Integrated Dataset Distillation

  • paper_url: http://arxiv.org/abs/2311.15603
  • repo_url: None
  • paper_authors: Akash Dhasade, Yaohong Ding, Song Guo, Anne-marie Kermarrec, Martijn De Vos, Leijie Wu
  • for: Federated Unlearning (FU): deleting specific training data from a model trained with Federated Learning (FL).
  • methods: QuickDrop uses dataset distillation (DD) to accelerate unlearning and drastically reduce computational overhead compared with existing approaches: each client distills a compact dataset representative of its original training data and then runs Stochastic Gradient Ascent on samples from it to unlearn specific knowledge from the global model. Integrating DD into FL training by reusing the gradient updates produced during training makes the cost of creating distilled datasets nearly negligible.
  • results: With comparable accuracy guarantees, QuickDrop reduces unlearning duration by 463.8x versus retraining from scratch and 65.1x versus existing FU approaches, scales to 100 clients, and handles multiple unlearning operations effectively.
    Abstract Federated Unlearning (FU) aims to delete specific training data from an ML model trained using Federated Learning (FL). We introduce QuickDrop, an efficient and original FU method that utilizes dataset distillation (DD) to accelerate unlearning and drastically reduces computational overhead compared to existing approaches. In QuickDrop, each client uses DD to generate a compact dataset representative of the original training dataset, called a distilled dataset, and uses this compact dataset during unlearning. To unlearn specific knowledge from the global model, QuickDrop has clients execute Stochastic Gradient Ascent with samples from the distilled datasets, thus significantly reducing computational overhead compared to conventional FU methods. We further increase the efficiency of QuickDrop by ingeniously integrating DD into the FL training process. By reusing the gradient updates produced during FL training for DD, the overhead of creating distilled datasets becomes close to negligible. Evaluations on three standard datasets show that, with comparable accuracy guarantees, QuickDrop reduces the duration of unlearning by 463.8x compared to model retraining from scratch and 65.1x compared to existing FU approaches. We also demonstrate the scalability of QuickDrop with 100 clients and show its effectiveness while handling multiple unlearning operations.
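
The unlearning step itself is Stochastic Gradient Ascent on the distilled "forget" samples, pushing the model away from the knowledge to be erased. A toy version on a linear model follows; the model, data, step size, and stopping rule are ours, and a real system would bound or regularize the ascent to avoid destroying retained knowledge.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3)                       # a trained (toy) linear model
distilled_X = rng.normal(size=(10, 3))       # client's distilled forget-set samples
distilled_y = distilled_X @ np.array([1.0, -2.0, 0.5])

def grad_mse(w, X, y):
    """Gradient of the mean squared error of the linear model on (X, y)."""
    return 2 * X.T @ (X @ w - y) / len(X)

# Gradient *ascent* on the forget set increases its loss, erasing that knowledge.
for step in range(20):
    w += 0.05 * grad_mse(w, distilled_X, distilled_y)

loss = float(np.mean((distilled_X @ w - distilled_y) ** 2))
print(f"loss on forget set after unlearning: {loss:.2f} (should be large)")
```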

UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition

  • paper_url: http://arxiv.org/abs/2311.15599
  • repo_url: https://github.com/ailab-cvc/unireplknet
  • paper_authors: Xiaohan Ding, Yiyuan Zhang, Yixiao Ge, Sijie Zhao, Lin Song, Xiangyu Yue, Ying Shan
  • for: Addresses two open questions for large-kernel convolutional neural networks (ConvNets): how to design their architectures, and whether they possess strong universal perception ability in modalities beyond vision.
  • methods: Proposes four architectural guidelines whose core idea is to exploit the defining property of large kernels, namely that they can see wide without going deep. With simple modality-specific preprocessing, the same architecture is applied to audio, video, point cloud, and time-series data.
  • results: The model achieves leading performance, e.g. 88.0% ImageNet accuracy, 55.6% ADE20K mIoU, and 56.4% COCO box AP, with higher speed than several recent strong competitors, and reaches state-of-the-art results on time-series forecasting and audio recognition without modality-specific architectural customization.
    Abstract Large-kernel convolutional neural networks (ConvNets) have recently received extensive research attention, but there are two unresolved and critical issues that demand further investigation. 1) The architectures of existing large-kernel ConvNets largely follow the design principles of conventional ConvNets or transformers, while the architectural design for large-kernel ConvNets remains under-addressed. 2) As transformers have dominated multiple modalities, it remains to be investigated whether ConvNets also have a strong universal perception ability in domains beyond vision. In this paper, we contribute from two aspects. 1) We propose four architectural guidelines for designing large-kernel ConvNets, the core of which is to exploit the essential characteristics of large kernels that distinguish them from small kernels - they can see wide without going deep. Following such guidelines, our proposed large-kernel ConvNet shows leading performance in image recognition. For example, our models achieve an ImageNet accuracy of 88.0%, ADE20K mIoU of 55.6%, and COCO box AP of 56.4%, demonstrating better performance and higher speed than a number of recently proposed powerful competitors. 2) We discover that large kernels are the key to unlocking the exceptional performance of ConvNets in domains where they were originally not proficient. With certain modality-related preprocessing approaches, the proposed model achieves state-of-the-art performance on time-series forecasting and audio recognition tasks even without modality-specific customization to the architecture. Code and all the models at https://github.com/AILab-CVC/UniRepLKNet.
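
Work in this large-kernel line (e.g., RepLKNet, which this paper builds on) commonly trains a small parallel kernel alongside the large one and merges it into a single large kernel for inference via structural re-parameterization, exploiting the linearity of convolution. A minimal single-channel sketch of the merge, with all shapes our own:

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 32))
k_large = rng.normal(size=(13, 13))
k_small = rng.normal(size=(3, 3))

# Training-time: two parallel branches whose outputs are summed.
branch_sum = (correlate2d(x, k_large, mode="same")
              + correlate2d(x, k_small, mode="same"))

# Inference-time: zero-pad the small kernel to 13x13 (centered) and add it into
# the large kernel; convolution is linear in the kernel, so the merge is exact.
k_merged = k_large.copy()
k_merged[5:8, 5:8] += k_small
merged_out = correlate2d(x, k_merged, mode="same")

print(np.allclose(branch_sum, merged_out))   # True
```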

Networked Multiagent Safe Reinforcement Learning for Low-carbon Demand Management in Distribution Network

  • paper_url: http://arxiv.org/abs/2311.15594
  • repo_url: None
  • paper_authors: Jichen Zhang, Linwei Sang, Yinliang Xu, Hongbin Sun
  • for: Proposes a multiagent bi-level operation framework for low-carbon demand management in distribution networks, accounting for carbon emission allowances on the demand side.
  • methods: Uses distributed flexible load agents that hold only incomplete information about the distribution network and cooperate via networked communication; the problem is formulated as a networked multi-agent constrained Markov decision process and solved with a safe reinforcement learning algorithm, consensus multi-agent constrained policy optimization.
  • results: Case studies on the IEEE 33-bus and 123-bus distribution systems show the approach satisfies the demand-side carbon emission constraint, ensures safe operation of the distribution network, and preserves the privacy of both sides.
    Abstract This paper proposes a multiagent based bi-level operation framework for the low-carbon demand management in distribution networks considering the carbon emission allowance on the demand side. In the upper level, the aggregate load agents optimize the control signals for various types of loads to maximize the profits; in the lower level, the distribution network operator makes optimal dispatching decisions to minimize the operational costs and calculates the distribution locational marginal price and carbon intensity. The distributed flexible load agent has only incomplete information of the distribution network and cooperates with other agents using networked communication. Finally, the problem is formulated into a networked multi-agent constrained Markov decision process, which is solved using a safe reinforcement learning algorithm called consensus multi-agent constrained policy optimization considering the carbon emission allowance for each agent. Case studies with the IEEE 33-bus and 123-bus distribution network systems demonstrate the effectiveness of the proposed approach, in terms of satisfying the carbon emission constraint on demand side, ensuring the safe operation of the distribution network and preserving privacy of both sides.

Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation

  • paper_url: http://arxiv.org/abs/2311.16201
  • repo_url: None
  • paper_authors: Yuhui Zhang, Brandon McKinzie, Zhe Gan, Vaishaal Shankar, Alexander Toshev
  • for: Investigates whether auto-regressive text-to-image generation can benefit from pre-trained language models, and finds that the benefit is limited.
  • methods: Adapts a pre-trained language model for auto-regressive text-to-image generation and analyzes the tokens of each modality, showing that image tokens carry semantics very different from text tokens, so pre-trained language models model them no better than randomly initialized ones.
  • results: Also finds that the text tokens in image-text datasets are too simple compared with normal language-model pre-training data, which causes catastrophic degradation of the language model's capability on these tasks.
    Abstract Recent advances in image tokenizers, such as VQ-VAE, have enabled text-to-image generation using auto-regressive methods, similar to language modeling. However, these methods have yet to leverage pre-trained language models, despite their adaptability to various downstream tasks. In this work, we explore this gap by adapting a pre-trained language model for auto-regressive text-to-image generation, and find that pre-trained language models offer limited help. We provide a two-fold explanation by analyzing tokens from each modality. First, we demonstrate that image tokens possess significantly different semantics compared to text tokens, rendering pre-trained language models no more effective in modeling them than randomly initialized ones. Second, the text tokens in the image-text datasets are too simple compared to normal language model pre-training data, which causes the catastrophic degradation of language models' capability.

Improving Adaptability and Generalizability of Efficient Transfer Learning for Vision-Language Models

  • paper_url: http://arxiv.org/abs/2311.15569
  • repo_url: None
  • paper_authors: Yongjin Yang, Jongwoo Ko, Se-Young Yun
  • for: Studies how CLIP-like vision-language models (VLMs) behave when adapted to downstream tasks with prompts and adapters for efficient transfer learning.
  • methods: Empirically analyzes vision prompts, text adapters, and their combinations, and proposes an adaptive ensemble method that combines the general knowledge of VLMs with task-specific knowledge according to transfer difficulty.
  • results: Experiments show that vision prompts are crucial for class separability and text adapters for task adaptation; the proposed adaptive ensemble consistently outperforms all baselines, particularly on unseen tasks.
    Abstract Vision-Language Models (VLMs) like CLIP have demonstrated remarkable applicability across a variety of downstream tasks, including zero-shot image classification. Recently, the use of prompts or adapters for efficient transfer learning has gained significant attention for effectively adapting to downstream tasks. However, the roles of vision and text prompts, as well as adapters in terms of generalization and transfer difficulty, have been overlooked, limiting performance on unseen tasks. In this paper, we empirically analyze how VLMs behave when using vision and text prompts, adapters, and a combination of these components, marking a novel exploration by our study. Our observations find that utilizing vision prompts for class separability and text adapters for task adaptation is crucial for adaptability and generalizability. Moreover, to improve generalization across every domain, we propose an adaptive ensemble method that effectively combines the general knowledge of VLMs with task-specific knowledge according to transfer difficulty. Upon experimenting with extensive benchmarks, our method consistently outperforms all baselines, particularly on unseen tasks, demonstrating the effectiveness of our proposed approach.
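
The ensemble mixes zero-shot (general) predictions with adapted (task-specific) ones, weighted by transfer difficulty. The paper's exact weighting scheme is not given here, so the sketch below uses a hypothetical per-sample difficulty proxy based on zero-shot confidence; everything in it is our assumption.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
logits_zeroshot = rng.normal(size=(5, 10))   # general VLM knowledge
logits_adapted = rng.normal(size=(5, 10))    # prompt/adapter-tuned knowledge

# Hypothetical difficulty proxy: low zero-shot confidence suggests a hard
# transfer, so lean more on the adapted model for that sample.
conf = softmax(logits_zeroshot).max(axis=1)      # (5,) max class probability
w = (1.0 - conf)[:, None]                        # per-sample adapted weight

probs = (1 - w) * softmax(logits_zeroshot) + w * softmax(logits_adapted)
print(probs.argmax(axis=1))
```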

Evaluating the Efficacy of Hybrid Deep Learning Models in Distinguishing AI-Generated Text

  • paper_url: http://arxiv.org/abs/2311.15565
  • repo_url: None
  • paper_authors: Finbarrs Oketunji
  • for: Investigates cutting-edge hybrid deep learning models for accurately distinguishing AI-generated text from human writing.
  • methods: Applies a robust methodology to a carefully selected dataset of AI and human texts from various sources, each tagged with instructions; advanced natural language processing techniques support the analysis of textual features, and a custom model combining sophisticated neural networks detects nuanced differences between AI and human content.
  • results: The custom model accurately distinguishes AI-generated from human-written text and generalizes well across text types and lengths.
    Abstract My research investigates the use of cutting-edge hybrid deep learning models to accurately differentiate between AI-generated text and human writing. I applied a robust methodology, utilising a carefully selected dataset comprising AI and human texts from various sources, each tagged with instructions. Advanced natural language processing techniques facilitated the analysis of textual features. Combining sophisticated neural networks, the custom model enabled it to detect nuanced differences between AI and human content.

Instruct2Attack: Language-Guided Semantic Adversarial Attacks

  • paper_url: http://arxiv.org/abs/2311.15551
  • repo_url: None
  • paper_authors: Jiang Liu, Chen Wei, Yuxiang Guo, Heng Yu, Alan Yuille, Soheil Feizi, Chun Pong Lau, Rama Chellappa
  • for: Develops a language-guided semantic attack that generates semantically meaningful perturbations according to free-form language instructions.
  • methods: Uses off-the-shelf latent diffusion models, adversarially guiding the reverse diffusion process to search for an adversarial latent code conditioned on the input image and the text instruction.
  • results: Compared with existing noise-based and semantic attacks, I2A generates more natural and diverse adversarial examples with better controllability and interpretability; GPT-4 is used to automatically generate diverse image-specific text instructions, and I2A breaks state-of-the-art deep neural networks even under strong adversarial defenses while transferring well across architectures.
    Abstract We propose Instruct2Attack (I2A), a language-guided semantic attack that generates semantically meaningful perturbations according to free-form language instructions. We make use of state-of-the-art latent diffusion models, where we adversarially guide the reverse diffusion process to search for an adversarial latent code conditioned on the input image and text instruction. Compared to existing noise-based and semantic attacks, I2A generates more natural and diverse adversarial examples while providing better controllability and interpretability. We further automate the attack process with GPT-4 to generate diverse image-specific text instructions. We show that I2A can successfully break state-of-the-art deep neural networks even under strong adversarial defenses, and demonstrate great transferability among a variety of network architectures.

From Prediction to Action: The Critical Role of Proper Performance Estimation for Machine-Learning-Driven Materials Discovery

  • paper_url: http://arxiv.org/abs/2311.15549
  • repo_url: None
  • paper_authors: Mario Boley, Felix Luong, Simon Teshuva, Daniel F Schmidt, Lucas Foppa, Matthias Scheffler
  • for: Improving the efficiency and reliability of data-driven materials discovery, framed as an iterative decision process driven by statistical property models.
  • methods: A model-informed acquisition function extends an initial data collection with new data so as to maximize a cumulative "reward" over time, such as the maximum property value discovered so far; a novel performance estimator computable from pre-computed data collections is proposed.
  • results: In-distribution predictive performance does not directly correlate with the discovery reward: for bulk modulus maximization among double perovskite oxides, random forests look superior in-distribution while Gaussian processes win on reward. The proposed estimator correctly predicts Gaussian processes with the "expected improvement" acquisition function as the best of four options, without requiring the over one thousand ab initio computations needed to confirm this prediction.
    Abstract Materials discovery driven by statistical property models is an iterative decision process, during which an initial data collection is extended with new data proposed by a model-informed acquisition function--with the goal to maximize a certain "reward" over time, such as the maximum property value discovered so far. While the materials science community achieved much progress in developing property models that predict well on average with respect to the training distribution, this form of in-distribution performance measurement is not directly coupled with the discovery reward. This is because an iterative discovery process has a shifting reward distribution that is over-proportionally determined by the model performance for exceptional materials. We demonstrate this problem using the example of bulk modulus maximization among double perovskite oxides. We find that the in-distribution predictive performance suggests random forests as superior to Gaussian process regression, while the results are inverse in terms of the discovery rewards. We argue that the lack of proper performance estimation methods from pre-computed data collections is a fundamental problem for improving data-driven materials discovery, and we propose a novel such estimator that, in contrast to naïve reward estimation, successfully predicts Gaussian processes with the "expected improvement" acquisition function as the best out of four options in our demonstrational study for double perovskites. Importantly, it does so without requiring the over thousand ab initio computations that were needed to confirm this prediction.
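
The acquisition function the estimator singles out, expected improvement (EI), has a standard closed form for a Gaussian posterior. For maximization with posterior mean mu, standard deviation sigma, and incumbent best f_best, EI = (mu - f_best - xi) * Phi(z) + sigma * phi(z) with z = (mu - f_best - xi) / sigma. A small sketch (the candidate values below are invented for illustration):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.0):
    """Closed-form EI for maximization under a Gaussian posterior."""
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    improve = mu - f_best - xi
    # Avoid division by zero where the posterior is deterministic.
    z = np.divide(improve, sigma, out=np.zeros_like(sigma), where=sigma > 0)
    ei = improve * norm.cdf(z) + sigma * norm.pdf(z)
    return np.where(sigma > 0, ei, np.maximum(improve, 0.0))

# Candidate materials with GP posterior means/stds; pick the next to compute.
mu = np.array([250.0, 240.0, 260.0])      # e.g. predicted bulk modulus (GPa)
sigma = np.array([5.0, 30.0, 1.0])
print(expected_improvement(mu, sigma, f_best=255.0).argmax())  # favors candidate 1
```

Note how EI rewards the high-uncertainty candidate even though its mean is lowest, which is exactly the exploratory behavior that plain in-distribution accuracy metrics fail to capture.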

Deficiency of Large Language Models in Finance: An Empirical Examination of Hallucination

  • paper_url: http://arxiv.org/abs/2311.15548
  • repo_url: None
  • paper_authors: Haoqiang Kang, Xiao-Yang Liu
  • for: This paper aims to empirically investigate the hallucination behaviors of large language models (LLMs) in financial tasks, and to evaluate the effectiveness of four practical methods for mitigating these behaviors.
  • methods: The paper uses empirical investigation and evaluation of four practical methods to study the hallucination behaviors of LLMs in financial tasks. The methods include few-shot learning, Decoding by Contrasting Layers (DoLa), the Retrieval Augmentation Generation (RAG) method, and the prompt-based tool learning method.
  • results: The paper finds that off-the-shelf LLMs experience serious hallucination behaviors in financial tasks, highlighting the urgent need for research efforts to mitigate these behaviors.
    Abstract The hallucination issue is recognized as a fundamental deficiency of large language models (LLMs), especially when applied to fields such as finance, education, and law. Despite the growing concerns, there has been a lack of empirical investigation. In this paper, we provide an empirical examination of LLMs' hallucination behaviors in financial tasks. First, we empirically investigate LLM model's ability of explaining financial concepts and terminologies. Second, we assess LLM models' capacity of querying historical stock prices. Third, to alleviate the hallucination issue, we evaluate the efficacy of four practical methods, including few-shot learning, Decoding by Contrasting Layers (DoLa), the Retrieval Augmentation Generation (RAG) method and the prompt-based tool learning method for a function to generate a query command. Finally, our major finding is that off-the-shelf LLMs experience serious hallucination behaviors in financial tasks. Therefore, there is an urgent need to call for research efforts in mitigating LLMs' hallucination.

Out-of-Distribution Generalized Dynamic Graph Neural Network for Human Albumin Prediction

  • paper_url: http://arxiv.org/abs/2311.15545
  • repo_url: None
  • paper_authors: Zeyang Zhang, Xingwang Li, Fei Teng, Ning Lin, Xueling Zhu, Xin Wang, Wenwu Zhu
  • for: Predicting plasma albumin levels for ICU patients during hospitalization, to help maintain optimal blood levels in critically ill patients.
  • methods: Proposes DyG-HAP, an out-of-distribution generalized dynamic graph neural network: albumin prediction is modeled as dynamic graph regression to capture dynamics and patient relationships; a disentangled dynamic graph attention mechanism separates patterns whose relationship to labels is invariant versus variant under distribution shift; and an invariant dynamic graph regression method encourages the model to rely on the invariant patterns for prediction.
  • results: On the proposed ANIC dataset (Albumin level testing and nutritional dosing data for Intensive Care), the method outperforms several baseline methods for human albumin prediction.
    Abstract Human albumin is essential for indicating the body's overall health. Accurately predicting plasma albumin levels and determining appropriate doses are urgent clinical challenges, particularly in critically ill patients, to maintain optimal blood levels. However, human albumin prediction is non-trivial that has to leverage the dynamics of biochemical markers as well as the experience of treating patients. Moreover, the problem of distribution shift is often encountered in real clinical data, which may lead to a decline in the model prediction performance and reduce the reliability of the model's application. In this paper, we propose a framework named Out-of-Distribution Generalized Dynamic Graph Neural Network for Human Albumin Prediction (DyG-HAP), which is able to provide accurate albumin predictions for Intensity Care Unit (ICU) patients during hospitalization. We first model human albumin prediction as a dynamic graph regression problem to model the dynamics and patient relationship. Then, we propose a disentangled dynamic graph attention mechanism to capture and disentangle the patterns whose relationship to labels under distribution shifts is invariant and variant respectively. Last, we propose an invariant dynamic graph regression method to encourage the model to rely on invariant patterns to make predictions. Moreover, we propose a dataset named Albumin level testing and nutritional dosing data for Intensive Care (ANIC) for evaluation. Extensive experiments demonstrate the superiority of our method compared to several baseline methods in human albumin prediction.

MI-Gen: Multiple Instance Generation of Pathology Reports for Gigapixel Whole-Slide Images

  • paper_url: http://arxiv.org/abs/2311.16480
  • repo_url: None
  • paper_authors: Pingyi Chen, Honglin Li, Chenglu Zhu, Sunyi Zheng, Lin Yang
  • for: Advancing digital pathology for carcinoma diagnosis and treatment by automatically generating pathology reports from gigapixel whole-slide images (WSIs).
  • methods: Curates TCGA-PathoText, the largest WSI-text dataset (nearly 10,000 high-quality pairs obtained by recognizing and cleaning pathology reports in TCGA), and proposes MI-Gen, a multiple instance generative model that produces pathology reports for gigapixel WSIs.
  • results: Experiments show the model generates reports containing multiple clinical clues; treating WSI-text prediction as visual-language pre-training, simple semantic extraction from the reports achieves the best performance (0.838 F1 score) on BRCA subtyping without extra parameters or tricky fine-tuning.
    Abstract Whole slide images are the foundation of digital pathology for the diagnosis and treatment of carcinomas. Writing pathology reports is laborious and error-prone for inexperienced pathologists. To reduce the workload and improve clinical automation, we investigate how to generate pathology reports given whole slide images. On the data end, we curated the largest WSI-text dataset (TCGA-PathoText). In specific, we collected nearly 10000 high-quality WSI-text pairs for visual-language models by recognizing and cleaning pathology reports which narrate diagnostic slides in TCGA. On the model end, we propose the multiple instance generative model (MI-Gen) which can produce pathology reports for gigapixel WSIs. We benchmark our model on the largest subset of TCGA-PathoText. Experimental results show our model can generate pathology reports which contain multiple clinical clues. Furthermore, WSI-text prediction can be seen as an approach of visual-language pre-training, which enables our model to be transferred to downstream diagnostic tasks like carcinoma grading and phenotyping. We observe that simple semantic extraction from the pathology reports can achieve the best performance (0.838 of F1 score) on BRCA subtyping without adding extra parameters or tricky fine-tuning. Our collected dataset and related code will all be publicly available.

SSIN: Self-Supervised Learning for Rainfall Spatial Interpolation

  • paper_url: http://arxiv.org/abs/2311.15530
  • repo_url: https://github.com/jlidw/ssin
  • paper_authors: Jia Li, Yanyan Shen, Lei Chen, Charles Wang Wai NG
  • for: Proposes SSIN, a novel data-driven self-supervised learning framework for spatial interpolation of rainfall from available raingauge data.
  • methods: The core of SSIN is SpaFormer, a Transformer-based model inspired by the Cloze task and BERT: random masking constructs rich self-supervision signals, so the model learns informative embeddings of the raw data and adaptively models spatial correlations based on the rainfall spatial context.
  • results: Experiments on two real-world raingauge datasets show SSIN outperforms state-of-the-art methods; applied to traffic spatial interpolation as a further use case, SpaFormer achieves the best performance on a large real-world traffic dataset, confirming the effectiveness and generality of the method.
    Abstract The acquisition of accurate rainfall distribution in space is an important task in hydrological analysis and natural disaster pre-warning. However, it is impossible to install rain gauges on every corner. Spatial interpolation is a common way to infer rainfall distribution based on available raingauge data. However, the existing works rely on some unrealistic pre-settings to capture spatial correlations, which limits their performance in real scenarios. To tackle this issue, we propose the SSIN, which is a novel data-driven self-supervised learning framework for rainfall spatial interpolation by mining latent spatial patterns from historical observation data. Inspired by the Cloze task and BERT, we fully consider the characteristics of spatial interpolation and design the SpaFormer model based on the Transformer architecture as the core of SSIN. Our main idea is: by constructing rich self-supervision signals via random masking, SpaFormer can learn informative embeddings for raw data and then adaptively model spatial correlations based on rainfall spatial context. Extensive experiments on two real-world raingauge datasets show that our method outperforms the state-of-the-art solutions. In addition, we take traffic spatial interpolation as another use case to further explore the performance of our method, and SpaFormer achieves the best performance on one large real-world traffic dataset, which further confirms the effectiveness and generality of our method.
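
The self-supervision signal is built BERT-style: randomly mask some gauge readings, feed the masked sequence to the model, and train it to reconstruct the held-out values. A minimal sketch of the masking step is below; the mask ratio, sentinel value, and toy data are our choices.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.gamma(2.0, 3.0, size=20)        # rainfall readings at 20 gauges
coords = rng.uniform(0, 100, size=(20, 2))   # gauge locations (used by the model)

mask_ratio = 0.25
mask = rng.random(20) < mask_ratio

model_input = values.copy()
model_input[mask] = -1.0                     # sentinel "[MASK]" value
targets = values[mask]                       # the model must reconstruct these

# A model f(model_input, coords) would then be trained with, e.g.:
# loss = mean((f(model_input, coords)[mask] - targets) ** 2)
print(mask.sum(), "gauges masked; targets:", np.round(targets, 2))
```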

Generation of patient specific cardiac chamber models using generative neural networks under a Bayesian framework for electroanatomical mapping

  • paper_url: http://arxiv.org/abs/2311.16197
  • repo_url: None
  • paper_authors: Sunil Mathew, Jasbir Sra, Daniel B. Rowe
  • for: Diagnosis, treatment planning, and real-time guidance in cardiac ablation procedures.
  • methods: Uses probabilistic machine learning models under a Bayesian framework, trained on a library of segmented CT/MRI scans of the heart, for surface reconstruction of cardiac chamber models from sparse 3D point cloud data acquired during electroanatomical mapping.
  • results: Patient-specific cardiac chamber models can be generated from few acquired locations, reducing procedure time and x-ray exposure; the Bayesian approach additionally provides a natural framework for interpretability, giving insight into what the neural network learns from the segmented CT/MRI images used to train it.
    Abstract Electroanatomical mapping is a technique used in cardiology to create a detailed 3D map of the electrical activity in the heart. It is useful for diagnosis, treatment planning and real time guidance in cardiac ablation procedures to treat arrhythmias like atrial fibrillation. A probabilistic machine learning model trained on a library of CT/MRI scans of the heart can be used during electroanatomical mapping to generate a patient-specific 3D model of the chamber being mapped. The use of probabilistic machine learning models under a Bayesian framework provides a way to quantify uncertainty in results and provide a natural framework of interpretability of the model. Here we introduce a Bayesian approach to surface reconstruction of cardiac chamber models from a sparse 3D point cloud data acquired during electroanatomical mapping. We show how probabilistic graphical models trained on segmented CT/MRI data can be used to generate cardiac chamber models from few acquired locations thereby reducing procedure time and x-ray exposure. We show how they provide insight into what the neural network learns from the segmented CT/MRI images used to train the network, which provides explainability to the resulting cardiac chamber models generated by the model.

Active Foundational Models for Fault Diagnosis of Electrical Motors

  • paper_url: http://arxiv.org/abs/2311.15516
  • repo_url: None
  • paper_authors: Sriram Anbalagan, Sai Shashank GP, Deepesh Agarwal, Balasubramaniam Natarajan, Babji Srinivasan
  • for: Improving the accuracy and reliability of fault detection and diagnosis for electrical motors, to ensure the safe and reliable operation of industrial systems.
  • methods: Proposes a foundational-model-based Active Learning framework that needs only a small number of the most informative labeled samples and harnesses the large amount of available unlabeled condition-monitoring data by combining Active Learning with contrastive self-supervised learning; the backbone is a transformer network trained with an advanced nearest-neighbor contrastive self-supervised method.
  • results: Fine-tuning the backbone for multiple target tasks on three distinct machine-bearing fault datasets shows superior performance over state-of-the-art fault diagnosis methods while using less labeled data.
    Abstract Fault detection and diagnosis of electrical motors are of utmost importance in ensuring the safe and reliable operation of several industrial systems. Detection and diagnosis of faults at the incipient stage allows corrective actions to be taken in order to reduce the severity of faults. The existing data-driven deep learning approaches for machine fault diagnosis rely extensively on huge amounts of labeled samples, where annotations are expensive and time-consuming. However, a major portion of unlabeled condition monitoring data is not exploited in the training process. To overcome this limitation, we propose a foundational model-based Active Learning framework that utilizes less amount of labeled samples, which are most informative and harnesses a large amount of available unlabeled data by effectively combining Active Learning and Contrastive Self-Supervised Learning techniques. It consists of a transformer network-based backbone model trained using an advanced nearest-neighbor contrastive self-supervised learning method. This approach empowers the backbone to learn improved representations of samples derived from raw, unlabeled vibration data. Subsequently, the backbone can undergo fine-tuning to address a range of downstream tasks, both within the same machines and across different machines. The effectiveness of the proposed methodology has been assessed through the fine-tuning of the backbone for multiple target tasks using three distinct machine-bearing fault datasets. The experimental evaluation demonstrates a superior performance as compared to existing state-of-the-art fault diagnosis methods with less amount of labeled data.

Improving Word Sense Disambiguation in Neural Machine Translation with Salient Document Context

  • paper_url: http://arxiv.org/abs/2311.15507
  • repo_url: None
  • paper_authors: Elijah Rippeth, Marine Carpuat, Kevin Duh, Matt Post
  • for: Resolving lexical ambiguity (word sense disambiguation) in machine translation.
  • methods: Incorporates a small amount of extra-sentential context into neural MT: related sentences are collected for each input to construct pseudo-documents, and salient words from them are encoded as a prefix to each source sentence to condition the translation.
  • results: Translates ambiguous source words better than strong sentence-level baselines and comparably to document-level baselines while reducing training costs.
    Abstract Lexical ambiguity is a challenging and pervasive problem in machine translation (MT). We introduce a simple and scalable approach to resolve translation ambiguity by incorporating a small amount of extra-sentential context in neural MT. Our approach requires no sense annotation and no change to standard model architectures. Since actual document context is not available for the vast majority of MT training data, we collect related sentences for each input to construct pseudo-documents. Salient words from pseudo-documents are then encoded as a prefix to each source sentence to condition the generation of the translation. To evaluate, we release DocMuCoW, a challenge set for translation disambiguation based on the English-German MuCoW (Raganato et al., 2020) augmented with document IDs. Extensive experiments show that our method translates ambiguous source words better than strong sentence-level baselines and comparable document-level baselines while reducing training costs.
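
Concretely, for each source sentence the method gathers related sentences into a pseudo-document, extracts salient words, and prepends them to the source. A toy frequency-based salience sketch follows; the paper's actual salience scoring, stopword handling, and separator token may differ.

```python
from collections import Counter

STOP = {"the", "a", "of", "to", "and", "in", "is", "on", "it", "at"}

def salient_prefix(source, related_sentences, k=3):
    """Prefix the source sentence with the k most frequent content words
    from its pseudo-document of related sentences."""
    doc = " ".join(related_sentences).lower().split()
    counts = Counter(w for w in doc if w not in STOP and w.isalpha())
    salient = [w for w, _ in counts.most_common(k)]
    return " ".join(salient) + " <sep> " + source

related = ["The bank approved the loan yesterday.",
           "Interest rates at the bank rose again.",
           "Loan officers reviewed the application."]
print(salient_prefix("She went to the bank.", related))
# -> "bank loan approved <sep> She went to the bank."
```

The prefix disambiguates "bank" toward its financial sense before the translation model ever sees the sentence.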

Adaptive Image Registration: A Hybrid Approach Integrating Deep Learning and Optimization Functions for Enhanced Precision

  • paper_url: http://arxiv.org/abs/2311.15497
  • repo_url: None
  • paper_authors: Gabriel De Araujo, Shanlin Sun, Xiaohui Xie
  • for: Combining learning-based and optimization-based image registration methods into a single streamlined framework.
  • methods: Uses the outputs of the learning-based method as initial parameters for the optimization stage, prioritizing computational power for the image pairs with the greatest loss.
  • results: A 0.3% improvement in testing when the best-performing state-of-the-art model serves as the framework's backbone, with the same inference time and only a 0.8% loss in deformation-field smoothness.
    Abstract Image registration has traditionally been done using two distinct approaches: learning based methods, relying on robust deep neural networks, and optimization-based methods, applying complex mathematical transformations to warp images accordingly. Of course, both paradigms offer advantages and disadvantages, and, in this work, we seek to combine their respective strengths into a single streamlined framework, using the outputs of the learning based method as initial parameters for optimization while prioritizing computational power for the image pairs that offer the greatest loss. Our investigations showed that an improvement of 0.3\% in testing when utilizing the best performing state-of-the-art model as the backbone of the framework, while maintaining the same inference time and with only a 0.8\% loss in deformation field smoothness.

Optimizing and Fine-tuning Large Language Model for Urban Renewal

  • paper_url: http://arxiv.org/abs/2311.15490
  • repo_url: None
  • paper_authors: Xi Wang, Xianyao Ling, Tom Zhang, Xuecao Li, Shaolan Wang, Zhixing Li, Liang Zhang, Peng Gong
  • for: Explores adaptive applications of large language models (LLMs) in the urban renewal domain, improving their performance and text-generation quality on knowledge question-answering (QA) tasks.
  • methods: Based on ChatGLM, QA datasets are automatically generated from an urban renewal scientific literature corpus in a self-instruct manner; the model is then jointly fine-tuned with the Prefix and LoRA methods to create an LLM for urban renewal.
  • results: The proposed joint fine-tuning method significantly improves QA performance: roughly 5% higher Bleu and Rouge scores on the test set than LoRA fine-tuning alone, and roughly 15%-20% higher than the model before fine-tuning.
    Abstract This study aims to innovatively explore adaptive applications of large language models (LLM) in urban renewal. It also aims to improve its performance and text generation quality for knowledge question-answering (QA) tasks. Based on the ChatGLM, we automatically generate QA datasets using urban renewal scientific literature corpora in a self-instruct manner and then conduct joint fine-tuning training on the model using the Prefix and LoRA fine-tuning methods to create an LLM for urban renewal. By guiding the LLM to automatically generate QA data based on prompt words and given text, it is possible to quickly obtain datasets in the urban renewal field and provide data support for the fine-tuning training of LLMs. The experimental results show that the joint fine-tuning training method proposed in this study can significantly improve the performance of LLM on the QA tasks. Compared with LoRA fine-tuning, the method improves the Bleu and Rouge metrics on the test by about 5%; compared with the model before fine-tuning, the method improves the Bleu and Rouge metrics by about 15%-20%. This study demonstrates the effectiveness and superiority of the joint fine-tuning method using Prefix and LoRA for ChatGLM in the urban renewal knowledge QA tasks. It provides a new approach for fine-tuning LLMs on urban renewal-related tasks.
    摘要

Global $\mathcal{L}^2$ minimization with certainty via geometrically adapted gradient descent in Deep Learning

  • paper_url: http://arxiv.org/abs/2311.15487
  • repo_url: None
  • paper_authors: Thomas Chen
  • for: 这篇论文针对深度学习网络中 $\mathcal{L}^2$ 成本函数的最小化问题,提出了两种修改后的梯度下降流:一种适用于过参数化情形,另一种适用于欠参数化情形。
  • methods: 论文对梯度下降流进行了修改,在过参数化情形中考虑拉回向量丛(pullback vector bundle)结构,在欠参数化情形中考虑前推向量丛(pushforward vector bundle)结构。
  • results: 论文证明,只要某个秩条件成立,修改后梯度下降的所有轨道都会以一致的指数收敛速率将 $\mathcal{L}^2$ 成本函数驱动到其全局最小值。此外,论文还指出了该结果与次黎曼几何(sub-Riemannian geometry)的联系。
    Abstract We consider the gradient descent flow widely used for the minimization of the $\mathcal{L}^2$ cost function in Deep Learning networks, and introduce two modified versions; one adapted for the overparametrized setting, and the other for the underparametrized setting. Both have a clear and natural invariant geometric meaning, taking into account the pullback vector bundle structure in the overparametrized, and the pushforward vector bundle structure in the underparametrized setting. In the overparametrized case, we prove that, provided that a rank condition holds, all orbits of the modified gradient descent drive the $\mathcal{L}^2$ cost to its global minimum at a uniform exponential convergence rate. We point out relations of the latter to sub-Riemannian geometry.
    摘要 我们考虑深度学习网络中广泛用于最小化 $\mathcal{L}^2$ 成本函数的梯度下降流,并提出两种修改版本:一种适用于过参数化情形,另一种适用于欠参数化情形。两者都具有明确而自然的不变几何意义:过参数化情形考虑拉回向量丛结构,欠参数化情形考虑前推向量丛结构。在过参数化情形中,我们证明,只要某个秩条件成立,修改后梯度下降的所有轨道都会以一致的指数收敛速率将 $\mathcal{L}^2$ 成本驱动到其全局最小值。我们还指出了该结果与次黎曼几何的关系。

Automatic Time Signature Determination for New Scores Using Lyrics for Latent Rhythmic Structure

  • paper_url: http://arxiv.org/abs/2311.15480
  • repo_url: None
  • paper_authors: Callie C. Liao, Duoduo Liao, Jesse Guessford
  • for: 这篇论文旨在开发一种仅以歌词为输入、自动为歌曲生成合适拍号的算法,以提升 AI 音乐生成的质量。
  • methods: 论文使用可解释的机器学习模型,并提出多种发现歌词模式、构造同时包含歌词、节奏与统计信息的新特征的方法。
  • results: 实验结果显示,该方法最高可达 97.6% 的 F1 分数和 0.996 的 ROC 曲线下面积(AUC)。
    Abstract There has recently been a sharp increase in interest in Artificial Intelligence-Generated Content (AIGC). Despite this, musical components such as time signatures have not been studied sufficiently to form an algorithmic determination approach for new compositions, especially lyrical songs. This is likely because of the neglect of musical details, which is critical for constructing a robust framework. Specifically, time signatures establish the fundamental rhythmic structure for almost all aspects of a song, including the phrases and notes. In this paper, we propose a novel approach that only uses lyrics as input to automatically generate a fitting time signature for lyrical songs and uncover the latent rhythmic structure utilizing explainable machine learning models. In particular, we devise multiple methods that are associated with discovering lyrical patterns and creating new features that simultaneously contain lyrical, rhythmic, and statistical information. In this approach, the best of our experimental results reveal a 97.6% F1 score and a 0.996 Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) score. In conclusion, our research directly generates time signatures from lyrics automatically for new scores utilizing machine learning, which is an innovative idea that approaches an understudied component of musicology and therefore contributes significantly to the future of Artificial Intelligence (AI) music generation.
    摘要 Recently, there has been a surge of interest in Artificial Intelligence-Generated Content (AIGC). However, the study of musical components such as time signatures has been insufficient, especially for lyrical songs. This is likely due to the neglect of musical details, which are crucial for establishing a robust framework. Time signatures provide the fundamental rhythmic structure for almost all aspects of a song, including phrases and notes.In this paper, we propose a novel approach that uses lyrics as input to automatically generate a fitting time signature for lyrical songs and uncover the latent rhythmic structure using explainable machine learning models. We devise multiple methods that discover lyrical patterns and create new features that simultaneously contain lyrical, rhythmic, and statistical information.Our experimental results show a 97.6% F1 score and a 0.996 Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) score. In conclusion, our research directly generates time signatures from lyrics automatically for new scores using machine learning, which is an innovative idea that approaches an understudied component of musicology and contributes significantly to the future of Artificial Intelligence (AI) music generation.
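
A toy sketch of the general pipeline above (lyric-derived rhythmic features feeding a classifier); the vowel-group syllable heuristic, feature set, and random-forest model are illustrative stand-ins, not the paper's explainable models.

```python
# Toy sketch: derive simple per-line lyric features and classify time signature.
import re
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def syllables(word: str) -> int:
    # Crude heuristic: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def lyric_features(lyric_lines: list[str]) -> np.ndarray:
    counts = np.array([sum(syllables(w) for w in line.split())
                       for line in lyric_lines], dtype=float)
    return np.array([counts.mean(), counts.std(), counts.max(),
                     counts.min(), float(len(counts))])

# Hypothetical training data: (lyric lines, time-signature label).
songs = [(["Twinkle twinkle little star", "How I wonder what you are"], "4/4"),
         (["Blue moon you saw me standing alone",
           "Without a dream in my heart"], "3/4")]
X = np.stack([lyric_features(lines) for lines, _ in songs])
y = [label for _, label in songs]
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(clf.predict([lyric_features(["Row row row your boat"])]))
```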

Privacy-Preserving Data Sharing in Agriculture: Enforcing Policy Rules for Secure and Confidential Data Synthesis

  • paper_url: http://arxiv.org/abs/2311.15460
  • repo_url: https://github.com/ebiquity/policy_enforced_data_generation
  • paper_authors: Anantaa Kotal, Lavanya Elluri, Deepti Gupta, Varun Mandalapu, Anupam Joshi
  • for: 这篇论文旨在帮助农业社区利用大数据技术优化资源使用、提高生产力并增强农业实践的可持续性。
  • methods: 论文利用大数据技术收集和分析来自感知器、卫星和农民调查等多种来源的数据;同时采用基于深度学习的合成数据生成技术,在共享数据的同时保护数据主体的隐私。
  • results: 论文通过实验表明,借助隐私保护技术可以在农业领域广泛共享数据而不侵犯数据主体的隐私;同时提出了一种能在隐私保护数据生成中强制执行数据隐私政策规则的新框架。
    Abstract Big Data empowers the farming community with the information needed to optimize resource usage, increase productivity, and enhance the sustainability of agricultural practices. The use of Big Data in farming requires the collection and analysis of data from various sources such as sensors, satellites, and farmer surveys. While Big Data can provide the farming community with valuable insights and improve efficiency, there is significant concern regarding the security of this data as well as the privacy of the participants. Privacy regulations, such as the EU GDPR, the EU Code of Conduct on agricultural data sharing by contractual agreement, and the proposed EU AI law, have been created to address the issue of data privacy and provide specific guidelines on when and how data can be shared between organizations. To make confidential agricultural data widely available for Big Data analysis without violating the privacy of the data subjects, we consider privacy-preserving methods of data sharing in agriculture. Deep learning-based synthetic data generation has been proposed for privacy-preserving data sharing. However, there is a lack of compliance with documented data privacy policies in such privacy-preserving efforts. In this study, we propose a novel framework for enforcing privacy policy rules in privacy-preserving data generation algorithms. We explore several available agricultural codes of conduct, extract knowledge related to the privacy constraints in data, and use the extracted knowledge to define privacy bounds in a privacy-preserving generative model. We use our framework to generate synthetic agricultural data and present experimental results that demonstrate the utility of the synthetic dataset in downstream tasks. We also show that our framework can evade potential threats and secure data based on applicable regulatory policy rules.
    摘要 大数据为农业社区提供了优化资源使用、提高生产力和增强农业实践可持续性所需的信息。在农业中使用大数据需要收集和分析来自感知器、卫星和农民调查等多种来源的数据。虽然大数据可以为农业社区带来有价值的洞察并提高效率,但数据安全和参与者隐私仍令人担忧。为此,欧盟GDPR、欧盟关于以合同方式共享农业数据的行为准则以及拟议中的欧盟人工智能法等隐私法规相继出台,对数据在组织间共享的时机和方式给出了具体指引。为了在不侵犯数据主体隐私的前提下使机密农业数据广泛可用于大数据分析,我们考虑农业领域中保护隐私的数据共享方法。基于深度学习的合成数据生成已被提议用于隐私保护的数据共享,然而现有工作往往不符合成文的数据隐私政策。在本研究中,我们提出了一种在隐私保护数据生成算法中强制执行隐私政策规则的新框架:我们考察多个现有的农业行为准则,提取其中与数据隐私约束相关的知识,并利用这些知识为隐私保护生成模型定义隐私边界。我们使用该框架生成合成农业数据,并给出实验结果以证明合成数据在下游任务中的有用性;我们还表明该框架可以依据适用的法规政策规则规避潜在威胁并保护数据。

cs.CL - 2023-11-27

Reducing Gender Bias in Machine Translation through Counterfactual Data Generation

  • paper_url: http://arxiv.org/abs/2311.16362
  • repo_url: None
  • paper_authors: Ranjita Naik, Spencer Rarrick, Vishal Chowdhary
  • for: 提高 NMT 系统翻译性别的准确性与公平性
  • methods: 使用人工构造的性别平衡职业词数据集进行微调;借助修改后的训练目标或在推理时引入额外模型来恢复翻译质量;利用反事实数据生成技术构造域内数据
  • results: 仅用基础模型训练语料中的随机样本补充人工数据集即可显著缓解灾难性遗忘;在法语、西班牙语和意大利语等形态丰富语言上提升了 NMT 系统的性别准确性,且不造成明显的翻译质量下降
    Abstract Recent advances in neural methods have led to substantial improvement in the quality of Neural Machine Translation (NMT) systems. However, these systems frequently produce translations with inaccurate gender (Stanovsky et al., 2019), which can be traced to bias in training data. Saunders and Byrne (2020) tackle this problem with a handcrafted dataset containing balanced gendered profession words. By using this data to fine-tune an existing NMT model, they show that gender bias can be significantly mitigated, albeit at the expense of translation quality due to catastrophic forgetting. They recover some of the lost quality with modified training objectives or additional models at inference. We find, however, that simply supplementing the handcrafted dataset with a random sample from the base model training corpus is enough to significantly reduce the catastrophic forgetting. We also propose a novel domain-adaptation technique that leverages in-domain data created with the counterfactual data generation techniques proposed by Zmigrod et al. (2019) to further improve accuracy on the WinoMT challenge test set without significant loss in translation quality. We show its effectiveness in NMT systems from English into three morphologically rich languages French, Spanish, and Italian. The relevant dataset and code will be available at Github.
    摘要 近年来,神经方法的进步使神经机器翻译(NMT)系统的质量得到显著提升。然而,这些系统经常生成性别不准确的翻译(Stanovsky等,2019),其根源在于训练数据中的偏见。Saunders和Byrne(2020)通过一个包含性别平衡职业词的人工数据集来应对这一问题:用该数据集微调现有 NMT 模型可显著缓解性别偏见,但由于灾难性遗忘而付出翻译质量的代价;他们再通过修改训练目标或在推理时引入额外模型来恢复部分损失的质量。我们发现,只需用基础模型训练语料中的随机样本补充该人工数据集,即可显著减少灾难性遗忘。我们还提出一种新的领域适应技术,利用Zmigrod等(2019)提出的反事实数据生成技术构造域内数据,在不明显损失翻译质量的情况下进一步提高 WinoMT 挑战测试集上的准确率。我们在英语到法语、西班牙语和意大利语这三种形态丰富语言的 NMT 系统中验证了其有效性。相关数据集和代码将在 GitHub 上公开。
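
A simplified stand-in for the counterfactual data generation step: swapping gendered words on the source side. Zmigrod et al.'s method additionally handles morphological reinflection on the target side, which this dictionary-swap sketch ignores.

```python
# Illustrative counterfactual augmentation via gendered-word swapping.
# Note: "her" is ambiguous (possessive vs. object); this sketch ignores that.
SWAPS = {"he": "she", "she": "he", "him": "her", "her": "him",
         "his": "her", "actor": "actress", "actress": "actor"}

def counterfactual(sentence: str) -> str:
    out = []
    for tok in sentence.split():
        core = tok.strip(".,!?").lower()
        swapped = SWAPS.get(core, core)
        # Preserve trailing punctuation; casing handling is simplified.
        tail = tok[len(tok.rstrip(".,!?")):]
        out.append(swapped + tail)
    return " ".join(out)

print(counterfactual("She thanked him for his help."))
# -> "he thanked her for her help."
```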

Comprehensive Benchmarking of Entropy and Margin Based Scoring Metrics for Data Selection

  • paper_url: http://arxiv.org/abs/2311.16302
  • repo_url: None
  • paper_authors: Anusha Sabbineni, Nikhil Anand, Maria Minakova
  • for: 这篇论文旨在评估数据选择方法在工业规模、低资源语言设置下的效果,并利用熵与 Error L2-Norm(EL2N)分数来挑选重要的训练样本。
  • methods: 论文使用熵和 EL2N 分数评估候选样本的"有用性"或"难度",并据此从大规模弱信号标注数据中筛选高质量数据集。
  • results: 与随机选择基线相比,基于分数的选择方法可使语义错误率降低2%,领域分类错误率降低4%-7%。
    Abstract While data selection methods have been studied extensively in active learning, data pruning, and data augmentation settings, there is little evidence for the efficacy of these methods in industry scale settings, particularly in low-resource languages. Our work presents ways of assessing prospective training examples in those settings for their "usefulness" or "difficulty". We also demonstrate how these measures can be used in selecting important examples for training supervised machine learning models. We primarily experiment with entropy and Error L2-Norm (EL2N) scores. We use these metrics to curate high quality datasets from a large pool of \textit{Weak Signal Labeled} data, which assigns no-defect high confidence hypotheses during inference as ground truth labels. We then conduct training data augmentation experiments using these de-identified datasets and demonstrate that score-based selection can result in a 2% decrease in semantic error rate and 4%-7% decrease in domain classification error rate when compared to the baseline technique of random selection.
    摘要 数据选择方法已在主动学习、数据剪枝和数据增强等场景中得到广泛研究,但在工业规模设置下,尤其是低资源语言中,这些方法的效果还缺乏证据。我们的工作探讨了在此类设置下评估候选训练样本"有用性"或"难度"的度量方法,并展示了如何利用这些度量为监督机器学习模型挑选重要样本。我们主要实验了熵和 Error L2-Norm(EL2N)分数,用这些指标从大量弱信号标注数据(推理时将高置信度的无缺陷假设作为真实标签)中筛选高质量数据集。随后,我们用这些脱敏数据集进行训练数据增强实验,结果表明,与随机选择的基线技术相比,基于分数的选择可使语义错误率降低2%,领域分类错误率降低4%-7%。
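
The two scores are easy to state concretely. A minimal sketch, assuming per-example classifier logits are available:

```python
# Predictive entropy and Error L2-Norm (EL2N = ||softmax(logits) - one_hot||_2).
import torch
import torch.nn.functional as F

def entropy_scores(logits: torch.Tensor) -> torch.Tensor:
    """logits: (N, C). Higher entropy = model is less confident."""
    p = F.softmax(logits, dim=-1)
    return -(p * torch.log(p.clamp_min(1e-12))).sum(dim=-1)

def el2n_scores(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """labels: (N,) class indices. Larger EL2N = harder example."""
    p = F.softmax(logits, dim=-1)
    one_hot = F.one_hot(labels, num_classes=logits.size(-1)).float()
    return (p - one_hot).norm(dim=-1)

logits = torch.randn(4, 3)
labels = torch.tensor([0, 1, 2, 0])
scores = el2n_scores(logits, labels)
keep = scores.argsort(descending=True)[:2]   # e.g. keep the hardest half
print(entropy_scores(logits), scores, keep)
```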

Influence Scores at Scale for Efficient Language Data Sampling

  • paper_url: http://arxiv.org/abs/2311.16298
  • repo_url: https://github.com/Aryia-Behroziuan/neurons
  • paper_authors: Nikhil Anand, Joshua Tan, Maria Minakova
  • for: 本研究旨在探讨影响分数在语言分类任务中的适用性。
  • methods: 本文评估了多种最初在计算机视觉设置中提出的影响分数方法,包括基于模型置信度和基于梯度变化(VoG)的方法。
  • results: 实验结果表明,在许多情况下,仅用约50%的原始数据微调基于编码器的语言模型即可保持性能指标不下降。此外,本文还总结了应用影响分数的实践经验,量化了噪声与类别不平衡数据的影响。
    Abstract Modern ML systems ingest data aggregated from diverse sources, such as synthetic, human-annotated, and live customer traffic. Understanding \textit{which} examples are important to the performance of a learning algorithm is crucial for efficient model training. Recently, a growing body of literature has given rise to various "influence scores," which use training artifacts such as model confidence or checkpointed gradients to identify important subsets of data. However, these methods have primarily been developed in computer vision settings, and it remains unclear how well they generalize to language-based tasks using pretrained models. In this paper, we explore the applicability of influence scores in language classification tasks. We evaluate a diverse subset of these scores on the SNLI dataset by quantifying accuracy changes in response to pruning training data through random and influence-score-based sampling. We then stress-test one of the scores -- "variance of gradients" (VoG) from Agarwal et al. (2022) -- in an NLU model stack that was exposed to dynamic user speech patterns in a voice assistant type of setting. Our experiments demonstrate that in many cases, encoder-based language models can be finetuned on roughly 50% of the original data without degradation in performance metrics. Along the way, we summarize lessons learned from applying out-of-the-box implementations of influence scores, quantify the effects of noisy and class-imbalanced data, and offer recommendations on score-based sampling for better accuracy and training efficiency.
    摘要 现代机器学习系统会摄取来自多种来源的数据,包括合成数据、人工标注数据和实时客户流量。了解哪些样本对学习算法的性能最重要,是高效训练模型的关键。最近,越来越多的文献提出了各种"影响分数",利用模型置信度或检查点梯度等训练产物来识别重要的数据子集。然而,这些方法主要是在计算机视觉设置下发展起来的,它们能否推广到使用预训练模型的语言任务尚不明确。在本文中,我们探讨影响分数在语言分类任务中的适用性。我们在SNLI数据集上评估了多种此类分数,通过随机采样与基于影响分数的采样对训练数据进行剪枝并量化准确率变化。随后,我们对其中一种分数——Agarwal等(2022)提出的"梯度方差"(VoG)——在一个暴露于语音助手场景下动态用户语音模式的NLU模型栈中进行了压力测试。实验表明,在许多情况下,基于编码器的语言模型只需约50%的原始数据进行微调即可保持性能指标不下降。在此过程中,我们总结了直接套用现成影响分数实现的经验教训,量化了噪声与类别不平衡数据的影响,并就基于分数的采样给出了提高准确率和训练效率的建议。
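
A simplified sketch of the variance-of-gradients idea: per-example gradients are collected at several training checkpoints and their variance is averaged. Agarwal et al. define VoG on pre-softmax outputs w.r.t. pixels; this generic input-gradient version is an approximation.

```python
# Simplified VoG: variance across checkpoints of per-example input gradients.
import torch

def vog(checkpointed_models, x, y, loss_fn):
    """x: (N, D) inputs, y: (N,) labels. Returns one VoG score per example."""
    grads = []
    for model in checkpointed_models:
        xi = x.clone().requires_grad_(True)
        loss = loss_fn(model(xi), y)
        g, = torch.autograd.grad(loss, xi)
        grads.append(g)                      # (N, D) per checkpoint
    G = torch.stack(grads)                   # (K, N, D)
    return G.var(dim=0, unbiased=False).mean(dim=-1)   # (N,)

# Tiny usage example with two stand-in "checkpoints" of a linear model.
models = [torch.nn.Linear(5, 3) for _ in range(2)]
x = torch.randn(8, 5); y = torch.randint(0, 3, (8,))
print(vog(models, x, y, torch.nn.functional.cross_entropy))
```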

Student Mastery or AI Deception? Analyzing ChatGPT’s Assessment Proficiency and Evaluating Detection Strategies

  • paper_url: http://arxiv.org/abs/2311.16292
  • repo_url: None
  • paper_authors: Kevin Wang, Seth Akins, Abdallah Mohammed, Ramon Lawrence
  • for: This paper investigates the performance of ChatGPT in completing introductory computer science assignments and the effectiveness of existing detection methods in identifying AI solutions.
  • methods: The paper evaluates ChatGPT's performance across three courses (CS1, CS2, and databases) and examines existing detection methods such as MOSS, JPlag, and GPTzero, as well as instructors' and teaching assistants' heuristics for distinguishing between student and AI code.
  • results: ChatGPT completes almost all introductory assessments perfectly, and existing detection methods have mixed success in identifying AI solutions. Instructors' and teaching assistants' heuristics are not sufficiently accurate in distinguishing between student and AI code. The findings emphasize the need for adapting assessments and improved detection methods.
    Abstract Generative AI systems such as ChatGPT have a disruptive effect on learning and assessment. Computer science requires practice to develop skills in problem solving and programming that are traditionally developed using assignments. Generative AI has the capability of completing these assignments for students with high accuracy, which dramatically increases the potential for academic integrity issues and students not achieving desired learning outcomes. This work investigates the performance of ChatGPT by evaluating it across three courses (CS1,CS2,databases). ChatGPT completes almost all introductory assessments perfectly. Existing detection methods, such as MOSS and JPlag (based on similarity metrics) and GPTzero (AI detection), have mixed success in identifying AI solutions. Evaluating instructors and teaching assistants using heuristics to distinguish between student and AI code shows that their detection is not sufficiently accurate. These observations emphasize the need for adapting assessments and improved detection methods.
    摘要

Applications of Large Language Models in Data Processing: Innovative Approaches to Segmenting and Renewing Information

  • paper_url: http://arxiv.org/abs/2311.16267
  • repo_url: None
  • paper_authors: Yu-Chen Lin, Akhilesh Kumar, Wen-Liang Zhang, Norman Chang, Muhammad Zakir, Rucha Apte, Chao Wang, Jyh-Shing Roger Jang
  • for: 本研究探讨特定领域应用中有效的代码生成方法,包括使用大语言模型(LLM)进行数据分段与更新,以及通过调整提示来激发 LLM 更深入的思考。
  • methods: 本研究以一个真实的企业产品为例,提供用户手册、API 文档等数据,并将这些数据转化为语义向量,以更好地反映其真实定位。
  • results: 本研究借助多种提示技术,在简单到中等复杂度的任务中取得约70%的准确率;并通过基于 llama2 的微调,从有限数量的脚本生成更多脚本,以检验其在专业领域代码生成中的效果。
    Abstract Our paper investigates effective methods for code generation in "specific-domain" applications, including the use of Large Language Models (LLMs) for data segmentation and renewal, as well as stimulating deeper thinking in LLMs through prompt adjustments. Using a real company product as an example, we provide user manuals, API documentation, and other data. The ideas discussed in this paper help segment and then convert this data into semantic vectors to better reflect their true positioning. Subsequently, user requirements are transformed into vectors to retrieve the most relevant content, achieving about 70% accuracy in simple to medium-complexity tasks through various prompt techniques. This paper is the first to enhance specific-domain code generation effectiveness from this perspective. Additionally, we experiment with generating more scripts from a limited number using llama2-based fine-tuning to test its effectiveness in professional domain code generation. This is a challenging and promising field, and once achieved, it will not only lead to breakthroughs in LLM development across multiple industries but also enable LLMs to understand and learn any new knowledge effectively.
    摘要 我们的论文研究特定领域应用中有效的代码生成方法,包括使用大语言模型(LLM)进行数据分段与更新,以及通过调整提示来激发更深入的思考。我们以一个真实的公司产品为例,提供用户手册、API文档及其他数据。本文讨论的思路可将这些数据分段并转换为语义向量,以更好地反映其真实定位;随后将用户需求也转换为向量,检索最相关的内容,借助多种提示技术在简单到中等复杂度任务中取得约70%的准确率。本文是首个从这一视角提升特定领域代码生成效果的工作。此外,我们还基于llama2进行微调,尝试从有限数量的脚本生成更多脚本,以检验其在专业领域代码生成中的效果。这是一个兼具挑战性与前景的领域,一旦实现,不仅将推动LLM在多个行业的发展突破,还能让LLM有效地理解和学习任何新知识。
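
The retrieval step described above can be sketched with off-the-shelf sentence embeddings; the embedding model name and the documentation chunks below are assumptions for illustration, not the paper's actual setup.

```python
# Embed documentation chunks and queries into one space; retrieve top-k.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")
chunks = ["API foo(x) returns the simulation mesh.",
          "Section 2.3: configuring thermal analysis.",
          "bar(y) exports results to CSV."]
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    q = model.encode([query], normalize_embeddings=True)[0]
    sims = chunk_vecs @ q                    # cosine similarity (normalized)
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

print(retrieve("how do I export my results?"))
```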

An Exploration of Left-Corner Transformations

  • paper_url: http://arxiv.org/abs/2311.16258
  • repo_url: https://github.com/jettbrains/-L-
  • paper_authors: Andreas Opedal, Eleftheria Tsipidi, Tiago Pimentel, Ryan Cotterell, Tim Vieira
  • for: 该论文旨在通过左角变换(left-corner transformation)与推测变换(speculation transformation)提高上下文无关文法的可解析性。
  • methods: 该论文提出广义左角变换(GLCT),它统一了左角变换与推测变换。GLCT 支持半环加权的产生式规则,并对哪些左角可被移动提供更细粒度的控制,以消除左递归。
  • results: 实证研究表明,GLCT 能高效地消除文法中的左递归;GLCT 与推测变换定义等价的加权语言,但二者的推导树在结构上有重要差异。
    Abstract The left-corner transformation (Rosenkrantz and Lewis, 1970) is used to remove left recursion from context-free grammars, which is an important step towards making the grammar parsable top-down with simple techniques. This paper generalizes prior left-corner transformations to support semiring-weighted production rules and to provide finer-grained control over which left corners may be moved. Our generalized left-corner transformation (GLCT) arose from unifying the left-corner transformation and speculation transformation (Eisner and Blatz, 2007), originally for logic programming. Our new transformation and speculation define equivalent weighted languages. Yet, their derivation trees are structurally different in an important way: GLCT replaces left recursion with right recursion, and speculation does not. We also provide several technical results regarding the formal relationships between the outputs of GLCT, speculation, and the original grammar. Lastly, we empirically investigate the efficiency of GLCT for left-recursion elimination from grammars of nine languages.
    摘要 左角变换(Rosenkrantz 和 Lewis,1970)用于消除上下文无关文法中的左递归,这是使文法能够用简单技术自顶向下解析的重要一步。本文将先前的左角变换加以推广,使其支持半环加权的产生式规则,并对哪些左角可被移动提供更细粒度的控制。我们的广义左角变换(GLCT)源于对左角变换与推测变换(Eisner 和 Blatz,2007,最初用于逻辑编程)的统一。我们的新变换与推测变换定义等价的加权语言,但二者的推导树在一个重要方面结构不同:GLCT 将左递归替换为右递归,而推测变换不会。我们还给出了若干关于 GLCT、推测变换与原始文法输出之间形式关系的技术结果。最后,我们对九种语言的文法实证考察了 GLCT 消除左递归的效率。
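
For intuition, the classic direct left-recursion elimination that GLCT generalizes (with semiring weights and finer-grained left-corner control) looks like this:

```python
# Classic direct left-recursion elimination, for intuition only:
#   A -> A a | b   becomes   A -> b A',  A' -> a A' | eps.
def remove_direct_left_recursion(grammar):
    """grammar: dict nonterminal -> list of RHS tuples."""
    new = {}
    for A, rhss in grammar.items():
        rec = [r[1:] for r in rhss if r and r[0] == A]       # A -> A alpha
        non = [r for r in rhss if not r or r[0] != A]        # A -> beta
        if not rec:
            new[A] = rhss
            continue
        Ap = A + "'"
        new[A] = [beta + (Ap,) for beta in non]
        new[Ap] = [alpha + (Ap,) for alpha in rec] + [()]    # () is epsilon
    return new

g = {"E": [("E", "+", "T"), ("T",)]}
print(remove_direct_left_recursion(g))
# {'E': [('T', "E'")], "E'": [('+', 'T', "E'"), ()]}
```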

How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs

  • paper_url: http://arxiv.org/abs/2311.16101
  • repo_url: https://github.com/ucsc-vlaa/vllm-safety-benchmark
  • paper_authors: Haoqin Tu, Chenhang Cui, Zijun Wang, Yiyang Zhou, Bingchen Zhao, Junlin Han, Wangchunshu Zhou, Huaxiu Yao, Cihang Xie
  • for: 该研究探讨视觉语言模型(VLLM)在视觉推理中的潜力。与以往工作不同,我们把关注点从评估标准性能转向引入一套全面的安全评估体系,涵盖分布外(OOD)泛化与对抗鲁棒性。
  • methods: 我们提出两个新的 VQA 数据集(各含一个变体),用于在具有挑战性的条件下测试模型性能。在对抗鲁棒性方面,我们提出一种简单的攻击策略,诱导 VLLM 生成与视觉内容无关的回答;此外,我们评估了两种越狱策略,分别针对 VLLM 的视觉组件与语言组件。
  • results: 对18个不同模型的评估结果显示:1)当前 VLLM 在 OOD 文本上表现不佳,但在图像上表现良好,除非视觉信息受限;2)仅欺骗视觉编码器即可轻易误导这些 VLLM,且它们的视觉-语言训练经常破坏安全协议。我们在 https://github.com/UCSC-VLAA/vllm-safety-benchmark 发布了该评估体系。
    Abstract This work focuses on the potential of Vision LLMs (VLLMs) in visual reasoning. Different from prior studies, we shift our focus from evaluating standard performance to introducing a comprehensive safety evaluation suite, covering both out-of-distribution (OOD) generalization and adversarial robustness. For the OOD evaluation, we present two novel VQA datasets, each with one variant, designed to test model performance under challenging conditions. In exploring adversarial robustness, we propose a straightforward attack strategy for misleading VLLMs to produce visual-unrelated responses. Moreover, we assess the efficacy of two jailbreaking strategies, targeting either the vision or language component of VLLMs. Our evaluation of 21 diverse models, ranging from open-source VLLMs to GPT-4V, yields interesting observations: 1) Current VLLMs struggle with OOD texts but not images, unless the visual information is limited; and 2) These VLLMs can be easily misled by deceiving vision encoders only, and their vision-language training often compromise safety protocols. We release this safety evaluation suite at https://github.com/UCSC-VLAA/vllm-safety-benchmark.
    摘要

DUnE: Dataset for Unified Editing

  • paper_url: http://arxiv.org/abs/2311.16087
  • repo_url: https://github.com/feyzaakyurek/dune
  • paper_authors: Afra Feyza Akyürek, Eric Pan, Garry Kuwanto, Derry Wijaya
  • for: 本研究旨在拓展对语言模型的编辑:在不进行全面重训的情况下修改模型的知识或表示,使其产生期望的输出。
  • methods: 本研究将"编辑"的范围扩展到消除偏见、修正推理错误等多种情形,并将编辑定义为任何要求改变模型输出的自然语言表达;在此基础上构建了编辑任务基准 DUnE。
  • results: 大量实验表明,检索增强的语言建模可以优于专门的编辑技术,但两类方法都尚未完全解决该基准所覆盖的通用编辑问题。
    Abstract Even the most advanced language models remain susceptible to errors necessitating to modify these models without initiating a comprehensive retraining process. Model editing refers to the modification of a model's knowledge or representations in a manner that produces the desired outcomes. Prior research primarily centered around editing factual data e.g. "Messi plays for Inter Miami" confining the definition of an edit to a knowledge triplet i.e. (subject, object, relation). However, as the applications of language models expand, so do the diverse ways in which we wish to edit and refine their outputs. In this study, we broaden the scope of the editing problem to include an array of editing cases such as debiasing and rectifying reasoning errors and define an edit as any natural language expression that solicits a change in the model's outputs. We are introducing DUnE-an editing benchmark where edits are natural language sentences and propose that DUnE presents a challenging yet relevant task. To substantiate this claim, we conduct an extensive series of experiments testing various editing approaches to address DUnE, demonstrating their respective strengths and weaknesses. We show that retrieval-augmented language modeling can outperform specialized editing techniques and neither set of approaches has fully solved the generalized editing problem covered by our benchmark.
    摘要

BERT Goes Off-Topic: Investigating the Domain Transfer Challenge using Genre Classification

  • paper_url: http://arxiv.org/abs/2311.16083
  • repo_url: https://github.com/dminus1/genre
  • paper_authors: Dmitri Roussinov, Serge Sharoff
  • for: 这篇论文探讨了预训练语言模型(PLM)在文本分类任务中的性能问题:当底层主题分布发生变化时,PLM 的性能仍存在差距。
  • methods: 作者利用大规模语料和大量主题实证量化了这一现象,并验证了该现象对经典 PLM(如 BERT)和现代大型模型(如 GPT-3)均存在。作者还提出并成功验证了一种可能的补救方案:用主题受控的合成文本增强训练数据。
  • results: 增强后,部分主题的 F1 分数最多提高50%,接近同主题训练的结果,而其他主题则几乎没有改善。该方法还可应用于性别、作者归属和情感分类等其他分类任务。代码和数据可在 https://github.com/dminus1/genre 下载。
    Abstract While performance of many text classification tasks has been recently improved due to Pre-trained Language Models (PLMs), in this paper we show that they still suffer from a performance gap when the underlying distribution of topics changes. For example, a genre classifier trained on \textit{political} topics often fails when tested on documents about \textit{sport} or \textit{medicine}. In this work, we quantify this phenomenon empirically with a large corpus and a large set of topics. Consequently, we verify that domain transfer remains challenging both for classic PLMs, such as BERT, and for modern large models, such as GPT-3. We also suggest and successfully test a possible remedy: after augmenting the training dataset with topically-controlled synthetic texts, the F1 score improves by up to 50\% for some topics, nearing on-topic training results, while others show little to no improvement. While our empirical results focus on genre classification, our methodology is applicable to other classification tasks such as gender, authorship, or sentiment classification. The code and data to replicate the experiments are available at https://github.com/dminus1/genre
    摘要 尽管预训练语言模型(PLM)近来提升了许多文本分类任务的性能,但我们在本文中表明,当底层主题分布发生变化时,它们仍然存在性能差距。例如,在政治主题上训练的体裁分类器,在测试体育或医学类文档时经常失效。在本工作中,我们利用大规模语料和大量主题对这一现象进行了实证量化,并验证了领域迁移对经典 PLM(如 BERT)和现代大型模型(如 GPT-3)而言仍具挑战性。我们还提出并成功检验了一种可能的补救方案:用主题受控的合成文本增强训练集后,部分主题的 F1 分数最多提高50%,接近主题内训练的结果,而其他主题则几乎没有改善。虽然我们的实证结果聚焦于体裁分类,但该方法同样适用于性别、作者归属或情感分类等其他分类任务。复现实验所需的代码和数据见 https://github.com/dminus1/genre

A Quantitative Approach to Understand Self-Supervised Models as Cross-lingual Feature Extractors

  • paper_url: http://arxiv.org/abs/2311.15954
  • repo_url: https://github.com/stellali7/ssl_psr
  • paper_authors: Shuyue Stella Li, Beining Xu, Xiangyu Zhang, Hexin Liu, Wenhan Chao, Leibny Paola Garcia
  • for: 本研究考察英语自监督学习(SSL)模型在跨语言场景中提取的特征特性,并提出一个预测特征表示质量的新指标。
  • methods: 以自动语音识别(ASR)作为下游任务,分析 SSL 模型的模型规模、训练目标和模型结构对其作为特征提取器性能的影响。
  • results: 研究发现,wav2vec2.0 目标中的对比损失(contrastive loss)有助于更有效的跨语言特征提取。PSR 指标与 ASR 性能呈正相关,表明单语言 SSL 模型提取的语音信息可在跨语言设置下用于下游任务。
    Abstract In this work, we study the features extracted by English self-supervised learning (SSL) models in cross-lingual contexts and propose a new metric to predict the quality of feature representations. Using automatic speech recognition (ASR) as a downstream task, we analyze the effect of model size, training objectives, and model architecture on the models' performance as a feature extractor for a set of topologically diverse corpora. We develop a novel metric, the Phonetic-Syntax Ratio (PSR), to measure the phonetic and synthetic information in the extracted representations using deep generalized canonical correlation analysis. Results show the contrastive loss in the wav2vec2.0 objective facilitates more effective cross-lingual feature extraction. There is a positive correlation between PSR scores and ASR performance, suggesting that phonetic information extracted by monolingual SSL models can be used for downstream tasks in cross-lingual settings. The proposed metric is an effective indicator of the quality of the representations and can be useful for model selection.
    摘要 在这项研究中,我们研究了英语自动学习(SSL)模型在跨语言上下文中提取的特征特性,并提出了一个新的度量来预测特征表示质量。使用自动语音识别(ASR)作为下游任务,我们分析了模型大小、训练目标和模型结构对模型作为特征提取器的表现的影响。我们开发了一种新的度量,即声音-语法比率(PSR),使用深度泛化相关分析来衡量提取的表示中的声音和 sintactic信息。结果显示了 contrastive loss 在 wav2vec2.0 目标中使得跨语言特征提取更加有效。我们发现 PSR 分数和 ASR 性能之间存在正相关关系,这表明了训练英语 SSL 模型可以在跨语言设置下提取有用的声音信息。我们提出的度量可以用于模型选择和评估特征表示质量。

Leveraging deep active learning to identify low-resource mobility functioning information in public clinical notes

  • paper_url: http://arxiv.org/abs/2311.15946
  • repo_url: None
  • paper_authors: Tuan-Dung Le, Zhuqi Miao, Samuel Alvarado, Brittany Smith, William Paiva, Thanh Thieu
  • for: 这项研究旨在促进临床自然语言处理中功能(functioning)信息的自动抽取与分析,以便更好地评估患者的整体健康。
  • methods: 研究基于 National NLP Clinical Challenges(n2c2)研究数据集,通过关键词扩展构建候选句子池,并使用按密度代表性加权的 query-by-committee 采样选取信息量大的句子进行人工标注;随后训练 BERT 与 CRF 模型,并利用模型预测来指导后续标注迭代中新句子的选取。
  • results: 最终数据集包含4,265个句子、共11,784个实体,其中 Action 实体5,511个、Mobility 实体5,328个、Assistance 实体306个、Quantification 实体639个。各实体类型平均的标注者间一致性(IAA)为:精确匹配0.72,部分匹配0.91。此外,研究还训练了常用的 BERT 模型与最先进的嵌套 NER 模型,最高 F1 分数分别为0.84(Action)、0.7(Mobility)、0.62(Assistance)、0.71(Quantification)。实验结果表明 NER 模型能够较准确地从临床文本中抽取 mobility 功能信息。
    Abstract Function is increasingly recognized as an important indicator of whole-person health, although it receives little attention in clinical natural language processing research. We introduce the first public annotated dataset specifically on the Mobility domain of the International Classification of Functioning, Disability and Health (ICF), aiming to facilitate automatic extraction and analysis of functioning information from free-text clinical notes. We utilize the National NLP Clinical Challenges (n2c2) research dataset to construct a pool of candidate sentences using keyword expansion. Our active learning approach, using query-by-committee sampling weighted by density representativeness, selects informative sentences for human annotation. We train BERT and CRF models, and use predictions from these models to guide the selection of new sentences for subsequent annotation iterations. Our final dataset consists of 4,265 sentences with a total of 11,784 entities, including 5,511 Action entities, 5,328 Mobility entities, 306 Assistance entities, and 639 Quantification entities. The inter-annotator agreement (IAA), averaged over all entity types, is 0.72 for exact matching and 0.91 for partial matching. We also train and evaluate common BERT models and state-of-the-art Nested NER models. The best F1 scores are 0.84 for Action, 0.7 for Mobility, 0.62 for Assistance, and 0.71 for Quantification. Empirical results demonstrate promising potential of NER models to accurately extract mobility functioning information from clinical text. The public availability of our annotated dataset will facilitate further research to comprehensively capture functioning information in electronic health records (EHRs).
    摘要 "功能"日益被视为全人健康的重要指标,但在临床自然语言处理研究中却少有关注。我们发布了首个专门针对国际功能、残疾和健康分类(ICF)中 Mobility 领域的公开标注数据集,旨在促进从自由文本临床记录中自动抽取和分析功能信息。我们利用 National NLP Clinical Challenges(n2c2)研究数据集,通过关键词扩展构建候选句子池。我们的主动学习方法采用按密度代表性加权的 query-by-committee 采样,选取信息量大的句子进行人工标注。我们训练 BERT 和 CRF 模型,并利用这些模型的预测来指导后续标注迭代中新句子的选取。最终数据集包含4,265个句子、共11,784个实体,其中 Action 实体5,511个、Mobility 实体5,328个、Assistance 实体306个、Quantification 实体639个。各实体类型平均的标注者间一致性(IAA)为:精确匹配0.72,部分匹配0.91。我们还训练并评估了常用的 BERT 模型和最先进的嵌套 NER 模型,最佳 F1 分数分别为0.84(Action)、0.7(Mobility)、0.62(Assistance)和0.71(Quantification)。实验结果表明 NER 模型在从临床文本中准确抽取 mobility 功能信息方面潜力可观。我们公开发布的标注数据集将促进在电子健康记录(EHR)中全面捕捉功能信息的后续研究。

Tell2Design: A Dataset for Language-Guided Floor Plan Generation

  • paper_url: http://arxiv.org/abs/2311.15941
  • repo_url: https://github.com/lengsicong/tell2design
  • paper_authors: Sicong Leng, Yang Zhou, Mohammed Haroon Dupty, Wee Sun Lee, Sam Conrad Joyce, Wei Lu
  • for: 这篇论文主要是为了研究如何使用自然语言描述生成建筑设计。
  • methods: 这篇论文使用了语言条件生成模型,并提出了一种新的序列到序列模型来解决这个问题。
  • results: 论文发布了一个大规模的户型图(floor plan)设计数据集(\textit{Tell2Design},包含超过8万份与自然语言说明配对的户型设计),并对生成样本进行了人工评估与分析。
    Abstract We consider the task of generating designs directly from natural language descriptions, and consider floor plan generation as the initial research area. Language conditional generative models have recently been very successful in generating high-quality artistic images. However, designs must satisfy different constraints that are not present in generating artistic images, particularly spatial and relational constraints. We make multiple contributions to initiate research on this task. First, we introduce a novel dataset, \textit{Tell2Design} (T2D), which contains more than $80k$ floor plan designs associated with natural language instructions. Second, we propose a Sequence-to-Sequence model that can serve as a strong baseline for future research. Third, we benchmark this task with several text-conditional image generation models. We conclude by conducting human evaluations on the generated samples and providing an analysis of human performance. We hope our contributions will propel the research on language-guided design generation forward.
    摘要 我们考虑直接从自然语言描述生成设计的任务,并以户型图(floor plan)生成作为初始研究领域。语言条件生成模型近来在生成高质量艺术图像方面非常成功;然而,设计必须满足生成艺术图像时不存在的约束,特别是空间约束和关系约束。我们为开启这一任务的研究做出了多项贡献:首先,我们发布了一个新数据集 Tell2Design(T2D),包含超过8万份与自然语言说明相关联的户型设计;其次,我们提出了一种可作为未来研究强基线的序列到序列模型;最后,我们用若干文本条件图像生成模型对该任务进行了基准测试。我们还对生成样本进行了人工评估并分析了人类表现。希望我们的贡献能推动语言引导设计生成的研究。

ChartLlama: A Multimodal LLM for Chart Understanding and Generation

  • paper_url: http://arxiv.org/abs/2311.16483
  • repo_url: None
  • paper_authors: Yucheng Han, Chi Zhang, Xin Chen, Xu Yang, Zhibin Wang, Gang Yu, Bin Fu, Hanwang Zhang
  • for: 这篇论文目标是提高多模态语言模型对图表figure的理解和解释能力,通过创建高质量的指令调整数据集和训练特定的模型 ChartLlama。
  • methods: 作者使用多步数据生成过程来生成多样化、高质量的指令调整数据集,并使用这个数据集来训练 ChartLlama。
  • results: ChartLlama 在 ChartQA、Chart-to-text 和 Chart-extraction 评估基准上全面超越了所有先前方法,并在专门构建、包含新图表与任务类型的图表数据集上显著优于基线。
    Abstract Multi-modal large language models have demonstrated impressive performances on most vision-language tasks. However, the model generally lacks the understanding capabilities for specific domain data, particularly when it comes to interpreting chart figures. This is mainly due to the lack of relevant multi-modal instruction tuning datasets. In this article, we create a high-quality instruction-tuning dataset leveraging GPT-4. We develop a multi-step data generation process in which different steps are responsible for generating tabular data, creating chart figures, and designing instruction tuning data separately. Our method's flexibility enables us to generate diverse, high-quality instruction-tuning data consistently and efficiently while maintaining a low resource expenditure. Additionally, it allows us to incorporate a wider variety of chart and task types not yet featured in existing datasets. Next, we introduce ChartLlama, a multi-modal large language model that we've trained using our created dataset. ChartLlama outperforms all prior methods in ChartQA, Chart-to-text, and Chart-extraction evaluation benchmarks. Additionally, ChartLlama significantly improves upon the baseline in our specially compiled chart dataset, which includes new chart and task types. The results of ChartLlama confirm the value and huge potential of our proposed data generation method in enhancing chart comprehension.
    摘要 多模态大语言模型在视觉语言任务中表现出色,但模型通常缺乏对特定领域数据的理解能力,尤其是在解读图表方面,这主要是因为缺乏相关的多模态指令微调数据集。在本文中,我们利用GPT-4创建了高质量的指令微调数据集。我们设计了多步数据生成流程,各步骤分别负责生成表格数据、创建图表和设计指令微调数据。该方法灵活高效,能够在较低资源消耗下持续生成多样、高质量的指令微调数据,并可纳入现有数据集中尚未出现的更多图表与任务类型。接下来,我们介绍基于该数据集训练的多模态大语言模型ChartLlama。ChartLlama在ChartQA、Chart-to-text和Chart-extraction评估基准上超过了所有先前方法,并在我们专门编制、包含新图表与任务类型的图表数据集上显著优于基线。ChartLlama的结果证明了我们所提数据生成方法在提升图表理解方面的价值与巨大潜力。
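
The multi-step generation loop can be sketched as below; in the paper GPT-4 drives the table, figure, and instruction steps, whereas here they are hard-coded placeholders.

```python
# Sketch: synthesize a table, render a chart, pair it with an instruction.
import json
import numpy as np
import matplotlib.pyplot as plt

def make_example(idx: int) -> dict:
    # Step 1: tabular data (stand-in for an LLM-generated table).
    years = list(range(2018, 2023))
    sales = np.random.randint(50, 150, size=len(years)).tolist()
    # Step 2: render the chart figure.
    fig, ax = plt.subplots()
    ax.bar(years, sales)
    ax.set_xlabel("Year"); ax.set_ylabel("Sales"); ax.set_title("Annual sales")
    fig.savefig(f"chart_{idx}.png"); plt.close(fig)
    # Step 3: instruction-tuning pair referencing the rendered image.
    return {"image": f"chart_{idx}.png",
            "instruction": "Which year had the highest sales?",
            "answer": str(years[int(np.argmax(sales))])}

with open("chart_instructions.jsonl", "w") as f:
    for i in range(3):
        f.write(json.dumps(make_example(i)) + "\n")
```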

Data Generation for Post-OCR correction of Cyrillic handwriting

  • paper_url: http://arxiv.org/abs/2311.15896
  • repo_url: https://github.com/dbrainio/cyrillichandwritingpoc
  • paper_authors: Evgenii Davydkin, Aleksandr Markelov, Egor Iuldashev, Anton Dudkin, Ivan Krivorotov
  • for: 这篇论文填补了手写西里尔文后OCR纠错(POC)研究中的一个重要空白:缺乏可用于训练基于语言的 POC 模型的大规模含 OCR 错误文本语料。
  • methods: 研究使用基于贝塞尔(B\'ezier)曲线的合成手写生成引擎,生成任意数量的高度逼真手写文本,并将其应用于从互联网获取的俄语语料;随后用手写文本识别(HTR)模型识别其中的 OCR 错误,作为 POC 模型训练的基础。
  • results: 该方法在 HWR200 和 School_notebooks_RU 数据集上进行评估,以词准确率(WAR)和字符准确率(CAR)衡量,结果表明 POC 模型能够较准确地纠正 OCR 错误。
    Abstract This paper introduces a novel approach to post-Optical Character Recognition Correction (POC) for handwritten Cyrillic text, addressing a significant gap in current research methodologies. This gap is due to the lack of large text corporas that provide OCR errors for further training of language-based POC models, which are demanding in terms of corpora size. Our study primarily focuses on the development and application of a synthetic handwriting generation engine based on B\'ezier curves. Such an engine generates highly realistic handwritten text in any amounts, which we utilize to create a substantial dataset by transforming Russian text corpora sourced from the internet. We apply a Handwritten Text Recognition (HTR) model to this dataset to identify OCR errors, forming the basis for our POC model training. The correction model is trained on a 90-symbol input context, utilizing a pre-trained T5 architecture with a seq2seq correction task. We evaluate our approach on HWR200 and School_notebooks_RU datasets as they provide significant challenges in the HTR domain. Furthermore, POC can be used to highlight errors for teachers, evaluating student performance. This can be done simply by comparing sentences before and after correction, displaying differences in text. Our primary contribution lies in the innovative use of B\'ezier curves for Cyrillic text generation and subsequent error correction using a specialized POC model. We validate our approach by presenting Word Accuracy Rate (WAR) and Character Accuracy Rate (CAR) results, both with and without post-OCR correction, using real open corporas of handwritten Cyrillic text. These results, coupled with our methodology, are designed to be reproducible, paving the way for further advancements in the field of OCR and handwritten text analysis. Paper contributions can be found in https://github.com/dbrainio/CyrillicHandwritingPOC
    摘要 Our approach uses a pre-trained T5 architecture with a seq2seq correction task to correct OCR errors. We evaluate our method on the HWR200 and School_notebooks_RU datasets, which provide significant challenges in the HTR domain. Additionally, POC can be used to highlight errors for teachers, evaluating student performance by comparing sentences before and after correction and displaying differences in text. Our primary contribution lies in the innovative use of B\'ezier curves for Cyrillic text generation and subsequent error correction using a specialized POC model. We validate our approach by presenting Word Accuracy Rate (WAR) and Character Accuracy Rate (CAR) results, both with and without post-OCR correction, using real open corpora of handwritten Cyrillic text. Our methodology and results are designed to be reproducible, paving the way for further advancements in the field of OCR and handwritten text analysis. For more information, please refer to our GitHub repository at https://github.com/dbrainio/CyrillicHandwritingPOC.
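
The generation engine's primitive is the cubic Bezier curve; a minimal evaluation routine is shown below (a real engine chains many jittered segments per glyph).

```python
# Minimal cubic Bezier stroke; randomizing control points varies the writing.
import numpy as np

def cubic_bezier(p0, p1, p2, p3, n=100):
    """Evaluate a cubic Bezier curve at n points; each p_i is (x, y)."""
    t = np.linspace(0.0, 1.0, n)[:, None]
    p0, p1, p2, p3 = map(np.asarray, (p0, p1, p2, p3))
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

# A single pen stroke as an (n, 2) array of points.
stroke = cubic_bezier((0, 0), (0.2, 1.0), (0.8, -0.5), (1.0, 0.3))
print(stroke.shape)   # (100, 2)
```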

Knowledge Unlearning for LLMs: Tasks, Methods, and Challenges

  • paper_url: http://arxiv.org/abs/2311.15766
  • repo_url: None
  • paper_authors: Nianwen Si, Hao Zhang, Heyu Chang, Wenlin Zhang, Dan Qu, Weiqiang Zhang
  • for: 这篇论文围绕大语言模型(LLM)的知识遗忘(knowledge unlearning)问题展开综述,旨在应对 LLM 在应用中可能保留的错误乃至有害知识。
  • methods: 论文将现有的知识遗忘方法分为三类:基于参数优化、基于参数合并和基于上下文学习的方法。这些方法可以高效地移除 LLM 中的有害知识,而不影响模型中的无关知识。
  • results: 论文给出了 LLM 时代知识遗忘问题的形式化定义,对现有方法进行了分类和梳理,介绍了已有的评测数据集,并总结了仍待解决的挑战与未来方向。
    Abstract In recent years, large language models (LLMs) have spurred a new research paradigm in natural language processing. Despite their excellent capability in knowledge-based question answering and reasoning, their potential to retain faulty or even harmful knowledge poses risks of malicious application. The challenge of mitigating this issue and transforming these models into purer assistants is crucial for their widespread applicability. Unfortunately, Retraining LLMs repeatedly to eliminate undesirable knowledge is impractical due to their immense parameters. Knowledge unlearning, derived from analogous studies on machine unlearning, presents a promising avenue to address this concern and is notably advantageous in the context of LLMs. It allows for the removal of harmful knowledge in an efficient manner, without affecting unrelated knowledge in the model. To this end, we provide a survey of knowledge unlearning in the era of LLMs. Firstly, we formally define the knowledge unlearning problem and distinguish it from related works. Subsequently, we categorize existing knowledge unlearning methods into three classes: those based on parameter optimization, parameter merging, and in-context learning, and introduce details of these unlearning methods. We further present evaluation datasets used in existing methods, and finally conclude this survey by presenting the ongoing challenges and future directions.
    摘要 近年来,大语言模型(LLM)引发了自然语言处理领域新的研究范式。尽管它们在基于知识的问答和推理方面表现出色,但其可能保留错误乃至有害知识,带来被恶意利用的风险。缓解这一问题、将这些模型转变为更纯粹的助手,对其广泛应用至关重要。然而,由于参数规模庞大,反复重训 LLM 以消除不良知识并不现实。源自机器遗忘(machine unlearning)相关研究的知识遗忘为此提供了一条有前景的途径,在 LLM 场景下尤具优势:它能够高效移除有害知识,而不影响模型中的无关知识。为此,我们对 LLM 时代的知识遗忘进行了综述。首先,我们给出知识遗忘问题的形式化定义,并将其与相关工作加以区分;随后,我们将现有的知识遗忘方法分为三类——基于参数优化、基于参数合并和基于上下文学习,并介绍这些方法的细节;我们进一步给出了现有方法所用的评测数据集;最后,我们以仍待解决的挑战和未来方向作结。

Justifiable Artificial Intelligence: Engineering Large Language Models for Legal Applications

  • paper_url: http://arxiv.org/abs/2311.15716
  • repo_url: None
  • paper_authors: Sabine Wehnert
  • for: 本研究探讨如何在法律领域应用大语言模型,并克服其当前的缺点。尽管这类模型取得了巨大成功并被广泛接受,但其缺乏可解释性使法律专家有理由不信任其输出。
  • methods: 本研究主张一种新的视角——可辩护人工智能(Justifiable AI),而非一味强调可解释人工智能(Explainable AI)。我们讨论如何通过收集支持或反驳大语言模型输出的证据,使其生成的文本更可信。
  • results: 本研究表明,为大语言模型的输出寻找支持或反驳证据,既能使其生成的文本更可信,也能在其传播错误信息时追究责任。
    Abstract In this work, I discuss how Large Language Models can be applied in the legal domain, circumventing their current drawbacks. Despite their large success and acceptance, their lack of explainability hinders legal experts to trust in their output, and this happens rightfully so. However, in this paper, I argue in favor of a new view, Justifiable Artificial Intelligence, instead of focusing on Explainable Artificial Intelligence. I discuss in this paper how gaining evidence for and against a Large Language Model's output may make their generated texts more trustworthy - or hold them accountable for misinformation.
    摘要 在这项工作中,我讨论了如何在法律领域应用大语言模型并规避其当前的缺点。尽管它们取得了巨大成功并被广泛接受,但其缺乏可解释性使法律专家无法信任其输出,而这种不信任是合理的。不过,在本文中,我主张一种新的视角:可辩护人工智能(Justifiable AI),而非专注于可解释人工智能(Explainable AI)。我在文中讨论了如何通过收集支持或反驳大语言模型输出的证据,使其生成的文本更可信,或在其传播错误信息时追究责任。

MoDS: Model-oriented Data Selection for Instruction Tuning

  • paper_url: http://arxiv.org/abs/2311.15653
  • repo_url: https://github.com/casia-lm/mods
  • paper_authors: Qianlong Du, Chengqing Zong, Jiajun Zhang
  • for: The paper aims to address the problem of selecting appropriate instruction data for fine-tuning large language models (LLMs) to improve their ability to follow user instructions.
  • methods: The authors propose a model-oriented data selection (MoDS) approach, which utilizes a quality evaluation model, a coverage-based algorithm, and a necessity evaluation model to select a small, high-quality, broad-coverage subset of instruction data from the original dataset.
  • results: The authors experimentally show that the model fine-tuned with 4,000 instruction pairs selected by their approach performs better than the model fine-tuned with the full original dataset, which includes 214k instruction data.
    Abstract Instruction tuning has become the de facto method to equip large language models (LLMs) with the ability of following user instructions. Usually, hundreds of thousands or millions of instruction-following pairs are employed to fine-tune the foundation LLMs. Recently, some studies show that a small number of high-quality instruction data is enough. However, how to select appropriate instruction data for a given LLM is still an open problem. To address this problem, in this paper we present a model-oriented data selection (MoDS) approach, which selects instruction data based on a new criteria considering three aspects: quality, coverage and necessity. First, our approach utilizes a quality evaluation model to filter out the high-quality subset from the original instruction dataset, and then designs an algorithm to further select from the high-quality subset a seed instruction dataset with good coverage. The seed dataset is applied to fine-tune the foundation LLM to obtain an initial instruction-following LLM. Finally, we develop a necessity evaluation model to find out the instruction data which are performed badly in the initial instruction-following LLM and consider them necessary instructions to further improve the LLMs. In this way, we can get a small high-quality, broad-coverage and high-necessity subset from the original instruction datasets. Experimental results show that, the model fine-tuned with 4,000 instruction pairs selected by our approach could perform better than the model fine-tuned with the full original dataset which includes 214k instruction data.
    摘要 指令微调已成为让大语言模型(LLM)具备遵循用户指令能力的事实标准方法,通常需要数十万乃至数百万条指令遵循对来微调基础 LLM。近来有研究表明,少量高质量指令数据即已足够;然而,如何为给定的 LLM 选择合适的指令数据仍是一个开放问题。为此,本文提出一种面向模型的数据选择(MoDS)方法,依据质量、覆盖度和必要性三方面的新标准选择指令数据。首先,该方法用质量评估模型从原始指令数据集中筛选出高质量子集,再设计算法从高质量子集中进一步选出覆盖度良好的种子指令数据集;用种子数据集微调基础 LLM,得到初步的指令遵循 LLM。最后,我们构建必要性评估模型,找出初步模型表现不佳的指令数据,将其视为进一步改进 LLM 所必需的指令。由此,我们可以从原始指令数据集中获得一个小规模、高质量、广覆盖且高必要性的子集。实验结果表明,用本方法选出的4,000条指令对微调的模型,表现优于用包含21.4万条指令数据的完整原始数据集微调的模型。
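
A sketch of the three MoDS stages on embedded instructions — quality filtering, coverage-oriented seed selection (here greedy k-center), and a necessity pass; the scores and embeddings below are placeholders for the paper's learned models.

```python
# Quality filter -> coverage-based seed set -> necessity-based additions.
import numpy as np

def kcenter_greedy(X: np.ndarray, k: int) -> list[int]:
    """Pick k points that cover the embedding space (max-min distance)."""
    chosen = [0]
    d = np.linalg.norm(X - X[0], axis=1)
    while len(chosen) < k:
        i = int(d.argmax())
        chosen.append(i)
        d = np.minimum(d, np.linalg.norm(X - X[i], axis=1))
    return chosen

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 32))            # instruction embeddings
quality = rng.uniform(size=1000)             # quality-model scores (placeholder)

good = np.where(quality > 0.7)[0]            # stage 1: quality filter
seed = good[kcenter_greedy(emb[good], k=50)] # stage 2: coverage-based seed set

# Stage 3 (after fine-tuning on the seed set): add instructions the initial
# model still handles badly, e.g. those with the highest necessity scores.
necessity = rng.uniform(size=1000)           # placeholder necessity scores
final = set(seed.tolist()) | set(np.argsort(necessity)[-20:].tolist())
print(len(final))
```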

InfoPattern: Unveiling Information Propagation Patterns in Social Media

  • paper_url: http://arxiv.org/abs/2311.15642
  • repo_url: None
  • paper_authors: Chi Han, Jialiang Xu, Manling Li, Hanning Zhang, Tarek Abdelzaher, Heng Ji
  • for: 本研究旨在探讨社交媒体如何影响公众意识形态,以及语言与人类思想之间的交互关系。
  • methods: 该研究使用了红队演练来模拟敌对意见社区的反应,以及姿态检测来发现每条消息的政治情感。同时,研究还使用信息宣传图表发现各个社区之间的信息传播关系。
  • results: 研究发现,社交媒体可以帮助人们建立和维护对opposite ideology社区的情感连接,并且可以帮助人们更好地理解和了解对手的思想。同时,研究还发现了一些情感和信息传播的 Pattern ,可以帮助人们更好地理解社交媒体上的情感和信息传播。
    Abstract Social media play a significant role in shaping public opinion and influencing ideological communities through information propagation. Our demo InfoPattern centers on the interplay between language and human ideology. The demo (Code: https://github.com/blender-nlp/InfoPattern ) is capable of: (1) red teaming to simulate adversary responses from opposite ideology communities; (2) stance detection to identify the underlying political sentiments in each message; (3) information propagation graph discovery to reveal the evolution of claims across various communities over time. (Live Demo: https://incas.csl.illinois.edu/blender/About )
    摘要 社交媒体通过信息传播在塑造公众意见和影响意识形态社区方面发挥着重要作用。我们的演示系统 InfoPattern 关注语言与人类意识形态之间的相互作用。该演示(代码:https://github.com/blender-nlp/InfoPattern)能够:(1)红队演练,模拟来自对立意识形态社区的回应;(2)立场检测,识别每条消息背后的政治情感;(3)信息传播图发现,揭示各类主张随时间在不同社区间的演变。(在线演示:https://incas.csl.illinois.edu/blender/About)
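
Stance detection of the kind the demo performs can be approximated with an off-the-shelf zero-shot classifier; the model and label set below are illustrative, not the demo's actual components.

```python
# Zero-shot stance detection over a social-media message.
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")
message = "This policy will finally bring jobs back to our town."
result = classifier(message,
                    candidate_labels=["supportive", "opposed", "neutral"])
print(result["labels"][0], result["scores"][0])   # top predicted stance
```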

The WebCrow French Crossword Solver

  • paper_url: http://arxiv.org/abs/2311.15626
  • repo_url: None
  • paper_authors: Giovanni Angelini, Marco Ernandes, Tommaso laquinta, Caroline Stehlé, Fanny Simões, Kamyar Zeinalipour, Andrea Zugarini, Marco Gori
  • for: 这篇论文旨在开发自动填字游戏求解程序 WebCrow 2.0,并将其扩展到法语,使之成为首个求解法语填字游戏的程序。
  • methods: 该程序使用多个称为"专家"的模块,从网页、知识图谱和语言规则等异构资源中检索候选答案,以应对缺乏大规模线索-答案填字数据的问题。
  • results: 在两场挑战赛中,法语 WebCrow 表现出竞争力,在速度和准确率上甚至超过人类,证明其能够推广到新语言。
    Abstract Crossword puzzles are one of the most popular word games, played in different languages all across the world, where riddle style can vary significantly from one country to another. Automated crossword resolution is challenging, and typical solvers rely on large databases of previously solved crosswords. In this work, we extend WebCrow 2.0, an automatic crossword solver, to French, making it the first program for crossword solving in the French language. To cope with the lack of a large repository of clue-answer crossword data, WebCrow 2.0 exploits multiple modules, called experts, that retrieve candidate answers from heterogeneous resources, such as the web, knowledge graphs, and linguistic rules. We compared WebCrow's performance against humans in two different challenges. Despite the limited amount of past crosswords, French WebCrow was competitive, actually outperforming humans in terms of speed and accuracy, thus proving its capabilities to generalize to new languages.
    摘要 填字游戏是全球最受欢迎的文字游戏之一,在不同语言中谜面风格差异显著。自动求解填字游戏颇具挑战,典型求解器依赖大规模的已解填字题库。在本工作中,我们将自动填字求解器 WebCrow 2.0 扩展到法语,使其成为首个求解法语填字游戏的程序。为应对缺乏大规模线索-答案填字数据的问题,WebCrow 2.0 使用多个称为"专家"的模块,从网页、知识图谱和语言规则等异构资源中检索候选答案。我们在两场不同的挑战赛中将 WebCrow 与人类对比:尽管可用的历史填字数据有限,法语 WebCrow 仍具竞争力,在速度和准确率上实际超过了人类,证明了其推广到新语言的能力。

FreeAL: Towards Human-Free Active Learning in the Era of Large Language Models

  • paper_url: http://arxiv.org/abs/2311.15614
  • repo_url: None
  • paper_authors: Ruixuan Xiao, Yiwen Dong, Junbo Zhao, Runze Wu, Minmin Lin, Gang Chen, Haobo Wang
  • for: 提高Zero-shot表现,不需要人工监督
  • methods: 提出协同学习框架 FreeAL:LLM 充当主动标注器,灌输其粗粒度的任务知识;下游小语言模型(SLM)作为学生,筛选出高质量的上下文样本反馈给 LLM,用于后续的标签精炼
  • results: 实验表明,FreeAL可以大幅提高zero-shot表现,无需人工监督。
    Abstract Collecting high-quality labeled data for model training is notoriously time-consuming and labor-intensive for various NLP tasks. While copious solutions, such as active learning for small language models (SLMs) and prevalent in-context learning in the era of large language models (LLMs), have been proposed and alleviate the labeling burden to some extent, their performances are still subject to human intervention. It is still underexplored how to reduce the annotation cost in the LLMs era. To bridge this, we revolutionize traditional active learning and propose an innovative collaborative learning framework FreeAL to interactively distill and filter the task-specific knowledge from LLMs. During collaborative training, an LLM serves as an active annotator inculcating its coarse-grained knowledge, while a downstream SLM is incurred as a student to filter out high-quality in-context samples to feedback LLM for the subsequent label refinery. Extensive experiments on eight benchmark datasets demonstrate that FreeAL largely enhances the zero-shot performances for both SLM and LLM without any human supervision. The code is available at https://github.com/Justherozen/FreeAL .
    摘要 为各类 NLP 任务收集高质量标注数据向来费时费力。虽然小语言模型(SLM)的主动学习、大语言模型(LLM)时代盛行的上下文学习等大量方案已被提出并在一定程度上减轻了标注负担,但其性能仍依赖人工干预;如何在 LLM 时代降低标注成本仍有待探索。为此,我们革新传统的主动学习,提出创新的协同学习框架 FreeAL,以交互方式从 LLM 中蒸馏并筛选任务相关知识。在协同训练中,LLM 充当主动标注器,灌输其粗粒度知识;下游 SLM 作为学生,筛选出高质量的上下文样本反馈给 LLM,用于后续的标签精炼。在八个基准数据集上的大量实验表明,FreeAL 在无任何人工监督的情况下大幅提升了 SLM 和 LLM 的零样本性能。代码可在 https://github.com/Justherozen/FreeAL 获取。
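
One FreeAL round, sketched at a high level; `llm_annotate` and `train_slm` are placeholder callables for the paper's actual annotator and student training, and the returned student is assumed to expose a `confidence` method for the sample-filtering step.

```python
# High-level sketch of one FreeAL collaborative-training round.
def freeal_round(unlabeled, llm_annotate, train_slm, n_demos=8):
    # LLM as active annotator: coarse labels for the pool.
    labeled = [(x, llm_annotate(x)) for x in unlabeled]
    # Student SLM trains on the (noisy) LLM labels.
    slm = train_slm(labeled)
    # SLM filters high-confidence, likely-clean samples...
    scored = sorted(labeled, key=lambda xy: slm.confidence(*xy), reverse=True)
    demos = scored[:n_demos]
    # ...which feed back as in-context demonstrations for label refinery.
    refined = [(x, llm_annotate(x, demonstrations=demos)) for x in unlabeled]
    return slm, refined
```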

Can Vision-Language Models Think from a First-Person Perspective?

  • paper_url: http://arxiv.org/abs/2311.15596
  • repo_url: https://github.com/jettbrains/-L-
  • paper_authors: Sijie Cheng, Zhicheng Guo, Jingwen Wu, Kechen Fang, Peng Li, Huaping Liu, Yang Liu
  • for: Assessing whether vision-language models (VLMs), beyond their performance on conventional downstream tasks, can reason from a first-person perspective.
  • methods: Constructs EgoThink, a visual question-answering benchmark covering six core capabilities across twelve dimensions, built from selected egocentric video clips with manually annotated question-answer pairs.
  • results: Evaluating eighteen popular VLMs on EgoThink shows that, although GPT-4V leads in numerous dimensions, all evaluated VLMs still have considerable room for improvement on first-person-perspective tasks.
    Abstract Vision-language models (VLMs) have recently shown promising results in traditional downstream tasks. Evaluation studies have emerged to assess their abilities, with the majority focusing on the third-person perspective, and only a few addressing specific tasks from the first-person perspective. However, the capability of VLMs to "think" from a first-person perspective, a crucial attribute for advancing autonomous agents and robotics, remains largely unexplored. To bridge this research gap, we introduce EgoThink, a novel visual question-answering benchmark that encompasses six core capabilities with twelve detailed dimensions. The benchmark is constructed using selected clips from egocentric videos, with manually annotated question-answer pairs containing first-person information. To comprehensively assess VLMs, we evaluate eighteen popular VLMs on EgoThink. Moreover, given the open-ended format of the answers, we use GPT-4 as the automatic judge to compute single-answer grading. Experimental results indicate that although GPT-4V leads in numerous dimensions, all evaluated VLMs still possess considerable potential for improvement in first-person perspective tasks. Meanwhile, enlarging the number of trainable parameters has the most significant impact on model performance on EgoThink. In conclusion, EgoThink serves as a valuable addition to existing evaluation benchmarks for VLMs, providing an indispensable resource for future research in the realm of embodied artificial intelligence and robotics.
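Since the answers are open-ended, grading is delegated to GPT-4. The snippet below is a hedged sketch of such single-answer grading via the OpenAI chat API; the prompt wording and the 0-5 scale are assumptions for illustration, not the benchmark's exact protocol, and an OPENAI_API_KEY must be set in the environment.

```python
# Hedged sketch of LLM-as-judge grading; prompt and scale are illustrative.
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY

def judge(question: str, reference: str, prediction: str) -> int:
    prompt = (
        "Rate how well the prediction answers the first-person question, "
        "given the reference answer. Reply with a single integer from 0 to 5.\n"
        f"Question: {question}\nReference: {reference}\nPrediction: {prediction}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())

print(judge("What am I holding?", "a red mug", "You are holding a mug."))
```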

SpotServe: Serving Generative Large Language Models on Preemptible Instances

  • paper_url: http://arxiv.org/abs/2311.15566
  • repo_url: https://github.com/hsword/spotserve
  • paper_authors: Xupeng Miao, Chunan Shi, Jiangfei Duan, Xiaoli Xi, Dahua Lin, Bin Cui, Zhihao Jia
  • for: Reducing the monetary cost of serving generative large language models (LLMs).
  • methods: Serves LLMs on cheap preemptible GPU instances on modern clouds, dynamically adapting the parallelization configuration and re-assigning instances to cope with frequent preemptions.
  • results: Compared with the best existing LLM serving systems, SpotServe reduces P99 tail latency by 2.4-9.1x and, on real preemption traces and various popular LLMs, saves 54% in monetary cost relative to using only on-demand instances.
    Abstract The high computational and memory requirements of generative large language models (LLMs) make it challenging to serve them cheaply. This paper aims to reduce the monetary cost for serving LLMs by leveraging preemptible GPU instances on modern clouds, which offer accesses to spare GPUs at a much cheaper price than regular instances but may be preempted by the cloud at any time. Serving LLMs on preemptible instances requires addressing challenges induced by frequent instance preemptions and the necessity of migrating instances to handle these preemptions. This paper presents SpotServe, the first distributed LLM serving system on preemptible instances. Several key techniques in SpotServe realize fast and reliable serving of generative LLMs on cheap preemptible instances. First, SpotServe dynamically adapts the LLM parallelization configuration for dynamic instance availability and fluctuating workload, while balancing the trade-off among the overall throughput, inference latency and monetary costs. Second, to minimize the cost of migrating instances for dynamic reparallelization, the task of migrating instances is formulated as a bipartite graph matching problem, which uses the Kuhn-Munkres algorithm to identify an optimal migration plan that minimizes communications. Finally, to take advantage of the grace period offered by modern clouds, we introduce stateful inference recovery, a new inference mechanism that commits inference progress at a much finer granularity and allows SpotServe to cheaply resume inference upon preemption. We evaluate on real spot instance preemption traces and various popular LLMs and show that SpotServe can reduce the P99 tail latency by 2.4 - 9.1x compared with the best existing LLM serving systems. We also show that SpotServe can leverage the price advantage of preemptive instances, saving 54% monetary cost compared with only using on-demand instances.
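The migration step is a standard assignment problem, so a minimal version fits in a few lines. Here the Kuhn-Munkres (Hungarian) algorithm, via scipy, picks the reassignment of pipeline stages to surviving instances that minimizes total state movement; the cost matrix is a made-up example, not SpotServe's actual cost model.

```python
# Migration as bipartite matching; the costs are an illustrative toy.
import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i][j]: bytes of model state moved if the stage now on instance i goes to instance j.
cost = np.array([
    [0,  8, 12],
    [8,  0, 12],
    [12, 9,  0],
])
rows, cols = linear_sum_assignment(cost)      # Kuhn-Munkres / Hungarian algorithm
plan = list(zip(rows.tolist(), cols.tolist()))
print("migration plan:", plan, "| total bytes moved:", cost[rows, cols].sum())
```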

Boot and Switch: Alternating Distillation for Zero-Shot Dense Retrieval

  • paper_url: http://arxiv.org/abs/2311.15564
  • repo_url: https://github.com/fantabulous-j/bootswitch
  • paper_authors: Fan Jiang, Qiongkai Xu, Tom Drummond, Trevor Cohn
  • for: Improving passage-retrieval performance in zero-shot settings.
  • methods: A simple yet effective unsupervised method: a dense retriever learns from supervision signals provided by a reranker, and the reranker is then updated based on feedback from the improved retriever, with the loop iterated so the two components mutually enhance one another.
  • results: The unsupervised $\texttt{ABEL}$ model outperforms leading supervised and unsupervised retrievers on the BEIR benchmark and adapts well to unseen tasks and domains; fine-tuning $\texttt{ABEL}$ on labelled data or integrating it with existing supervised dense retrievers yields state-of-the-art results.
    Abstract Neural 'dense' retrieval models are state of the art for many datasets, however these models often exhibit limited domain transfer ability. Existing approaches to adaptation are unwieldy, such as requiring explicit supervision, complex model architectures, or massive external models. We present $\texttt{ABEL}$, a simple but effective unsupervised method to enhance passage retrieval in zero-shot settings. Our technique follows a straightforward loop: a dense retriever learns from supervision signals provided by a reranker, and subsequently, the reranker is updated based on feedback from the improved retriever. By iterating this loop, the two components mutually enhance one another's performance. Experimental results demonstrate that our unsupervised $\texttt{ABEL}$ model outperforms both leading supervised and unsupervised retrievers on the BEIR benchmark. Meanwhile, it exhibits strong adaptation abilities to tasks and domains that were unseen during training. By either fine-tuning $\texttt{ABEL}$ on labelled data or integrating it with existing supervised dense retrievers, we achieve state-of-the-art results.\footnote{Source code is available at \url{https://github.com/Fantabulous-J/BootSwitch}.}

Noisy Self-Training with Synthetic Queries for Dense Retrieval

  • paper_url: http://arxiv.org/abs/2311.15563
  • repo_url: https://github.com/fantabulous-j/self-training-dpr
  • paper_authors: Fan Jiang, Tom Drummond, Trevor Cohn
  • for: Improving neural retrieval models without relying on costly high-quality annotated data.
  • methods: A noisy self-training framework combined with synthetic queries, requiring no external models.
  • results: The method improves consistently over existing approaches on both general-domain (e.g., MS-MARCO) and out-of-domain (i.e., BEIR) retrieval benchmarks; it is data-efficient, outperforming competitive baselines with as little as 30% of the labelled training data, and extending the framework to reranker training yields additional gains on tasks of diverse domains.
    Abstract Although existing neural retrieval models reveal promising results when training data is abundant and the performance keeps improving as training data increases, collecting high-quality annotated data is prohibitively costly. To this end, we introduce a novel noisy self-training framework combined with synthetic queries, showing that neural retrievers can be improved in a self-evolution manner with no reliance on any external models. Experimental results show that our method improves consistently over existing methods on both general-domain (e.g., MS-MARCO) and out-of-domain (i.e., BEIR) retrieval benchmarks. Extra analysis on low-resource settings reveals that our method is data efficient and outperforms competitive baselines, with as little as 30% of labelled training data. Further extending the framework for reranker training demonstrates that the proposed method is general and yields additional gains on tasks of diverse domains.\footnote{Source code is available at \url{https://github.com/Fantabulous-J/Self-Training-DPR}}

The effect of source disclosure on evaluation of AI-generated messages: A two-part study

  • paper_url: http://arxiv.org/abs/2311.15544
  • repo_url: None
  • paper_authors: Sue Lim, Ralf Schmälzle
  • for: Investigating whether disclosing the source (AI vs. human) of a message influences people's evaluation of and preference for AI-generated messages.
  • methods: Two studies using vaping-prevention messages generated by large language models (LLMs), with the source of each message labeled for participants.
  • results: Source disclosure significantly impacted the evaluation of the health prevention messages but did not significantly alter message rankings; negative attitudes towards AI moderated the effect on evaluation, and for participants with moderate levels of such attitudes, disclosure decreased the preference for AI-generated messages.
    Abstract Advancements in artificial intelligence (AI) over the last decade demonstrate that machines can exhibit communicative behavior and influence how humans think, feel, and behave. In fact, the recent development of ChatGPT has shown that large language models (LLMs) can be leveraged to generate high-quality communication content at scale and across domains, suggesting that they will be increasingly used in practice. However, many questions remain about how knowing the source of the messages influences recipients' evaluation of and preference for AI-generated messages compared to human-generated messages. This paper investigated this topic in the context of vaping prevention messaging. In Study 1, which was pre-registered, we examined the influence of source disclosure on people's evaluation of AI-generated health prevention messages compared to human-generated messages. We found that source disclosure (i.e., labeling the source of a message as AI vs. human) significantly impacted the evaluation of the messages but did not significantly alter message rankings. In a follow-up study (Study 2), we examined how the influence of source disclosure may vary by the participants' negative attitudes towards AI. We found a significant moderating effect of negative attitudes towards AI on message evaluation, but not for message selection. However, for those with moderate levels of negative attitudes towards AI, source disclosure decreased the preference for AI-generated messages. Overall, the results of this series of studies showed a slight bias against AI-generated messages once the source was disclosed, adding to the emerging area of study that lies at the intersection of AI and communication.

Overview of the VLSP 2022 – Abmusu Shared Task: A Data Challenge for Vietnamese Abstractive Multi-document Summarization

  • paper_url: http://arxiv.org/abs/2311.15525
  • repo_url: None
  • paper_authors: Mai-Vu Tran, Hoang-Quynh Le, Duy-Cat Can, Quoc-An Nguyen
  • for: Reports an overview of the Abmusu shared task on Vietnamese abstractive multi-document summarization, hosted at the 9th annual workshop on Vietnamese Language and Speech Processing (VLSP 2022).
  • methods: Systems receive multiple news documents on the same topic as input and must automatically generate a related abstractive summary; a human-annotated dataset of 1,839 documents in 600 clusters across 8 categories supports the task.
  • results: Participating models are evaluated and ranked by \texttt{ROUGE2-F1} score.
    Abstract This paper reports the overview of the VLSP 2022 - Vietnamese abstractive multi-document summarization (Abmusu) shared task for Vietnamese News. This task is hosted at the 9$^{th}$ annual workshop on Vietnamese Language and Speech Processing (VLSP 2022). The goal of Abmusu shared task is to develop summarization systems that could create abstractive summaries automatically for a set of documents on a topic. The model input is multiple news documents on the same topic, and the corresponding output is a related abstractive summary. In the scope of Abmusu shared task, we only focus on Vietnamese news summarization and build a human-annotated dataset of 1,839 documents in 600 clusters, collected from Vietnamese news in 8 categories. Participated models are evaluated and ranked in terms of \texttt{ROUGE2-F1} score, the most typical evaluation metric for document summarization problem.
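For reference, ROUGE-2 F1 can be computed with the rouge-score package as below. Note the package's default tokenizer is English-oriented, so scoring Vietnamese properly would require appropriate tokenization; the texts here are illustrative only.

```python
# pip install rouge-score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=False)
reference = "the cabinet approved the new traffic law on monday"
candidate = "the new traffic law was approved on monday"
s = scorer.score(reference, candidate)["rouge2"]
print(f"ROUGE-2  P={s.precision:.3f}  R={s.recall:.3f}  F1={s.fmeasure:.3f}")
```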

A Comparative and Experimental Study on Automatic Question Answering Systems and its Robustness against Word Jumbling

  • paper_url: http://arxiv.org/abs/2311.15513
  • repo_url: None
  • paper_authors: Shashidhar Reddy Javaji, Haoran Hu, Sai Sameer Vennam, Vijaya Gajanan Buddhavarapu
  • for: The paper aims to improve the performance of question-answer generation models by addressing the issue of human error in the training data.
  • methods: The authors use natural language processing techniques to analyze the data and identify sources of human error, then propose a method to mitigate the effects of these errors.
  • results: The authors evaluate their method on several benchmark datasets and show that it can improve the accuracy of question-answer generation models, leading to better performance in real-world applications.
    Abstract Question answer generation using Natural Language Processing models is ubiquitous in the world around us. It is used in many scenarios, such as building chatbots, suggestive prompts in Google search, and navigating information in banking mobile applications. It is highly relevant because a frequently asked questions (FAQ) list can only contain a finite number of questions, whereas a model that can perform question answer generation can answer completely new questions within the scope of the data. This allows new questions to be answered accurately as long as they are relevant. In commercial applications, it can increase customer satisfaction and ease of use. However, much of the underlying data is generated by humans and is therefore susceptible to human error, which can adversely affect a model's performance; we investigate this effect in our work.
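A word-jumbling perturbation of the kind used to probe robustness can be written in a few lines; the variant below shuffles interior letters while keeping the first and last characters fixed, which is an assumption about the paper's exact perturbation.

```python
import random

def jumble(text: str, seed: int = 0) -> str:
    """Shuffle interior letters of each word, keeping first and last characters fixed."""
    rng = random.Random(seed)
    out = []
    for w in text.split():
        if len(w) > 3:
            mid = list(w[1:-1])
            rng.shuffle(mid)
            w = w[0] + "".join(mid) + w[-1]
        out.append(w)
    return " ".join(out)

print(jumble("What is the capital city of France?"))
# e.g. "Waht is the citpaal ctiy of Facnre?" -- feed both versions to the QA model
```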

A Corpus for Named Entity Recognition in Chinese Novels with Multi-genres

  • paper_url: http://arxiv.org/abs/2311.15509
  • repo_url: None
  • paper_authors: Hanjie Zhao, Jinge Xie, Yuchen Yan, Yuxiang Jia, Yawen Ye, Hongying Zan
  • for: Advancing research on literary named entity recognition (NER) by building a large multi-genre literary NER corpus.
  • methods: Proposes several baseline NER models and conducts cross-genre and cross-domain experiments on the corpus.
  • results: Genre differences significantly impact NER performance, though less so than domain differences such as literary vs. news; literary NER still needs much improvement, and the out-of-vocabulary (OOV) problem is more challenging due to the high variety of entities in literary works.
    Abstract Entities like person, location, organization are important for literary text analysis. The lack of annotated data hinders the progress of named entity recognition (NER) in literary domain. To promote the research of literary NER, we build the largest multi-genre literary NER corpus containing 263,135 entities in 105,851 sentences from 260 online Chinese novels spanning 13 different genres. Based on the corpus, we investigate characteristics of entities from different genres. We propose several baseline NER models and conduct cross-genre and cross-domain experiments. Experimental results show that genre difference significantly impact NER performance though not as much as domain difference like literary domain and news domain. Compared with NER in news domain, literary NER still needs much improvement and the Out-of-Vocabulary (OOV) problem is more challenging due to the high variety of entities in literary works.

Function-constrained Program Synthesis

  • paper_url: http://arxiv.org/abs/2311.15500
  • repo_url: None
  • paper_authors: Patrick Hajali, Ignas Budvytis
  • for: Enabling large language models (LLMs) to leverage user-provided code when solving programming tasks, and to recover from failed attempts via automatically generated modular sub-functions.
  • methods: (1) constrains code generation to an explicit function set built from the user-provided code; (2) when the LLM cannot produce working code, generates modular sub-functions that aid subsequent attempts; (3) introduces a "half-shot" evaluation paradigm that encourages structured outputs and gives tighter estimates of LLMs' coding abilities than zero-shot evaluation.
  • results: The approach improves code generation and, as a by-product, yields a library of reusable sub-functions that can solve related tasks, imitating a software team where efficiency scales with experience.
    Abstract This work introduces (1) a technique that allows large language models (LLMs) to leverage user-provided code when solving programming tasks and (2) a method to iteratively generate modular sub-functions that can aid future code generation attempts when the initial code generated by the LLM is inadequate. Generating computer programs in general-purpose programming languages like Python poses a challenge for LLMs when instructed to use code provided in the prompt. Code-specific LLMs (e.g., GitHub Copilot, CodeLlama2) can generate code completions in real-time by drawing on all code available in a development environment. However, restricting code-specific LLMs to use only in-context code is not straightforward, as the model is not explicitly instructed to use the user-provided code and users cannot highlight precisely which snippets of code the model should incorporate into its context. Moreover, current systems lack effective recovery methods, forcing users to iteratively re-prompt the model with modified prompts until a sufficient solution is reached. Our method differs from traditional LLM-powered code-generation by constraining code-generation to an explicit function set and enabling recovery from failed attempts through automatically generated sub-functions. When the LLM cannot produce working code, we generate modular sub-functions to aid subsequent attempts at generating functional code. A by-product of our method is a library of reusable sub-functions that can solve related tasks, imitating a software team where efficiency scales with experience. We also introduce a new "half-shot" evaluation paradigm that provides tighter estimates of LLMs' coding abilities compared to traditional zero-shot evaluation. Our proposed evaluation method encourages models to output solutions in a structured format, decreasing syntax errors that can be mistaken for poor coding ability.

cs.LG - 2023-11-27

Learning Multimodal Latent Dynamics for Human-Robot Interaction

  • paper_url: http://arxiv.org/abs/2311.16380
  • repo_url: None
  • paper_authors: Vignesh Prasad, Lea Heitlinger, Dorothea Koert, Ruth Stock-Homburg, Jan Peters, Georgia Chalvatzaki
  • for: Learning well-coordinated human-robot interaction (HRI) from human-human interactions (HHI).
  • methods: A hybrid approach using Hidden Markov Models (HMMs) as latent-space priors for a Variational Autoencoder that models the joint distribution over the interacting agents; interaction dynamics learned from HHI are leveraged for HRI, and conditional generation of robot motions from human observations is incorporated into training.
  • results: The method predicts more accurate robot trajectories that adapt to human observations; for contact-rich interactions, HMM segmentation modulates the robot's stiffness for compliant interaction. In a user study, users perceived the method as more human-like, timely, and accurate, ranking it above other baselines, and it generalized to various humans despite being trained on data from just two.
    Abstract This article presents a method for learning well-coordinated Human-Robot Interaction (HRI) from Human-Human Interactions (HHI). We devise a hybrid approach using Hidden Markov Models (HMMs) as the latent space priors for a Variational Autoencoder to model a joint distribution over the interacting agents. We leverage the interaction dynamics learned from HHI to learn HRI and incorporate the conditional generation of robot motions from human observations into the training, thereby predicting more accurate robot trajectories. The generated robot motions are further adapted with Inverse Kinematics to ensure the desired physical proximity with a human, combining the ease of joint space learning and accurate task space reachability. For contact-rich interactions, we modulate the robot's stiffness using HMM segmentation for a compliant interaction. We verify the effectiveness of our approach deployed on a Humanoid robot via a user study. Our method generalizes well to various humans despite being trained on data from just two humans. We find that Users perceive our method as more human-like, timely, and accurate and rank our method with a higher degree of preference over other baselines.

Bayesian Formulations for Graph Spectral Denoising

  • paper_url: http://arxiv.org/abs/2311.16378
  • repo_url: None
  • paper_authors: Sam Leone, Xingzhi Sun, Michael Perlmutter, Smita Krishnaswamy
  • for: Denoising signals defined on the vertices of a graph, under Gaussian, dropout, and uniformly distributed noise.
  • methods: Pairs a prior distribution defined in the frequency domain, which favors signals that are smooth across the edges of the graph, with three noise-generation models to derive maximum a posteriori (M.A.P.) estimates of the true signal, and provides algorithms for computing them.
  • results: The algorithms effectively restore white noise on image data and recover from severe dropout in toy and EHR data.
    Abstract We consider noisy signals which are defined on the vertices of a graph and present smoothing algorithms for the cases of Gaussian, dropout, and uniformly distributed noise. The signals are assumed to follow a prior distribution defined in the frequency domain which favors signals which are smooth across the edges of the graph. By pairing this prior distribution with our three models of noise generation, we propose \textit{Maximum A Posteriori} (M.A.P.) estimates of the true signal in the presence of noisy data and provide algorithms for computing the M.A.P. Finally, we demonstrate the algorithms' ability to effectively restore white noise on image data, and from severe dropout in toy \& EHR data.
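For the Gaussian case, the M.A.P. estimate has a closed form: with a smoothness prior proportional to exp(-lam * x^T L x) and likelihood y = x + noise, it solves (I + lam * L) x = y, with the noise variance absorbed into lam. The sketch below applies this on a path graph; the dropout and uniform-noise models in the paper lead to different objectives not shown here.

```python
import numpy as np

n = 50
A = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)   # path-graph adjacency
L = np.diag(A.sum(axis=1)) - A                                 # combinatorial Laplacian

x_true = np.sin(np.linspace(0, 3 * np.pi, n))                  # smooth signal on the vertices
y = x_true + 0.4 * np.random.default_rng(0).normal(size=n)     # Gaussian noise

lam = 2.0                                                      # prior strength
x_map = np.linalg.solve(np.eye(n) + lam * L, y)                # closed-form M.A.P. estimate
print("noisy MSE:   ", np.mean((y - x_true) ** 2))
print("denoised MSE:", np.mean((x_map - x_true) ** 2))
```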

Physics-Informed Neural Network for Discovering Systems with Unmeasurable States with Application to Lithium-Ion Batteries

  • paper_url: http://arxiv.org/abs/2311.16374
  • repo_url: None
  • paper_authors: Yuichi Kajiura, Jorge Espin, Dong Zhang
  • for: Improving the training of physics-informed neural networks (PINNs) for systems with unmeasurable states, such as lithium-ion batteries (LiBs).
  • methods: A robust PINN training method with fewer loss terms: instead of one loss term per differential equation, the dynamics are embedded into a single loss quantifying the error between observed and predicted system outputs, obtained by numerically integrating the NN-predicted states with the known dynamics; system parameters can be added to the optimization targets.
  • results: Applied to a battery model, the method concurrently estimates the battery's states and parameters.
    Abstract Combining machine learning with physics is a trending approach for discovering unknown dynamics, and one of the most intensively studied frameworks is the physics-informed neural network (PINN). However, PINN often fails to optimize the network due to its difficulty in concurrently minimizing multiple losses originating from the system's governing equations. This problem can be more serious when the system's states are unmeasurable, like lithium-ion batteries (LiBs). In this work, we introduce a robust method for training PINN that uses fewer loss terms and thus constructs a less complex landscape for optimization. In particular, instead of having loss terms from each differential equation, this method embeds the dynamics into a loss function that quantifies the error between observed and predicted system outputs. This is accomplished by numerically integrating the predicted states from the neural network(NN) using known dynamics and transforming them to obtain a sequence of predicted outputs. Minimizing such a loss optimizes the NN to predict states consistent with observations given the physics. Further, the system's parameters can be added to the optimization targets. To demonstrate the ability of this method to perform various modeling and control tasks, we apply it to a battery model to concurrently estimate its states and parameters.
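The core of the loss can be illustrated on a scalar stand-in for the battery model. In the sketch below (a simplification of the paper's scheme), a network predicts the states, one Euler step of the known dynamics dx/dt = -k*x produces predicted next values, and an output-fit term plus the integration-consistency term are minimized jointly over the network weights and the system parameter k.

```python
import torch

torch.manual_seed(0)
dt, T = 0.1, 50
t = torch.arange(T, dtype=torch.float32).unsqueeze(1) * dt
y_obs = torch.exp(-0.5 * t.squeeze(1)) + 0.01 * torch.randn(T)   # data from true k = 0.5

net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
log_k = torch.nn.Parameter(torch.tensor(0.0))                    # unknown system parameter
opt = torch.optim.Adam(list(net.parameters()) + [log_k], lr=1e-2)

for step in range(2000):
    x = net(t).squeeze(1)                           # NN-predicted states x(t_i)
    x_next = x[:-1] + dt * (-log_k.exp() * x[:-1])  # one Euler step of the known dynamics
    loss = ((x - y_obs) ** 2).mean() + ((x_next - x[1:]) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print("estimated k:", float(log_k.exp()))           # should move toward ~0.5
```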

Making Self-supervised Learning Robust to Spurious Correlation via Learning-speed Aware Sampling

  • paper_url: http://arxiv.org/abs/2311.16361
  • repo_url: None
  • paper_authors: Weicheng Zhu, Sheng Liu, Carlos Fernandez-Granda, Narges Razavian
  • for: Studying self-supervised learning (SSL) on data with spurious correlations between sensitive attributes (e.g., race, gender, age) and downstream labels.
  • methods: Shows that the SSL training loss can be minimized by capturing only a subset of conspicuous features tied to those attributes, observes that learning is slower for samples conflicting with such correlations, and proposes learning-speed aware SSL (LA-SSL), which samples each training example with probability inversely related to its learning speed.
  • results: On three datasets exhibiting spurious correlations between different attributes, LA-SSL improves the robustness of pretrained representations on downstream classification tasks.
    Abstract Self-supervised learning (SSL) has emerged as a powerful technique for learning rich representations from unlabeled data. The data representations are able to capture many underlying attributes of data, and be useful in downstream prediction tasks. In real-world settings, spurious correlations between some attributes (e.g. race, gender and age) and labels for downstream tasks often exist, e.g. cancer is usually more prevalent among elderly patients. In this paper, we investigate SSL in the presence of spurious correlations and show that the SSL training loss can be minimized by capturing only a subset of the conspicuous features relevant to those sensitive attributes, despite the presence of other important predictive features for the downstream tasks. To address this issue, we investigate the learning dynamics of SSL and observe that the learning is slower for samples that conflict with such correlations (e.g. elder patients without cancer). Motivated by these findings, we propose a learning-speed aware SSL (LA-SSL) approach, in which we sample each training data with a probability that is inversely related to its learning speed. We evaluate LA-SSL on three datasets that exhibit spurious correlations between different attributes, demonstrating that it improves the robustness of pretrained representations on downstream classification tasks.
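A schematic of the sampling rule, with a deliberately crude learning-speed estimate (the authors' exact criterion may differ): track how fast each example's loss falls between epochs and resample inversely to that speed, so slow-to-learn examples, those conflicting with the spurious correlation, are seen more often.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
loss_prev = rng.uniform(1.0, 2.0, n)                  # per-sample loss at epoch t-1
loss_curr = loss_prev - rng.uniform(0.0, 0.5, n)      # per-sample loss at epoch t

speed = np.clip(loss_prev - loss_curr, 1e-3, None)    # crude per-sample learning speed
weights = 1.0 / speed                                 # inverse-speed sampling weights
probs = weights / weights.sum()

batch = rng.choice(n, size=128, replace=True, p=probs)
print("mean speed overall:", speed.mean().round(3),
      "| mean speed in batch:", speed[batch].mean().round(3))
```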

Cross Entropy in Deep Learning of Classifiers Is Unnecessary – ISBE Error is All You Need

  • paper_url: http://arxiv.org/abs/2311.16357
  • repo_url: None
  • paper_authors: Wladyslaw Skarbek
  • for: Examining the cost function of deep-learning classifiers and arguing that the cross-entropy computation is redundant.
  • methods: Introduces the ISBE functionality: the entropy is never computed, and during back-propagation the error is sent directly to the model's network rather than being passed backward through the normalization unit.
  • results: On MNIST classifiers (perceptron and convolutional networks), results are not degraded, not only with SoftMax but also with Sigmoid, Tanh, and their hard variants HardSigmoid and HardTanh, while saving up to three percent of the total time of the forward and backward stages.
    Abstract In deep learning classifiers, the cost function usually takes the form of a combination of SoftMax and CrossEntropy functions. The SoftMax unit transforms the scores predicted by the model network into assessments of the degree (probabilities) of an object's membership to a given class. On the other hand, CrossEntropy measures the divergence of this prediction from the distribution of target scores. This work introduces the ISBE functionality, justifying the thesis about the redundancy of cross entropy computation in deep learning of classifiers. Not only can we omit the calculation of entropy, but also, during back-propagation, there is no need to direct the error to the normalization unit for its backward transformation. Instead, the error is sent directly to the model's network. Using examples of perceptron and convolutional networks as classifiers of images from the MNIST collection, it is observed for ISBE that results are not degraded with SoftMax only, but also with other activation functions such as Sigmoid, Tanh, or their hard variants HardSigmoid and HardTanh. Moreover, up to three percent of time is saved within the total time of forward and backward stages. The article is addressed mainly to programmers and students interested in deep model learning. For example, it illustrates in code snippets possible ways to implement ISBE units, but also formally proves that the softmax trick only applies to the class of softmax functions with relocations.
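The trick is easiest to see in autodiff terms: for SoftMax, the gradient of CrossEntropy with respect to the logits is exactly softmax(z) minus the one-hot target, so that error can be injected directly without ever computing an entropy or back-propagating through the normalization unit. A PyTorch sketch (ours, not the paper's code) with a numerical check:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(784, 10)
x = torch.randn(32, 784)
y = F.one_hot(torch.randint(0, 10, (32,)), 10).float()

logits = model(x)
p = F.softmax(logits, dim=1)
err = (p - y).detach() / len(x)        # the ISBE error signal: no entropy computed,
surrogate = (logits * err).sum()       # no backward pass through the SoftMax unit
surrogate.backward()                   # d(surrogate)/d(logits) == err exactly

# Check: CrossEntropy yields the same gradient when the output unit is SoftMax.
ref = model.weight.grad.clone()
model.zero_grad()
F.cross_entropy(model(x), y.argmax(1)).backward()
print(torch.allclose(ref, model.weight.grad, atol=1e-6))   # True
```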

From Reactive to Proactive Volatility Modeling with Hemisphere Neural Networks

  • paper_url: http://arxiv.org/abs/2311.16333
  • repo_url: https://github.com/theaionxgit/aionx
  • paper_authors: Philippe Goulet Coulombe, Mikael Frenette, Karin Klieber
  • for: Reinvigorating maximum likelihood estimation (MLE) for macroeconomic density forecasting.
  • methods: Proposes the Hemisphere Neural Network (HNN) with several key ingredients that make MLE work in this context: (1) the mean and variance hemispheres share a common core at the entrance of the network, accommodating various forms of time variation in the error variance; (2) a volatility emphasis constraint breaks mean/variance indeterminacy in this class of overparametrized nonlinear models; (3) a blocked out-of-bag reality check curbs overfitting in both conditional moments; (4) standard deep-learning software handles large data sets, both computationally and statistically.
  • results: In an extensive out-of-sample experiment against models ranging from classics to modern machine-learning offerings, HNN consistently provides accurate mean/variance forecasts for all targets and horizons; the resulting volatility paths show its versatility, probabilistic-forecasting metrics its reliability, and the machinery can be merged with other structured deep-learning models, as shown by revisiting Goulet Coulombe (2022)'s Neural Phillips Curve.
    Abstract We reinvigorate maximum likelihood estimation (MLE) for macroeconomic density forecasting through a novel neural network architecture with dedicated mean and variance hemispheres. Our architecture features several key ingredients making MLE work in this context. First, the hemispheres share a common core at the entrance of the network which accommodates for various forms of time variation in the error variance. Second, we introduce a volatility emphasis constraint that breaks mean/variance indeterminacy in this class of overparametrized nonlinear models. Third, we conduct a blocked out-of-bag reality check to curb overfitting in both conditional moments. Fourth, the algorithm utilizes standard deep learning software and thus handles large data sets - both computationally and statistically. Ergo, our Hemisphere Neural Network (HNN) provides proactive volatility forecasts based on leading indicators when it can, and reactive volatility based on the magnitude of previous prediction errors when it must. We evaluate point and density forecasts with an extensive out-of-sample experiment and benchmark against a suite of models ranging from classics to more modern machine learning-based offerings. In all cases, HNN fares well by consistently providing accurate mean/variance forecasts for all targets and horizons. Studying the resulting volatility paths reveals its versatility, while probabilistic forecasting evaluation metrics showcase its enviable reliability. Finally, we also demonstrate how this machinery can be merged with other structured deep learning models by revisiting Goulet Coulombe (2022)'s Neural Phillips Curve.
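The shared-core, two-hemisphere layout with a Gaussian MLE loss is compact to write down. The sketch below shows that skeleton on synthetic heteroskedastic data; the paper's volatility emphasis constraint and blocked out-of-bag checks are omitted.

```python
import torch

torch.manual_seed(0)
X = torch.randn(512, 8)                                         # leading indicators
y = X[:, 0] + (0.2 + 0.5 * X[:, 1].abs()) * torch.randn(512)    # heteroskedastic target

core = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.ReLU())   # common core
head_mu = torch.nn.Linear(32, 1)                                # conditional-mean hemisphere
head_logvar = torch.nn.Linear(32, 1)                            # conditional-variance hemisphere
params = (list(core.parameters()) + list(head_mu.parameters())
          + list(head_logvar.parameters()))
opt = torch.optim.Adam(params, lr=1e-2)

for step in range(500):
    h = core(X)
    mu, logvar = head_mu(h).squeeze(1), head_logvar(h).squeeze(1)
    nll = 0.5 * (logvar + (y - mu) ** 2 / logvar.exp()).mean()  # Gaussian MLE loss
    opt.zero_grad(); nll.backward(); opt.step()

print("average predicted sigma:", float(logvar.exp().sqrt().mean()))
```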

Target-Free Compound Activity Prediction via Few-Shot Learning

  • paper_url: http://arxiv.org/abs/2311.16328
  • repo_url: None
  • paper_authors: Peter Eckmann, Jake Anderson, Michael K. Gilson, Rose Yu
  • for: Predicting compound activities against protein-based or phenotypic assays from only a few known compounds and their activities, a common task in target-free drug discovery.
  • methods: A neural architecture that meta-learns continuous compound activities across large bioactivity datasets, aggregating encodings of the known compounds and their activities to capture assay information, with a separate encoder for the unknown compound.
  • results: FS-CAP surpasses traditional similarity-based techniques as well as other state-of-the-art few-shot learning methods on a variety of target-free drug discovery settings and datasets.
    Abstract Predicting the activities of compounds against protein-based or phenotypic assays using only a few known compounds and their activities is a common task in target-free drug discovery. Existing few-shot learning approaches are limited to predicting binary labels (active/inactive). However, in real-world drug discovery, degrees of compound activity are highly relevant. We study Few-Shot Compound Activity Prediction (FS-CAP) and design a novel neural architecture to meta-learn continuous compound activities across large bioactivity datasets. Our model aggregates encodings generated from the known compounds and their activities to capture assay information. We also introduce a separate encoder for the unknown compound. We show that FS-CAP surpasses traditional similarity-based techniques as well as other state of the art few-shot learning methods on a variety of target-free drug discovery settings and datasets.
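A schematic of the architecture (layer sizes and the mean-pooling aggregator are illustrative assumptions): encodings of the known (compound, activity) pairs are pooled into an assay context vector, a separate encoder embeds the query compound, and a head predicts its continuous activity.

```python
import torch

torch.manual_seed(0)
D = 64                                           # compound fingerprint dimension
ctx_enc = torch.nn.Linear(D + 1, 128)            # encodes [fingerprint, activity] pairs
qry_enc = torch.nn.Linear(D, 128)                # separate encoder for the unknown compound
head = torch.nn.Sequential(torch.nn.Linear(256, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))

support_x = torch.randn(8, D)                    # 8 known compounds for this assay
support_a = torch.randn(8, 1)                    # their measured activities
query_x = torch.randn(1, D)                      # the compound whose activity we predict

ctx = ctx_enc(torch.cat([support_x, support_a], dim=1)).mean(dim=0, keepdim=True)
pred = head(torch.cat([ctx, qry_enc(query_x)], dim=1))
print("predicted activity:", float(pred))        # trained episodically across many assays
```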

Quantum-classical simulation of quantum field theory by quantum circuit learning

  • paper_url: http://arxiv.org/abs/2311.16297
  • repo_url: None
  • paper_authors: Kazuki Ikeda
  • for: Simulating quantum field theories (QFTs) on quantum computers.
  • methods: Quantum circuit learning with a compact configuration of qubits and low-depth quantum circuits; a single-qubit measurement suffices to forecast various physical parameters, including fully-connected operators.
  • results: Accurate predictions of quench dynamics, chiral dynamics, and jet production in a 1+1-dimensional model of quantum electrodynamics, closely aligning with rigorous classical calculations.
    Abstract We employ quantum circuit learning to simulate quantum field theories (QFTs). Typically, when simulating QFTs with quantum computers, we encounter significant challenges due to the technical limitations of quantum devices when implementing the Hamiltonian using Pauli spin matrices. To address this challenge, we leverage quantum circuit learning, employing a compact configuration of qubits and low-depth quantum circuits to predict real-time dynamics in quantum field theories. The key advantage of this approach is that a single-qubit measurement can accurately forecast various physical parameters, including fully-connected operators. To demonstrate the effectiveness of our method, we use it to predict quench dynamics, chiral dynamics and jet production in a 1+1-dimensional model of quantum electrodynamics. We find that our predictions closely align with the results of rigorous classical calculations, exhibiting a high degree of accuracy. This hybrid quantum-classical approach illustrates the feasibility of efficiently simulating large-scale QFTs on cutting-edge quantum devices.

A statistical approach to latent dynamic modeling with differential equations

  • paper_url: http://arxiv.org/abs/2311.16286
  • repo_url: https://github.com/maren-ha/latentdynamics.jl
  • paper_authors: Maren Hackenberg, Astrid Pechmann, Clemens Kreutz, Janbernd Kirschner, Harald Binder
  • for: Mechanistic modeling of temporally local changes of processes with ordinary differential equations (ODEs), for statistical modeling of longitudinal cohort data.
  • methods: Uses each observation in the course of time as an initial value to obtain multiple local ODE solutions and builds a combined estimator of the underlying dynamics; neural networks provide a low-dimensional latent space from a potentially large number of variables, as well as patient-specific ODE parameters from baseline variables, enabled by differentiable programming.
  • results: Illustrated in an application with spinal muscular atrophy patients and a corresponding simulation study, contrasting the modeling of local changes in health status with the interpretation of functions obtained from a global regression.
    Abstract Ordinary differential equations (ODEs) can provide mechanistic models of temporally local changes of processes, where parameters are often informed by external knowledge. While ODEs are popular in systems modeling, they are less established for statistical modeling of longitudinal cohort data, e.g., in a clinical setting. Yet, modeling of local changes could also be attractive for assessing the trajectory of an individual in a cohort in the immediate future given its current status, where ODE parameters could be informed by further characteristics of the individual. However, several hurdles so far limit such use of ODEs, as compared to regression-based function fitting approaches. The potentially higher level of noise in cohort data might be detrimental to ODEs, as the shape of the ODE solution heavily depends on the initial value. In addition, larger numbers of variables multiply such problems and might be difficult to handle for ODEs. To address this, we propose to use each observation in the course of time as the initial value to obtain multiple local ODE solutions and build a combined estimator of the underlying dynamics. Neural networks are used for obtaining a low-dimensional latent space for dynamic modeling from a potentially large number of variables, and for obtaining patient-specific ODE parameters from baseline variables. Simultaneous identification of dynamic models and of a latent space is enabled by recently developed differentiable programming techniques. We illustrate the proposed approach in an application with spinal muscular atrophy patients and a corresponding simulation study. In particular, modeling of local changes in health status at any point in time is contrasted to the interpretation of functions obtained from a global regression. This more generally highlights how different application settings might demand different modeling strategies.
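The "each observation as an initial value" idea can be shown on a scalar toy system, leaving out the neural-network latent space. For a candidate parameter, every noisy observation is integrated forward one step as a local ODE solution, and a combined least-squares criterion over all local predictions estimates the dynamics.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
t = np.linspace(0, 2, 21)
x_obs = np.exp(-0.8 * t) + 0.03 * rng.normal(size=t.size)    # noisy data, true theta = -0.8

def combined_loss(theta):
    err = 0.0
    for i in range(len(t) - 1):                              # each observation is an initial value
        sol = solve_ivp(lambda s, x: theta * x, (t[i], t[i + 1]), [x_obs[i]])
        err += (sol.y[0, -1] - x_obs[i + 1]) ** 2            # local solution vs. next observation
    return err

theta_hat = minimize_scalar(combined_loss, bounds=(-3.0, 0.0), method="bounded").x
print("estimated theta:", round(theta_hat, 3))               # close to -0.8
```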

Practical Layout-Aware Analog/Mixed-Signal Design Automation with Bayesian Neural Networks

  • paper_url: http://arxiv.org/abs/2311.17073
  • repo_url: None
  • paper_authors: Ahmet F. Budak, Keren Zhu, David Z. Pan
  • for: Making practical analog/mixed-signal design automation efficient for circuits that are expensive to simulate.
  • methods: A learning-based algorithm that can be trained with a small amount of data and therefore scales to tasks with expensive simulations; Bayesian Neural Networks serve as the regression model approximating circuit performance, and layout-aware optimization is handled as a multi-fidelity problem that exploits correlations from cheaper evaluations.
  • results: Three test cases, covering post-layout performance optimization and schematic-level sizing, show the approach is more efficient than conventional baselines and state-of-the-art algorithms.
    Abstract The high simulation cost has been a bottleneck of practical analog/mixed-signal design automation. Many learning-based algorithms require thousands of simulated data points, which is impractical for expensive to simulate circuits. We propose a learning-based algorithm that can be trained using a small amount of data and, therefore, scalable to tasks with expensive simulations. Our efficient algorithm solves the post-layout performance optimization problem where simulations are known to be expensive. Our comprehensive study also solves the schematic-level sizing problem. For efficient optimization, we utilize Bayesian Neural Networks as a regression model to approximate circuit performance. For layout-aware optimization, we handle the problem as a multi-fidelity optimization problem and improve efficiency by exploiting the correlations from cheaper evaluations. We present three test cases to demonstrate the efficiency of our algorithms. Our tests prove that the proposed approach is more efficient than conventional baselines and state-of-the-art algorithms.

Have we built machines that think like people?

  • paper_url: http://arxiv.org/abs/2311.16093
  • repo_url: https://github.com/lsbuschoff/multimodal
  • paper_authors: Luca M. Schulze Buschoff, Elif Akata, Matthias Bethge, Eric Schulz
  • for: Evaluating the current state of vision-based large language models in the domains of intuitive physics, causal reasoning, and intuitive psychology.
  • methods: A series of controlled experiments probing the extent to which these models grasp complex physical interactions, causal relationships, and intuitive understanding of others' preferences.
  • results: While the models show notable proficiency in processing and interpreting visual data, they still fall short of human capabilities: they exhibit only a rudimentary understanding of physical laws and causal relationships, and they fail altogether on tasks requiring an intuitive theory of mind.
    Abstract A chief goal of artificial intelligence is to build machines that think like people. Yet it has been argued that deep neural network architectures fail to accomplish this. Researchers have asserted these models' limitations in the domains of causal reasoning, intuitive physics, and intuitive psychology. Yet recent advancements, namely the rise of large language models, particularly those designed for visual processing, have rekindled interest in the potential to emulate human-like cognitive abilities. This paper evaluates the current state of vision-based large language models in the domains of intuitive physics, causal reasoning, and intuitive psychology. Through a series of controlled experiments, we investigate the extent to which these modern models grasp complex physical interactions, causal relationships, and intuitive understanding of others' preferences. Our findings reveal that, while these models demonstrate a notable proficiency in processing and interpreting visual data, they still fall short of human capabilities in these areas. The models exhibit a rudimentary understanding of physical laws and causal relationships, but their performance is hindered by a lack of deeper insights-a key aspect of human cognition. Furthermore, in tasks requiring an intuitive theory of mind, the models fail altogether. Our results emphasize the need for integrating more robust mechanisms for understanding causality, physical dynamics, and social cognition into modern-day, vision-based language models, and point out the importance of cognitively-inspired benchmarks.

XLB: Distributed Multi-GPU Lattice Boltzmann Simulation Framework for Differentiable Scientific Machine Learning

  • paper_url: http://arxiv.org/abs/2311.16080
  • repo_url: https://github.com/autodesk/xlb
  • paper_authors: Mohammadmehdi Ataei, Hesam Salehipour
  • for: Introducing XLB, a Python-based differentiable lattice Boltzmann method (LBM) library built on the JAX framework, designed to scale effectively across CPU, multi-GPU, and distributed multi-GPU systems.
  • methods: An architecture predicated on accessibility, extensibility, and computational performance; it can readily be augmented with novel boundary conditions, collision models, or simulation capabilities, and it integrates with JAX's machine-learning ecosystem, including automatic differentiation for physics-based machine learning, optimization, and inverse problems.
  • results: XLB has been scaled to simulations with billions of cells, achieving giga-scale lattice updates per second.
    Abstract The lattice Boltzmann method (LBM) has emerged as a prominent technique for solving fluid dynamics problems due to its algorithmic potential for computational scalability. We introduce XLB framework, a Python-based differentiable LBM library which harnesses the capabilities of the JAX framework. The architecture of XLB is predicated upon ensuring accessibility, extensibility, and computational performance, enabling scaling effectively across CPU, multi-GPU, and distributed multi-GPU systems. The framework can be readily augmented with novel boundary conditions, collision models, or simulation capabilities. XLB offers the unique advantage of integration with JAX's extensive machine learning echosystem, and the ability to utilize automatic differentiation for tackling physics-based machine learning, optimization, and inverse problems. XLB has been successfully scaled to handle simulations with billions of cells, achieving giga-scale lattice updates per second. XLB is released under the permissive Apache-2.0 license and is available on GitHub at https://github.com/Autodesk/XLB.
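XLB's own API is not reproduced here; instead, the sketch below writes a generic single-population D1Q3 BGK lattice Boltzmann update directly in JAX to illustrate the property the library builds on: the whole stream-and-collide loop is jit-compilable and end-to-end differentiable, here taking a gradient of a final-state statistic with respect to the relaxation time.

```python
import jax
import jax.numpy as jnp

C = jnp.array([-1, 0, 1])                 # D1Q3 lattice velocities
W = jnp.array([1 / 6, 2 / 3, 1 / 6])      # lattice weights (sound speed c_s^2 = 1/3)

def equilibrium(rho, u):
    cu = 3.0 * C[:, None] * u[None, :]    # c_i u / c_s^2
    return W[:, None] * rho[None, :] * (1 + cu + 0.5 * cu**2 - 1.5 * u[None, :] ** 2)

@jax.jit
def simulate(tau, f, steps=100):
    def step(f, _):
        rho = f.sum(axis=0)
        u = (C[:, None] * f).sum(axis=0) / rho
        f = f + (equilibrium(rho, u) - f) / tau                       # BGK collision
        f = jnp.stack([jnp.roll(f[i], int(C[i])) for i in range(3)])  # streaming
        return f, None
    f, _ = jax.lax.scan(step, f, None, length=steps)
    return f.sum(axis=0)                                              # final density field

x = jnp.linspace(0.0, 1.0, 128)
f0 = equilibrium(1.0 + 0.1 * jnp.sin(2 * jnp.pi * x), jnp.zeros(128))
grad_tau = jax.grad(lambda tau: simulate(tau, f0).var())(0.8)         # differentiate the sim
print("d var(rho) / d tau =", grad_tau)
```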

DGR: Tackling Drifted and Correlated Noise in Quantum Error Correction via Decoding Graph Re-weighting

  • paper_url: http://arxiv.org/abs/2311.16214
  • repo_url: None
  • paper_authors: Hanrui Wang, Pengyu Liu, Yilian Liu, Jiaqi Gu, Jonathan Baker, Frederic T. Chong, Song Han
  • for: Improving quantum error correction (QEC), in which quantum information is encoded distributively across data qubits and syndrome qubits check parity, under the drifted and correlated noise of real quantum hardware.
  • methods: Proposes DGR, an efficient decoding-graph edge re-weighting strategy with no quantum overhead: by counting the occurrences of edges and edge pairs in decoded matchings, it statistically estimates up-to-date edge probabilities and their correlations, then applies alignment re-weighting and correlation re-weighting to the MWPM decoder's weights.
  • results: Extensive evaluations on the surface code and honeycomb code show that DGR reduces the logical error rate by 3.6x under average-case noise mismatch, with more than a 5000x improvement under worst-case mismatch.
    Abstract Quantum hardware suffers from high error rates and noise, which makes directly running applications on them ineffective. Quantum Error Correction (QEC) is a critical technique towards fault tolerance which encodes the quantum information distributively in multiple data qubits and uses syndrome qubits to check parity. Minimum-Weight-Perfect-Matching (MWPM) is a popular QEC decoder that takes the syndromes as input and finds the matchings between syndromes that infer the errors. However, there are two paramount challenges for MWPM decoders. First, as noise in real quantum systems can drift over time, there is a potential misalignment with the decoding graph's initial weights, leading to a severe performance degradation in the logical error rates. Second, while the MWPM decoder addresses independent errors, it falls short when encountering correlated errors typical on real hardware, such as those in the 2Q depolarizing channel. We propose DGR, an efficient decoding graph edge re-weighting strategy with no quantum overhead. It leverages the insight that the statistics of matchings across decoding iterations offer rich information about errors on real quantum hardware. By counting the occurrences of edges and edge pairs in decoded matchings, we can statistically estimate the up-to-date probabilities of each edge and the correlations between them. The reweighting process includes two vital steps: alignment re-weighting and correlation re-weighting. The former updates the MWPM weights based on statistics to align with actual noise, and the latter adjusts the weight considering edge correlations. Extensive evaluations on surface code and honeycomb code under various settings show that DGR reduces the logical error rate by 3.6x on average-case noise mismatch with exceeding 5000x improvement under worst-case mismatch.
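The alignment re-weighting step can be distilled to a few lines (correlation re-weighting is omitted): count how often each decoding-graph edge appears in the matchings returned over many shots, turn the counts into up-to-date error probabilities, and refresh the MWPM weights as log((1 - p) / p). The matchings below are made-up placeholders.

```python
import math
from collections import Counter

# Matchings returned over previous decoding shots; each is a set of edges (u, v).
decoded_matchings = [
    {(0, 1), (2, 3)},
    {(0, 1)},
    {(0, 1), (2, 3), (4, 5)},
    {(2, 3)},
]
shots = len(decoded_matchings)
counts = Counter(e for m in decoded_matchings for e in m)

new_weights = {}
for edge, c in counts.items():
    p = min(max(c / shots, 1e-9), 1 - 1e-9)       # up-to-date estimate of the edge error rate
    new_weights[edge] = math.log((1 - p) / p)     # standard MWPM weight for error probability p
print(new_weights)
```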

Metric Space Magnitude for Evaluating Unsupervised Representation Learning

  • paper_url: http://arxiv.org/abs/2311.16054
  • repo_url: None
  • paper_authors: Katharina Limbeck, Rayna Andreeva, Rik Sarkar, Bastian Rieck
  • for: Using metric space magnitude, a recently established invariant that measures the "effective size" of a space across multiple scales and captures both geometric and topological properties, to evaluate unsupervised representation learning.
  • methods: Formalizes a novel notion of dissimilarity between magnitude functions of finite metric spaces and derives from it a quality measure for dimensionality reduction tasks; the measure is provably stable under perturbations of the data and can be calculated efficiently.
  • results: The measure enables a rigorous multi-scale comparison of embeddings, with utility demonstrated in an experimental suite spanning different domains and tasks, including the comparison of data visualisations.
    Abstract The magnitude of a metric space was recently established as a novel invariant, providing a measure of the `effective size' of a space across multiple scales. By capturing both geometrical and topological properties of data, magnitude is poised to address challenges in unsupervised representation learning tasks. We formalise a novel notion of dissimilarity between magnitude functions of finite metric spaces and use them to derive a quality measure for dimensionality reduction tasks. Our measure is provably stable under perturbations of the data, can be efficiently calculated, and enables a rigorous multi-scale comparison of embeddings. We show the utility of our measure in an experimental suite that comprises different domains and tasks, including the comparison of data visualisations.
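The magnitude of a finite metric space at scale t has a simple linear-algebra definition: with similarity matrix Z_ij = exp(-t * d(x_i, x_j)), solve Z w = 1 and sum the weighting w. Sweeping t traces the multi-scale "effective size" profile on which the paper's dissimilarity between magnitude functions is built; the sketch below is a direct implementation of that definition, not the authors' code.

```python
import numpy as np
from scipy.spatial.distance import cdist

def magnitude(X, t):
    Z = np.exp(-t * cdist(X, X))              # similarity matrix at scale t
    w = np.linalg.solve(Z, np.ones(len(X)))   # magnitude weighting: Z w = 1
    return w.sum()

X = np.random.default_rng(0).normal(size=(100, 2))
for t in (0.1, 1.0, 10.0, 100.0):
    print(f"t = {t:>5}: magnitude = {magnitude(X, t):.2f}")
# grows from ~1 (all points blur together) toward 100 (every point fully resolved)
```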

A Neural Framework for Generalized Causal Sensitivity Analysis

  • paper_url: http://arxiv.org/abs/2311.16026
  • repo_url: None
  • paper_authors: Dennis Frauen, Fergus Imrie, Alicia Curth, Valentyn Melnychuk, Stefan Feuerriegel, Mihaela van der Schaar
  • for: Proposing NeuralCSA, a neural framework for generalized causal sensitivity analysis, to draw causal conclusions under unobserved confounding.
  • methods: Learns a latent distribution shift corresponding to a treatment intervention using two conditional normalizing flows; the framework is compatible with a large class of sensitivity models (the marginal sensitivity model, f-sensitivity models, and Rosenbaum's sensitivity model), with binary and continuous treatments, and with different causal queries, including (conditional) average treatment effects and simultaneous effects on multiple outcomes.
  • results: Theoretical guarantees show that NeuralCSA infers valid bounds on the causal query of interest, demonstrated empirically on both simulated and real-world data.
    Abstract Unobserved confounding is common in many applications, making causal inference from observational data challenging. As a remedy, causal sensitivity analysis is an important tool to draw causal conclusions under unobserved confounding with mathematical guarantees. In this paper, we propose NeuralCSA, a neural framework for generalized causal sensitivity analysis. Unlike previous work, our framework is compatible with (i) a large class of sensitivity models, including the marginal sensitivity model, f-sensitivity models, and Rosenbaum's sensitivity model; (ii) different treatment types (i.e., binary and continuous); and (iii) different causal queries, including (conditional) average treatment effects and simultaneous effects on multiple outcomes. The generality of NeuralCSA is achieved by learning a latent distribution shift that corresponds to a treatment intervention using two conditional normalizing flows. We provide theoretical guarantees that NeuralCSA is able to infer valid bounds on the causal query of interest and also demonstrate this empirically using both simulated and real-world data.

Scheduling and Communication Schemes for Decentralized Federated Learning

  • paper_url: http://arxiv.org/abs/2311.16021
  • repo_url: None
  • paper_authors: Bahaa-Eldin Ali Abdelghany, Ana Fernández-Vilas, Manuel Fernández-Veiga, Nashwa El-Bendary, Ammar M. Hassan, Walid M. Abdelmoez
  • for: Improving the scalability and learning performance of federated learning (FL) across a network of agents with arbitrary topology.
  • methods: A decentralized federated learning (DFL) model based on stochastic gradient descent (SGD), with three proposed scheduling policies for the communication between clients and parallel servers (a gossip-style toy sketch follows the abstract).
  • results: In a totally decentralized implementation of SGD, the proposed scheduling policies affect both the speed of convergence and the final global model.
    Abstract Federated learning (FL) is a distributed machine learning paradigm in which a large number of clients coordinate with a central server to learn a model without sharing their own training data. One central server is not enough, due to problems of connectivity with clients. In this paper, a decentralized federated learning (DFL) model with the stochastic gradient descent (SGD) algorithm has been introduced, as a more scalable approach to improve the learning performance in a network of agents with arbitrary topology. Three scheduling policies for DFL have been proposed for communications between the clients and the parallel servers, and the convergence, accuracy, and loss have been tested in a totally decentralized implementation of SGD. The experimental results show that the proposed scheduling policies have an impact both on the speed of convergence and in the final global model.
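
The scheduling policies themselves are not reproduced here; as a hedged illustration of the setting, the toy below runs fully decentralized SGD where each agent takes a local gradient step and then averages parameters with its neighbors through a doubly-stochastic mixing matrix (the ring topology and step size are our assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, dim = 5, 10

# Doubly-stochastic mixing matrix for a ring topology (illustrative).
W = np.zeros((n_agents, n_agents))
for i in range(n_agents):
    W[i, i] = 0.5
    W[i, (i - 1) % n_agents] = 0.25
    W[i, (i + 1) % n_agents] = 0.25

# Each agent holds local data for a least-squares problem.
A = [rng.normal(size=(20, dim)) for _ in range(n_agents)]
b = [rng.normal(size=20) for _ in range(n_agents)]
x = np.zeros((n_agents, dim))  # one parameter vector per agent

lr = 0.01
for step in range(200):
    grads = np.stack([A[i].T @ (A[i] @ x[i] - b[i]) for i in range(n_agents)])
    x = W @ (x - lr * grads)  # local SGD step followed by gossip averaging
```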

Using Decentralized Aggregation for Federated Learning with Differential Privacy

  • paper_url: http://arxiv.org/abs/2311.16008
  • repo_url: None
  • paper_authors: Hadeel Abd El-Kareem, Abd El-Moaty Saleh, Ana Fernández-Vilas, Manuel Fernández-Veiga, asser El-Sonbaty
  • for: Providing an experimental environment for federated learning (FL) with differential privacy (DP), to preserve user privacy in scenarios that combine exchanged communications, big databases, and distributed, collaborative (P2P) machine learning.
  • methods: DP is applied on top of FL to obtain a stronger level of privacy, studying which parameters and techniques are appropriate (a minimal clip-and-noise sketch follows the abstract).
  • results: The choice of DP parameters and techniques is central to the privacy-utility trade-off, as shown by means of a classification example.
    Abstract Nowadays, the ubiquitous usage of mobile devices and networks have raised concerns about the loss of control over personal data and research advance towards the trade-off between privacy and utility in scenarios that combine exchange communications, big databases and distributed and collaborative (P2P) Machine Learning techniques. On the other hand, although Federated Learning (FL) provides some level of privacy by retaining the data at the local node, which executes a local training to enrich a global model, this scenario is still susceptible to privacy breaches as membership inference attacks. To provide a stronger level of privacy, this research deploys an experimental environment for FL with Differential Privacy (DP) using benchmark datasets. The obtained results show that the election of parameters and techniques of DP is central in the aforementioned trade-off between privacy and utility by means of a classification example.
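
As a minimal sketch of how DP is commonly layered on FL, assuming the usual clip-and-noise Gaussian mechanism on client updates (the clipping norm and noise multiplier below are illustrative placeholders, not the paper's settings):

```python
import numpy as np

def privatize_update(update: np.ndarray, clip_norm: float,
                     noise_multiplier: float, rng) -> np.ndarray:
    """Clip a client update to a bounded L2 norm, then add Gaussian noise."""
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / (norm + 1e-12))
    noise = rng.normal(scale=noise_multiplier * clip_norm, size=update.shape)
    return clipped + noise

rng = np.random.default_rng(0)
client_updates = [rng.normal(size=100) for _ in range(8)]
private = [privatize_update(u, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
           for u in client_updates]
aggregated = np.mean(private, axis=0)  # aggregation sees only noised updates
```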

Improved Data Generation for Enhanced Asset Allocation: A Synthetic Dataset Approach for the Fixed Income Universe

  • paper_url: http://arxiv.org/abs/2311.16004
  • repo_url: None
  • paper_authors: Szymon Kubiak, Tillman Weyde, Oleksandr Galkin, Dan Philps, Ram Gopal
  • for: A novel process for generating synthetic datasets tailored to assessing asset allocation methods and constructing portfolios within the fixed income universe.
  • methods: The CorrGAN model is first enhanced to generate synthetic correlation matrices; an Encoder-Decoder model then samples additional data conditioned on a given correlation matrix (a simplified stand-in is sketched after the abstract).
  • results: The resulting synthetic dataset enables in-depth analyses of asset allocation methods across diverse asset universes; a case study shows how it improves portfolios built within a simulation-based asset allocation process.
    Abstract We present a novel process for generating synthetic datasets tailored to assess asset allocation methods and construct portfolios within the fixed income universe. Our approach begins by enhancing the CorrGAN model to generate synthetic correlation matrices. Subsequently, we propose an Encoder-Decoder model that samples additional data conditioned on a given correlation matrix. The resulting synthetic dataset facilitates in-depth analyses of asset allocation methods across diverse asset universes. Additionally, we provide a case study that exemplifies the use of the synthetic dataset to improve portfolios constructed within a simulation-based asset allocation process.
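
As a simplified stand-in for the paper's Encoder-Decoder sampler, the snippet below draws Gaussian data conditioned on a given correlation matrix via its Cholesky factor; the learned model is far richer, and the matrix here is a hand-written placeholder rather than CorrGAN output.

```python
import numpy as np

def sample_given_correlation(corr: np.ndarray, n_samples: int, rng) -> np.ndarray:
    """Draw Gaussian samples whose population correlation matrix is `corr`."""
    L = np.linalg.cholesky(corr)          # corr must be positive definite
    z = rng.normal(size=(n_samples, corr.shape[0]))
    return z @ L.T

rng = np.random.default_rng(0)
corr = np.array([[1.0, 0.6, 0.3],
                 [0.6, 1.0, 0.5],
                 [0.3, 0.5, 1.0]])        # stand-in for a generated matrix
returns = sample_given_correlation(corr, n_samples=5000, rng=rng)
```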

Closing the ODE-SDE gap in score-based diffusion models through the Fokker-Planck equation

  • paper_url: http://arxiv.org/abs/2311.15996
  • repo_url: None
  • paper_authors: Teo Deveney, Jan Stanczuk, Lisa Maria Kreusser, Chris Budd, Carola-Bibiane Schönlieb
  • for: Rigorously describing the range of dynamics and approximations that arise when training score-based diffusion models, and the gap between their ODE- and SDE-induced distributions.
  • methods: The analysis rests on stochastic differential equations (SDEs) and ordinary differential equations (ODEs), their neural approximations, and the associated Fokker-Planck equations (the two samplers are sketched after the abstract).
  • results: The ODE-SDE difference is linked to a Fokker-Planck residual, with a theoretical upper bound on the Wasserstein 2-distance between the induced distributions; adding the residual as a regularisation term numerically closes the gap, improving ODE samples, though possibly at the cost of degraded SDE sample quality.
    Abstract Score-based diffusion models have emerged as one of the most promising frameworks for deep generative modelling, due to their state-of-the art performance in many generation tasks while relying on mathematical foundations such as stochastic differential equations (SDEs) and ordinary differential equations (ODEs). Empirically, it has been reported that ODE based samples are inferior to SDE based samples. In this paper we rigorously describe the range of dynamics and approximations that arise when training score-based diffusion models, including the true SDE dynamics, the neural approximations, the various approximate particle dynamics that result, as well as their associated Fokker--Planck equations and the neural network approximations of these Fokker--Planck equations. We systematically analyse the difference between the ODE and SDE dynamics of score-based diffusion models, and link it to an associated Fokker--Planck equation. We derive a theoretical upper bound on the Wasserstein 2-distance between the ODE- and SDE-induced distributions in terms of a Fokker--Planck residual. We also show numerically that conventional score-based diffusion models can exhibit significant differences between ODE- and SDE-induced distributions which we demonstrate using explicit comparisons. Moreover, we show numerically that reducing the Fokker--Planck residual by adding it as an additional regularisation term leads to closing the gap between ODE- and SDE-induced distributions. Our experiments suggest that this regularisation can improve the distribution generated by the ODE, however that this can come at the cost of degraded SDE sample quality.
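
To make the two samplers concrete, here is a toy with an analytic score (data ~ N(0, 1) under the forward SDE dx = dW), so there is no neural approximation error: the reverse-time SDE and the probability-flow ODE then agree in distribution, and the paper's point is that a Fokker-Planck residual in a *learned* score is what drives them apart.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma0_sq, T, n_steps, n = 1.0, 5.0, 500, 10_000
dt = T / n_steps

def score(x, t):
    """Analytic score for data ~ N(0, sigma0_sq) under dx = dW,
    so that p_t = N(0, sigma0_sq + t)."""
    return -x / (sigma0_sq + t)

x_sde = rng.normal(scale=np.sqrt(sigma0_sq + T), size=n)
x_ode = x_sde.copy()
for k in range(n_steps):
    t = T - k * dt
    # Reverse-time SDE (Euler-Maruyama step, g(t) = 1):
    x_sde += score(x_sde, t) * dt + np.sqrt(dt) * rng.normal(size=n)
    # Probability-flow ODE (Euler step) -- same marginals in theory:
    x_ode += 0.5 * score(x_ode, t) * dt

print(x_sde.std(), x_ode.std())  # both should approach sqrt(sigma0_sq) = 1
```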

Sensitivity-Based Layer Insertion for Residual and Feedforward Neural Networks

  • paper_url: http://arxiv.org/abs/2311.15995
  • repo_url: https://github.com/leoniekreis/layer_insertion_sensitivity_based
  • paper_authors: Evelyn Herberg, Roland Herzog, Frederik Köhne, Leonie Kreis, Anton Schiela
  • for: Making neural network training more efficient and automated by removing the need to fix the network size before training.
  • methods: Borrowing from constrained optimization, new layers are inserted during training based on first-order sensitivity information of the objective with respect to the virtual parameters that additional layers would offer, for fully connected feedforward and residual networks (a rough sketch follows the abstract).
  • results: In numerical experiments, sensitivity-based layer insertion improves training decay compared with not inserting the layer, at reduced computational effort compared with inserting the layer from the beginning.
    Abstract The training of neural networks requires tedious and often manual tuning of the network architecture. We propose a systematic method to insert new layers during the training process, which eliminates the need to choose a fixed network size before training. Our technique borrows techniques from constrained optimization and is based on first-order sensitivity information of the objective with respect to the virtual parameters that additional layers, if inserted, would offer. We consider fully connected feedforward networks with selected activation functions as well as residual neural networks. In numerical experiments, the proposed sensitivity-based layer insertion technique exhibits improved training decay, compared to not inserting the layer. Furthermore, the computational effort is reduced in comparison to inserting the layer from the beginning. The code is available at https://github.com/LeonieKreis/layer_insertion_sensitivity_based.
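
A rough sketch of the idea (our reading, not the authors' code): zero-initialized candidate residual layers act as virtual parameters that leave the network unchanged, and the gradient norm of the loss with respect to them is the first-order sensitivity used to decide where to insert.

```python
import torch
import torch.nn as nn

class CandidateBlock(nn.Module):
    """Zero-initialized residual layer: initially the identity map, but its
    (virtual) parameters still receive informative first-order gradients."""
    def __init__(self, width):
        super().__init__()
        self.lin = nn.Linear(width, width)
        nn.init.zeros_(self.lin.weight)
        nn.init.zeros_(self.lin.bias)

    def forward(self, x):
        return x + torch.tanh(self.lin(x))

width = 16
blocks = nn.ModuleList([nn.Sequential(nn.Linear(width, width), nn.Tanh())
                        for _ in range(3)])
candidates = nn.ModuleList([CandidateBlock(width) for _ in range(4)])
head = nn.Linear(width, 1)

x, y = torch.randn(32, width), torch.randn(32, 1)
h = x
for blk, cand in zip(blocks, candidates):
    h = blk(cand(h))            # virtual layer (currently the identity)
h = candidates[-1](h)
loss = nn.functional.mse_loss(head(h), y)
loss.backward()

# First-order sensitivity of the loss w.r.t. each candidate's parameters:
sens = [cand.lin.weight.grad.norm().item() for cand in candidates]
best = max(range(len(sens)), key=sens.__getitem__)
print(f"insert a new layer at slot {best} (sensitivities: {sens})")
```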

Should We Learn Most Likely Functions or Parameters?

  • paper_url: http://arxiv.org/abs/2311.15990
  • repo_url: https://github.com/activatedgeek/function-space-map
  • paper_authors: Shikai Qiu, Tim G. J. Rudner, Sanyam Kapoor, Andrew Gordon Wilson
  • for: Asking whether standard maximum a posteriori (MAP) estimation over parameters should be replaced by directly estimating the most likely function implied by the model and the data.
  • methods: The paper studies function-space MAP estimation, proves conditions under which the procedure is well-behaved, and derives a scalable approximation.
  • results: Under these conditions, function-space MAP estimation can lead to flatter minima, better generalization, and improved robustness to overfitting, although a naive application to neural networks yields pathological solutions.
    Abstract Standard regularized training procedures correspond to maximizing a posterior distribution over parameters, known as maximum a posteriori (MAP) estimation. However, model parameters are of interest only insomuch as they combine with the functional form of a model to provide a function that can make good predictions. Moreover, the most likely parameters under the parameter posterior do not generally correspond to the most likely function induced by the parameter posterior. In fact, we can re-parametrize a model such that any setting of parameters can maximize the parameter posterior. As an alternative, we investigate the benefits and drawbacks of directly estimating the most likely function implied by the model and the data. We show that this procedure leads to pathological solutions when using neural networks and prove conditions under which the procedure is well-behaved, as well as a scalable approximation. Under these conditions, we find that function-space MAP estimation can lead to flatter minima, better generalization, and improved robustness to overfitting.

Towards Transfer Learning for Large-Scale Image Classification Using Annealing-based Quantum Boltzmann Machines

  • paper_url: http://arxiv.org/abs/2311.15966
  • repo_url: None
  • paper_authors: Daniëlle Schuman, Leo Sünkel, Philipp Altmann, Jonas Stein, Christoph Roch, Thomas Gabor, Claudia Linnhoff-Popien
  • for: Image classification tasks, using quantum transfer learning (QTL) with quantum annealing (QA) to improve classification performance on large-scale data such as medical images.
  • methods: Annealing-based Quantum Boltzmann Machines are used within a hybrid quantum-classical pipeline for supervised training, with Simulated Annealing standing in for actual QA.
  • results: On the three-class COVID-CT-MD dataset of lung Computed Tomography (CT) scan slices, the approach consistently outperforms a classical transfer-learning baseline of the same order of magnitude in test accuracy and AUC-ROC score, while needing fewer training epochs.
    Abstract Quantum Transfer Learning (QTL) recently gained popularity as a hybrid quantum-classical approach for image classification tasks by efficiently combining the feature extraction capabilities of large Convolutional Neural Networks with the potential benefits of Quantum Machine Learning (QML). Existing approaches, however, only utilize gate-based Variational Quantum Circuits for the quantum part of these procedures. In this work we present an approach to employ Quantum Annealing (QA) in QTL-based image classification. Specifically, we propose using annealing-based Quantum Boltzmann Machines as part of a hybrid quantum-classical pipeline to learn the classification of real-world, large-scale data such as medical images through supervised training. We demonstrate our approach by applying it to the three-class COVID-CT-MD dataset, a collection of lung Computed Tomography (CT) scan slices. Using Simulated Annealing as a stand-in for actual QA, we compare our method to classical transfer learning, using a neural network of the same order of magnitude, to display its improved classification performance. We find that our approach consistently outperforms its classical baseline in terms of test accuracy and AUC-ROC-Score and needs less training epochs to do this.

Maximum Likelihood Estimation is All You Need for Well-Specified Covariate Shift

  • paper_url: http://arxiv.org/abs/2311.15961
  • repo_url: None
  • paper_authors: Jiawei Ge, Shange Tang, Jianqing Fan, Cong Ma, Chi Jin
  • for: Identifying the most effective algorithms for out-of-distribution (OOD) generalization under covariate shift.
  • methods: Classical Maximum Likelihood Estimation (MLE) purely on source data, without any modification, analyzed in the well-specified setting; the Maximum Weighted Likelihood Estimator (MWLE) is studied for the misspecified setting (both are sketched after the abstract).
  • results: MLE is minimax optimal for covariate shift in the well-specified setting (up to a constant factor), for a rich class of parametric models and without any boundedness condition on the density ratio; under misspecification, MLE is no longer optimal and MWLE emerges as minimax optimal in certain scenarios.
    Abstract A key challenge of modern machine learning systems is to achieve Out-of-Distribution (OOD) generalization -- generalizing to target data whose distribution differs from that of source data. Despite its significant importance, the fundamental question of ``what are the most effective algorithms for OOD generalization'' remains open even under the standard setting of covariate shift. This paper addresses this fundamental question by proving that, surprisingly, classical Maximum Likelihood Estimation (MLE) purely using source data (without any modification) achieves the minimax optimality for covariate shift under the well-specified setting. That is, no algorithm performs better than MLE in this setting (up to a constant factor), justifying MLE is all you need. Our result holds for a very rich class of parametric models, and does not require any boundedness condition on the density ratio. We illustrate the wide applicability of our framework by instantiating it to three concrete examples -- linear regression, logistic regression, and phase retrieval. This paper further complement the study by proving that, under the misspecified setting, MLE is no longer the optimal choice, whereas Maximum Weighted Likelihood Estimator (MWLE) emerges as minimax optimal in certain scenarios.
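
A small sketch contrasting the two estimators on a well-specified logistic model: plain MLE fit on source data versus MWLE, which weights each source sample by the (here known) density ratio between target and source covariate distributions. Under well-specification, the paper's result says the unweighted fit is already minimax optimal.

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
beta = np.array([1.5])                       # true, well-specified parameter

# Covariate shift: source x ~ N(0, 1), target x ~ N(1, 1); labels follow
# the same logistic model everywhere.
x_src = rng.normal(0.0, 1.0, size=(2000, 1))
y_src = rng.binomial(1, 1.0 / (1.0 + np.exp(-x_src @ beta)))

# MLE: ordinary logistic regression on the source data only.
mle = LogisticRegression(C=1e6).fit(x_src, y_src)   # large C ~ no penalty

# MWLE: importance-weight each source sample by p_target(x) / p_source(x).
w = norm.pdf(x_src[:, 0], loc=1.0) / norm.pdf(x_src[:, 0], loc=0.0)
mwle = LogisticRegression(C=1e6).fit(x_src, y_src, sample_weight=w)

print("MLE:", mle.coef_, "MWLE:", mwle.coef_)   # both near the true 1.5
```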

GloNets: Globally Connected Neural Networks

  • paper_url: http://arxiv.org/abs/2311.15947
  • repo_url: https://github.com/antoniodicecco/glonet
  • paper_authors: Antonio Di Cecco, Carlo Metta, Marco Fantozzi, Francesco Morandin, Maurizio Parton
  • for: Overcoming depth-related performance degradation, which limits the effective depth of neural networks.
  • methods: Globally Connected Neural Networks (GloNet), an architecture that can be superimposed on any model to enhance its depth without increasing complexity or reducing performance: the network's head uniformly receives information from all parts of the network, regardless of their level of abstraction (a minimal sketch follows the abstract).
  • results: GloNet self-regulates information flow during training, reducing the influence of less effective deeper layers and enabling stable training irrespective of network depth, making it a strong alternative to architectures like ResNets.
    Abstract Deep learning architectures suffer from depth-related performance degradation, limiting the effective depth of neural networks. Approaches like ResNet are able to mitigate this, but they do not completely eliminate the problem. We introduce Globally Connected Neural Networks (GloNet), a novel architecture overcoming depth-related issues, designed to be superimposed on any model, enhancing its depth without increasing complexity or reducing performance. With GloNet, the network's head uniformly receives information from all parts of the network, regardless of their level of abstraction. This enables GloNet to self-regulate information flow during training, reducing the influence of less effective deeper layers, and allowing for stable training irrespective of network depth. This paper details GloNet's design, its theoretical basis, and a comparison with existing similar architectures. Experiments show GloNet's self-regulation ability and resilience to depth-related learning challenges, like performance degradation. Our findings suggest GloNet as a strong alternative to traditional architectures like ResNets.
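
A minimal sketch of the architectural idea as the abstract describes it: the head uniformly receives the output of every block; aggregation by summation and the MLP blocks are our assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GloNetSketch(nn.Module):
    """The head sees every block's output, so less useful deeper blocks can
    be down-weighted by learning instead of blocking information flow."""
    def __init__(self, width=64, depth=10, n_classes=10):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(width, width), nn.ReLU())
             for _ in range(depth)]
        )
        self.head = nn.Linear(width, n_classes)

    def forward(self, x):
        collected = torch.zeros_like(x)
        h = x
        for block in self.blocks:
            h = block(h)
            collected = collected + h   # every abstraction level reaches the head
        return self.head(collected)

model = GloNetSketch()
logits = model(torch.randn(8, 64))
```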

Over-Squashing in Riemannian Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2311.15945
  • repo_url: None
  • paper_authors: Julia Balla
  • for: Mitigating over-squashing in graph neural networks (GNNs), where node features become insensitive to information from distant nodes.
  • methods: Hyperbolic GNNs (HGNNs) are generalized to Riemannian manifolds of variable curvature, so that the geometry of the embedding space is faithful to the graph's topology.
  • results: Bounds on the sensitivity of node features as the number of layers increases yield promising theoretical and empirical results for alleviating over-squashing in graphs with negative curvature.
    Abstract Most graph neural networks (GNNs) are prone to the phenomenon of over-squashing in which node features become insensitive to information from distant nodes in the graph. Recent works have shown that the topology of the graph has the greatest impact on over-squashing, suggesting graph rewiring approaches as a suitable solution. In this work, we explore whether over-squashing can be mitigated through the embedding space of the GNN. In particular, we consider the generalization of Hyperbolic GNNs (HGNNs) to Riemannian manifolds of variable curvature in which the geometry of the embedding space is faithful to the graph's topology. We derive bounds on the sensitivity of the node features in these Riemannian GNNs as the number of layers increases, which yield promising theoretical and empirical results for alleviating over-squashing in graphs with negative curvature.

Physics-informed neural networks for transformed geometries and manifolds

  • paper_url: http://arxiv.org/abs/2311.15940
  • repo_url: https://github.com/samuelburbulla/trafo-pinn
  • paper_authors: Samuel Burbulla
  • for: Modeling complex physical systems with PINNs on transformed geometries and manifolds, where standard PINNs struggle.
  • methods: Geometric transformations are integrated within physics-informed neural networks (PINNs) by incorporating a diffeomorphism as a mapping of a reference domain and adapting the derivative computation of the physics-informed loss function (a 1D sketch follows the abstract).
  • results: The approach generalizes PINNs to smoothly deformed domains and lower-dimensional manifolds and allows direct shape optimization while training, demonstrated on the Eikonal equation on an Archimedean spiral, a Poisson problem on a surface manifold, incompressible Stokes flow in a deformed tube, and shape optimization with the Laplace operator.
    Abstract Physics-informed neural networks (PINNs) effectively embed physical principles into machine learning, but often struggle with complex or alternating geometries. We propose a novel method for integrating geometric transformations within PINNs to robustly accommodate geometric variations. Our method incorporates a diffeomorphism as a mapping of a reference domain and adapts the derivative computation of the physics-informed loss function. This generalizes the applicability of PINNs not only to smoothly deformed domains, but also to lower-dimensional manifolds and allows for direct shape optimization while training the network. We demonstrate the effectivity of our approach on several problems: (i) Eikonal equation on Archimedean spiral, (ii) Poisson problem on surface manifold, (iii) Incompressible Stokes flow in deformed tube, and (iv) Shape optimization with Laplace operator. Through these examples, we demonstrate the enhanced flexibility over traditional PINNs, especially under geometric variations. The proposed framework presents an outlook for training deep neural operators over parametrized geometries, paving the way for advanced modeling with PDEs on complex geometries in science and engineering.
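
A minimal 1D sketch of the core mechanism (our toy, not the repository's code): the network is trained on a reference domain and the physics-informed residual of u''(x) = sin(x) is differentiated through a diffeomorphism φ via the chain rule.

```python
import torch

torch.manual_seed(0)
phi = lambda s: s + 0.3 * torch.sin(torch.pi * s)   # diffeomorphism: reference -> physical

net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(),
                          torch.nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def dds(y, s):
    """Derivative of y with respect to the reference coordinate s."""
    return torch.autograd.grad(y, s, torch.ones_like(y), create_graph=True)[0]

for step in range(2000):
    s = torch.rand(128, 1, requires_grad=True)      # sample the reference domain
    x = phi(s)
    u = net(s)
    # Chain rule through the mapping: u_x = u_s / x_s, u_xx = (u_x)_s / x_s.
    x_s = dds(x, s)
    u_x = dds(u, s) / x_s
    u_xx = dds(u_x, s) / x_s
    pde = (u_xx - torch.sin(x)) ** 2                # residual of u'' = sin(x)
    bc = net(torch.zeros(1, 1)) ** 2 + net(torch.ones(1, 1)) ** 2
    loss = pde.mean() + bc.mean()
    opt.zero_grad(); loss.backward(); opt.step()
```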

Towards Responsible Governance of Biological Design Tools

  • paper_url: http://arxiv.org/abs/2311.15936
  • repo_url: None
  • paper_authors: Richard Moulange, Max Langenkamp, Tessa Alexanian, Samuel Curtis, Morgan Livingston
  • for: Responding to rapid advances in biological design tools (BDTs), such as protein structure and sequence prediction models, whose predictive accuracy and novel design capabilities create significant dual-use risks.
  • methods: The paper analyzes why regulatory proposals tailored primarily toward large language models may be less effective for BDTs, which require fewer computational resources to train and are often developed open-source.
  • results: A range of measures is proposed to mitigate misuse, across responsible development, risk assessment, transparency, access management, cybersecurity, and investing in resilience; implementing them will require close coordination between developers and governments.
    Abstract Recent advancements in generative machine learning have enabled rapid progress in biological design tools (BDTs) such as protein structure and sequence prediction models. The unprecedented predictive accuracy and novel design capabilities of BDTs present new and significant dual-use risks. For example, their predictive accuracy allows biological agents, whether vaccines or pathogens, to be developed more quickly, while the design capabilities could be used to discover drugs or evade DNA screening techniques. Similar to other dual-use AI systems, BDTs present a wicked problem: how can regulators uphold public safety without stifling innovation? We highlight how current regulatory proposals that are primarily tailored toward large language models may be less effective for BDTs, which require fewer computational resources to train and are often developed in an open-source manner. We propose a range of measures to mitigate the risk that BDTs are misused, across the areas of responsible development, risk assessment, transparency, access management, cybersecurity, and investing in resilience. Implementing such measures will require close coordination between developers and governments.

The Graph Convolutional Network with Multi-representation Alignment for Drug Synergy Prediction

  • paper_url: http://arxiv.org/abs/2311.16207
  • repo_url: None
  • paper_authors: Xinxing Yang, Genke Yang, Jian Chu
  • for: Predicting the synergistic effect of drug combinations.
  • methods: A graph convolutional network with multi-representation alignment (GCNMRA).
  • results: A multi-representation alignment function suited to drug synergy prediction is proposed so that the positional relationship between drug representations and the cell line representation is reflected in the embedding space; the vector modulus of these representations is also considered, improving accuracy and accelerating convergence, as verified on multiple drug synergy datasets.
    Abstract Drug combination refers to the use of two or more drugs to treat a specific disease at the same time. It is currently the mainstream way to treat complex diseases. Compared with single drugs, drug combinations have better efficacy and can better inhibit toxicity and drug resistance. The computational model based on deep learning concatenates the representation of multiple drugs and the corresponding cell line feature as input, and the output is whether the drug combination can have an inhibitory effect on the cell line. However, this strategy of concatenating multiple representations has the following defects: the alignment of drug representation and cell line representation is ignored, resulting in the synergistic relationship not being reflected positionally in the embedding space. Moreover, the alignment measurement function in deep learning cannot be suitable for drug synergy prediction tasks due to differences in input types. Therefore, in this work, we propose a graph convolutional network with multi-representation alignment (GCNMRA) for predicting drug synergy. In the GCNMRA model, we designed a multi-representation alignment function suitable for the drug synergy prediction task so that the positional relationship between drug representations and cell line representation is reflected in the embedding space. In addition, the vector modulus of drug representations and cell line representation is considered to improve the accuracy of calculation results and accelerate model convergence. Finally, many relevant experiments were run on multiple drug synergy datasets to verify the effectiveness of the above innovative elements and the excellence of the GCNMRA model.

FLASC: A Flare-Sensitive Clustering Algorithm: Extending HDBSCAN* for Detecting Branches in Clusters

  • paper_url: http://arxiv.org/abs/2311.15887
  • repo_url: None
  • paper_authors: D. M. Bot, J. Peeters, J. Liesenborgs, J. Aerts
  • for: Flare-sensitive clustering: detecting branching patterns within clusters that density-based clustering alone does not expose.
  • methods: FLASC builds on HDBSCAN* with a post-processing step that differentiates branches within the detected clusters' manifold; two variants trade computational cost for noise robustness (the baseline behaviour is sketched after the abstract).
  • results: Both variants scale similarly to HDBSCAN* in computational cost and give stable outputs on synthetic data sets; the benefit for data exploration over HDBSCAN* clustering is demonstrated on two real-world data sets.
    Abstract We present FLASC, an algorithm for flare-sensitive clustering. Our algorithm builds upon HDBSCAN* -- which provides high-quality density-based clustering performance -- through a post-processing step that differentiates branches within the detected clusters' manifold, adding a type of pattern that can be discovered. Two variants of the algorithm are presented, which trade computational cost for noise robustness. We show that both variants scale similarly to HDBSCAN* in terms of computational cost and provide stable outputs using synthetic data sets, resulting in an efficient flare-sensitive clustering algorithm. In addition, we demonstrate the algorithm's benefit in data exploration over HDBSCAN* clustering on two real-world data sets.
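
FLASC itself is not sketched here; the snippet shows the HDBSCAN* baseline (via the hdbscan package) on a branched "flare" shape, i.e., the situation where density-based clustering returns a single cluster and FLASC's branch-detecting post-processing would add the missing pattern.

```python
import numpy as np
import hdbscan

rng = np.random.default_rng(0)
# A 'flare' shape: a dense core with two branches growing out of it.
core = rng.normal(0, 0.3, size=(200, 2))
branch1 = np.column_stack([np.linspace(0, 3, 100), rng.normal(0, 0.1, 100)])
branch2 = np.column_stack([rng.normal(0, 0.1, 100), np.linspace(0, 3, 100)])
X = np.vstack([core, branch1, branch2])

# HDBSCAN* tends to see one density-connected cluster here; separating the
# two branches within that cluster's manifold is exactly FLASC's job.
clusterer = hdbscan.HDBSCAN(min_cluster_size=15).fit(X)
print(np.unique(clusterer.labels_))
```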

Nodal Hydraulic Head Estimation through Unscented Kalman Filter for Data-driven Leak Localization in Water Networks

  • paper_url: http://arxiv.org/abs/2311.15875
  • repo_url: None
  • paper_authors: Luis Romero-Ben, Paul Irofti, Florin Stoican, Vicenç Puig
  • for: Estimating nodal hydraulic heads in water distribution networks (WDN), with application to leak localization.
  • methods: An Unscented Kalman Filter (UKF) scheme refines an initial hydraulic state estimate using a prediction model together with available pressure and demand measurements, with customized prediction and data-assimilation steps and dynamically updated prediction-function weight matrices (a generic UKF loop is sketched after the abstract).
  • results: Performance testing on the Modena benchmark under realistic conditions demonstrates the method's effectiveness in enhancing state estimation and data-driven leak localization.
    Abstract In this paper, we present a nodal hydraulic head estimation methodology for water distribution networks (WDN) based on an Unscented Kalman Filter (UKF) scheme with application to leak localization. The UKF refines an initial estimation of the hydraulic state by considering the prediction model, as well as available pressure and demand measurements. To this end, it provides customized prediction and data assimilation steps. Additionally, the method is enhanced by dynamically updating the prediction function weight matrices. Performance testing on the Modena benchmark under realistic conditions demonstrates the method's effectiveness in enhancing state estimation and data-driven leak localization.
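
The paper's customized prediction and assimilation steps are not reproduced here; the sketch shows the generic UKF predict/update loop it builds on, using the filterpy library with a hypothetical toy network model (fx, hx, and all noise levels are placeholders).

```python
import numpy as np
from filterpy.kalman import UnscentedKalmanFilter, MerweScaledSigmaPoints

n = 4  # hypothetical number of network nodes

def fx(x, dt):
    """Hypothetical prediction model: hydraulic heads persist between steps."""
    return x

def hx(x):
    """Hypothetical measurement model: pressure sensors at two of the nodes."""
    return x[[0, 2]]

points = MerweScaledSigmaPoints(n=n, alpha=1e-3, beta=2.0, kappa=0.0)
ukf = UnscentedKalmanFilter(dim_x=n, dim_z=2, dt=1.0, fx=fx, hx=hx,
                            points=points)
ukf.x = np.full(n, 50.0)         # initial head estimate [m]
ukf.P *= 10.0                    # initial uncertainty
ukf.R = np.eye(2) * 0.5          # measurement noise
ukf.Q = np.eye(n) * 0.1          # process noise

for z in [np.array([49.8, 48.9]), np.array([49.7, 48.7])]:
    ukf.predict()
    ukf.update(z)
print(ukf.x)                     # refined nodal head estimates
```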

A precise symbolic emulator of the linear matter power spectrum

  • paper_url: http://arxiv.org/abs/2311.15865
  • repo_url: https://github.com/deaglanbartlett/symbolic_pofk
  • paper_authors: Deaglan J. Bartlett, Lukas Kammerer, Gabriel Kronberger, Harry Desmond, Pedro G. Ferreira, Benjamin D. Wandelt, Bogdan Burlacu, David Alonso, Matteo Zennaro
  • for: Emulating the linear matter power spectrum $P(k)$ as a function of cosmological parameters, since computing it is prohibitively slow in cosmological analyses and existing analytic approximations are insufficiently accurate.
  • methods: An efficient genetic-programming-based symbolic regression framework explores the space of mathematical expressions approximating the power spectrum and $\sigma_8$, learning the ratio between an existing low-accuracy fitting function for $P(k)$ and the Boltzmann-equation solution so as to retain the physics behind the earlier approximation (a generic stand-in is sketched after the abstract).
  • results: An analytic approximation to the linear power spectrum with a root mean squared fractional error of 0.2% between $k = 9\times10^{-3} - 9 \, h{\rm \, Mpc^{-1}}$ and across a wide range of cosmological parameters, with physical interpretations for various terms; also a simple analytic approximation for $\sigma_8$ with an error of just 0.4%, easily invertible to give $A_{\rm s}$.
    Abstract Computing the matter power spectrum, $P(k)$, as a function of cosmological parameters can be prohibitively slow in cosmological analyses, hence emulating this calculation is desirable. Previous analytic approximations are insufficiently accurate for modern applications, so black-box, uninterpretable emulators are often used. We utilise an efficient genetic programming based symbolic regression framework to explore the space of potential mathematical expressions which can approximate the power spectrum and $\sigma_8$. We learn the ratio between an existing low-accuracy fitting function for $P(k)$ and that obtained by solving the Boltzmann equations and thus still incorporate the physics which motivated this earlier approximation. We obtain an analytic approximation to the linear power spectrum with a root mean squared fractional error of 0.2% between $k = 9\times10^{-3} - 9 \, h{\rm \, Mpc^{-1}}$ and across a wide range of cosmological parameters, and we provide physical interpretations for various terms in the expression. We also provide a simple analytic approximation for $\sigma_8$ with a similar accuracy, with a root mean squared fractional error of just 0.4% when evaluated across the same range of cosmologies. This function is easily invertible to obtain $A_{\rm s}$ as a function of $\sigma_8$ and the other cosmological parameters, if preferred. It is possible to obtain symbolic approximations to a seemingly complex function at a precision required for current and future cosmological analyses without resorting to deep-learning techniques, thus avoiding their black-box nature and large number of parameters. Our emulator will be usable long after the codes on which numerical approximations are built become outdated.
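
The authors' expressions live in the linked symbolic_pofk repository; as a generic, hedged illustration of genetic-programming-based symbolic regression (gplearn is a stand-in, not the authors' framework, and the target below is a toy surrogate rather than a real P(k)):

```python
import numpy as np
from gplearn.genetic import SymbolicRegressor

rng = np.random.default_rng(0)
# Stand-in target: a smooth function of log k and one "cosmological" parameter.
logk = rng.uniform(-3, 1, size=(500, 1))
omega = rng.uniform(0.2, 0.4, size=(500, 1))
X = np.hstack([logk, omega])
y = np.sin(X[:, 0]) + 2.0 * X[:, 1]          # toy surrogate, not a real P(k)

est = SymbolicRegressor(population_size=2000, generations=20,
                        function_set=("add", "sub", "mul", "div", "sin", "log"),
                        parsimony_coefficient=0.001, random_state=0)
est.fit(X, y)
print(est._program)   # the discovered closed-form expression
```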

Multi-Agent Reinforcement Learning for Power Control in Wireless Networks via Adaptive Graphs

  • paper_url: http://arxiv.org/abs/2311.15858
  • repo_url: None
  • paper_authors: Lorenzo Mario Amorosa, Marco Skocaj, Roberto Verdone, Deniz Gündüz
  • for: Meeting the ever-increasing demand for high-quality and heterogeneous wireless communication services through dynamic optimization strategies in wireless networks.
  • methods: Multi-agent deep reinforcement learning (MADRL) for power control, with graphs as communication-inducing structures among distributed agents and graph neural networks (GNNs) as neural architectures for policy parameterization, introducing a relational inductive bias in the collective decision-making process.
  • results: Modeling the dynamic interactions among sets of neighboring agents through a graph-induced framework for integrated communication and learning mitigates convergence issues; simulations verify superior generalization to larger networks and to networks with different user categories.
    Abstract The ever-increasing demand for high-quality and heterogeneous wireless communication services has driven extensive research on dynamic optimization strategies in wireless networks. Among several possible approaches, multi-agent deep reinforcement learning (MADRL) has emerged as a promising method to address a wide range of complex optimization problems like power control. However, the seamless application of MADRL to a variety of network optimization problems faces several challenges related to convergence. In this paper, we present the use of graphs as communication-inducing structures among distributed agents as an effective means to mitigate these challenges. Specifically, we harness graph neural networks (GNNs) as neural architectures for policy parameterization to introduce a relational inductive bias in the collective decision-making process. Most importantly, we focus on modeling the dynamic interactions among sets of neighboring agents through the introduction of innovative methods for defining a graph-induced framework for integrated communication and learning. Finally, the superior generalization capabilities of the proposed methodology to larger networks and to networks with different user categories is verified through simulations.

A systematic study comparing hyperparameter optimization engines on tabular data

  • paper_url: http://arxiv.org/abs/2311.15854
  • repo_url: None
  • paper_authors: Balazs Kegl
  • for: An independent comparison of all hyperparameter optimization (hyperopt) engines available in the Ray Tune library.
  • methods: Two ways of normalizing and aggregating statistics across data sets and models: one rank-based, the other sandwiching the score between the random search score and the full grid search score (both are sketched after the abstract).
  • results: Most engines beat random search, but only three (HEBO, AX, and BlendSearch) clearly stand out; some engines appear to specialize in hyperopting certain learning algorithms, which makes using hyperopt in comparison studies tricky, since the choice of technique may favor some of the models compared.
    Abstract We run an independent comparison of all hyperparameter optimization (hyperopt) engines available in the Ray Tune library. We introduce two ways to normalize and aggregate statistics across data sets and models, one rank-based, and another one sandwiching the score between the random search score and the full grid search score. This affords us i) to rank the hyperopt engines, ii) to make generalized and statistically significant statements on how much they improve over random search, and iii) to make recommendations on which engine should be used to hyperopt a given learning algorithm. We find that most engines beat random search, but that only three of them (HEBO, AX, and BlendSearch) clearly stand out. We also found that some engines seem to specialize in hyperopting certain learning algorithms, which makes it tricky to use hyperopt in comparison studies, since the choice of the hyperopt technique may favor some of the models in the comparison.
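
A minimal sketch of the two aggregation statistics as we read them from the abstract (function and variable names are ours): the "sandwich" normalization places each engine's score between random search and full grid search, and the rank-based alternative averages each engine's per-task rank.

```python
import numpy as np

def sandwich_score(engine, random_search, full_grid):
    """Map a raw score onto [0, 1]: 0 = random search, 1 = full grid search.
    Values above 1 mean the engine beat the full grid on this budget."""
    return (engine - random_search) / (full_grid - random_search)

# Aggregate across datasets/models either by this normalized score...
scores = np.array([sandwich_score(0.91, 0.85, 0.93),
                   sandwich_score(0.74, 0.70, 0.80)])
print(scores.mean())

# ...or by the rank-based alternative: average each engine's rank per task.
per_task_scores = np.array([[0.91, 0.89, 0.86],    # rows: tasks, cols: engines
                            [0.74, 0.77, 0.71]])
ranks = per_task_scores.argsort(axis=1).argsort(axis=1)  # 0 = worst
print(ranks.mean(axis=0))
```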

Temporal Action Localization for Inertial-based Human Activity Recognition

  • paper_url: http://arxiv.org/abs/2311.15831
  • repo_url: None
  • paper_authors: Marius Bock, Michael Moeller, Kristof Van Laerhoven
  • for: Transferring temporal action localization (TAL) models from video to inertial-based human activity recognition (HAR) with wearable sensors.
  • methods: State-of-the-art TAL models are applied to raw inertial data, following a segment-based prediction approach that localizes activity occurrences from start to end instead of classifying fixed windows.
  • results: TAL models outperform popular inertial models on 4 out of 6 wearable activity recognition benchmark datasets, with improvements of as much as 25% in F1-score; introducing the TAL community's mean Average Precision metric to inertial-based HAR shows that TAL models produce more coherent segments and a higher NULL-class accuracy across all datasets.
    Abstract A persistent trend in Deep Learning has been the applicability of machine learning concepts to other areas than originally introduced for. As of today, state-of-the-art activity recognition from wearable sensors relies on classifiers being trained on fixed windows of data. Contrarily, video-based Human Activity Recognition has followed a segment-based prediction approach, localizing activity occurrences from start to end. This paper is the first to systematically demonstrate the applicability of state-of-the-art TAL models for wearable Human Activity Recognition (HAR) using raw inertial data as input. Our results show that state-of-the-art TAL models are able to outperform popular inertial models on 4 out of 6 wearable activity recognition benchmark datasets, with improvements ranging as much as 25% in F1-score. Introducing the TAL community's most popular metric to inertial-based HAR, namely mean Average Precision, our analysis shows that TAL models are able to produce more coherent segments along with an overall higher NULL-class accuracy across all datasets. Being the first to provide such an analysis, the TAL community offers an interesting new perspective to inertial-based HAR with yet to be explored design choices and training concepts, which could be of significant value for the inertial-based HAR community.

Exploring Artificial Intelligence Methods for Energy Prediction in Healthcare Facilities: An In-Depth Extended Systematic Review

  • paper_url: http://arxiv.org/abs/2311.15807
  • repo_url: None
  • paper_authors: Marjan FatehiJananloo, Helen Stopps, J. J. McArthur
  • for: Reviewing machine learning and artificial intelligence techniques for predicting energy consumption in hospital buildings.
  • methods: A comprehensive literature review following the PRISMA framework, narrowing 1884 identified publications down to the 17 addressing this domain, to establish the state of the art and identify gaps for future research.
  • results: Occupancy and meteorological data emerge as significant predictors, but many studies do not examine the implications of their data choices, and gaps remain around time dynamics, operational status, and preprocessing methods; machine learning, especially deep models such as ANNs, shows potential but brings interpretability and computational challenges.
    Abstract Hospitals, due to their complexity and unique requirements, play a pivotal role in global energy consumption patterns. This study conducted a comprehensive literature review, utilizing the PRISMA framework, of articles that employed machine learning and artificial intelligence techniques for predicting energy consumption in hospital buildings. Of the 1884 publications identified, 17 were found to address this specific domain and have been thoroughly reviewed to establish the state-of-the-art and identify gaps where future research is needed. This review revealed a diverse range of data inputs influencing energy prediction, with occupancy and meteorological data emerging as significant predictors. However, many studies failed to delve deep into the implications of their data choices, and gaps were evident regarding the understanding of time dynamics, operational status, and preprocessing methods. Machine learning, especially deep learning models like ANNs, have shown potential in this domain, yet they come with challenges, including interpretability and computational demands. The findings underscore the immense potential of AI in optimizing hospital energy consumption but also highlight the need for more comprehensive and granular research. Key areas for future research include the optimization of ANN approaches, new optimization and data integration techniques, the integration of real-time data into Intelligent Energy Management Systems, and increasing focus on long-term energy forecasting.

Rethinking Privacy in Machine Learning Pipelines from an Information Flow Control Perspective

  • paper_url: http://arxiv.org/abs/2311.15792
  • repo_url: None
  • paper_authors: Lukas Wutschitz, Boris Köpf, Andrew Paverd, Saravan Rajmohan, Ahmed Salem, Shruti Tople, Santiago Zanella-Béguelin, Menglin Xia, Victor Rühle
  • for: Describing machine learning systems from an information flow control perspective, so that metadata such as access control policies can be leveraged to define clear-cut privacy and confidentiality guarantees with interpretable information flows, including fine-grained access control when sensitive information is shared across multiple participants.
  • methods: Two approaches to user-level non-interference are contrasted: (1) fine-tuning per-user models, and (2) retrieval-augmented models that access user-specific datasets at inference time, compared against a trivially non-interfering zero-shot baseline using a public model and a baseline that fine-tunes this model on the whole corpus.
  • results: Evaluated on two datasets of scientific articles, retrieval-augmented architectures deliver the best utility, scalability, and flexibility while satisfying strict non-interference guarantees.
    Abstract Modern machine learning systems use models trained on ever-growing corpora. Typically, metadata such as ownership, access control, or licensing information is ignored during training. Instead, to mitigate privacy risks, we rely on generic techniques such as dataset sanitization and differentially private model training, with inherent privacy/utility trade-offs that hurt model performance. Moreover, these techniques have limitations in scenarios where sensitive information is shared across multiple participants and fine-grained access control is required. By ignoring metadata, we therefore miss an opportunity to better address security, privacy, and confidentiality challenges. In this paper, we take an information flow control perspective to describe machine learning systems, which allows us to leverage metadata such as access control policies and define clear-cut privacy and confidentiality guarantees with interpretable information flows. Under this perspective, we contrast two different approaches to achieve user-level non-interference: 1) fine-tuning per-user models, and 2) retrieval augmented models that access user-specific datasets at inference time. We compare these two approaches to a trivially non-interfering zero-shot baseline using a public model and to a baseline that fine-tunes this model on the whole corpus. We evaluate trained models on two datasets of scientific articles and demonstrate that retrieval augmented architectures deliver the best utility, scalability, and flexibility while satisfying strict non-interference guarantees.

Attend Who is Weak: Enhancing Graph Condensation via Cross-Free Adversarial Training

  • paper_url: http://arxiv.org/abs/2311.15772
  • repo_url: None
  • paper_authors: Xinglin Li, Kun Wang, Hanhui Deng, Yuxuan Liang, Di Wu
  • for: Graph condensation: compressing a large, complex graph into a concise, synthetic representation that preserves the most essential and discriminative information of structure and features.
  • methods: A Shock Absorber (a type of perturbation) selectively perturbs the underrepresented or insufficiently informative parts of the graph in an adversarial training loop: gradients of pre-selected GNNs trained on the synthetic graph and on the original graph are forcibly matched at regularly spaced intervals, while before each update the Shock Absorber acts as a gradient attacker to maximize the distance between the synthetic dataset and the original graph; the Shock Absorber and the synthesized graph share the backward pass in a free-training manner (a toy gradient-matching loop is sketched after the abstract).
  • results: Across 8 datasets (3 graph and 5 node classification datasets), gains of roughly 1.13% to 5.03% over SOTA models on Cora, Citeseer, and Ogbn-Arxiv, with only about 0.2% to 2.2% additional time overhead and nearly 4-fold better time efficiency than general adversarial training.
    Abstract In this paper, we study the graph condensation problem by compressing the large, complex graph into a concise, synthetic representation that preserves the most essential and discriminative information of structure and features. We seminally propose the concept of Shock Absorber (a type of perturbation) that enhances the robustness and stability of the original graphs against changes in an adversarial training fashion. Concretely, (I) we forcibly match the gradients between pre-selected graph neural networks (GNNs) trained on a synthetic, simplified graph and the original training graph at regularly spaced intervals. (II) Before each update of the synthetic graph, a Shock Absorber serves as a gradient attacker to maximize the distance between the synthetic dataset and the original graph by selectively perturbing the parts that are underrepresented or insufficiently informative. We iteratively repeat the above two processes (I and II) in an adversarial training fashion to maintain the highly-informative context without losing correlation with the original dataset. More importantly, our shock absorber and the synthesized graph parallelly share the backward process in a free training manner. Compared to the original adversarial training, it introduces almost no additional time overhead. We validate our framework across 8 datasets (3 graph and 5 node classification datasets) and achieve prominent results: for example, on Cora, Citeseer and Ogbn-Arxiv, we can gain nearly 1.13% to 5.03% improvements compare with SOTA models. Moreover, our algorithm adds only about 0.2% to 2.2% additional time overhead over Flicker, Citeseer and Ogbn-Arxiv. Compared to the general adversarial training, our approach improves time efficiency by nearly 4-fold.
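
A toy, features-only reading of the condensation loop (a linear model stands in for the GNN, and the inner perturbation is a crude stand-in for the Shock Absorber): the synthetic data is adversarially perturbed to increase the gradient-matching distance before each outer update decreases it.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 3)
loss_fn = nn.CrossEntropyLoss()

x_real = torch.randn(256, 16); y_real = torch.randint(0, 3, (256,))
x_syn = torch.randn(30, 16, requires_grad=True)     # condensed features
y_syn = torch.randint(0, 3, (30,))

def flat_grads(x, y):
    """Model gradients as one flat vector (differentiable w.r.t. x)."""
    g = torch.autograd.grad(loss_fn(model(x), y), model.parameters(),
                            create_graph=True)
    return torch.cat([gi.flatten() for gi in g])

opt = torch.optim.Adam([x_syn], lr=0.01)
for step in range(100):
    # Inner, Shock-Absorber-style step: perturb the synthetic data in the
    # direction that *increases* the gradient-matching distance.
    dist = torch.norm(flat_grads(x_real, y_real) - flat_grads(x_syn, y_syn))
    delta = 0.01 * torch.autograd.grad(dist, x_syn)[0].sign()
    # Outer step: update the synthetic data to *decrease* the distance.
    opt.zero_grad()
    dist = torch.norm(flat_grads(x_real, y_real) -
                      flat_grads(x_syn + delta.detach(), y_syn))
    dist.backward()
    opt.step()
```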

Learning Multi-Frequency Partial Correlation Graphs

  • paper_url: http://arxiv.org/abs/2311.15756
  • repo_url: https://github.com/officiallydac/bspcg
  • paper_authors: Gabriele D’Acunto, Paolo Di Lorenzo, Francesco Bonchi, Stefania Sardellitti, Sergio Barbarossa
  • for: Learning dependencies between time series while discriminating across distinct frequency bands, which existing partial-correlation methods cannot do.
  • methods: A block-sparse, frequency-dependent partial correlation graph whose layers correspond to different frequency bands and in which partial correlations may occur over just a few layers; two nonconvex learning problems are solved, one in closed form when prior knowledge about the number of partial correlations is available, the other iteratively via successive convex approximation for the general case (the per-frequency quantity is sketched after the abstract).
  • results: On synthetic data the proposed methods outperform the current state of the art; analysis of financial time series confirms that partial correlations exist only within a few frequency bands, yielding insights that would go undetected without discriminating along the frequency domain.
    Abstract Despite the large research effort devoted to learning dependencies between time series, the state of the art still faces a major limitation: existing methods learn partial correlations but fail to discriminate across distinct frequency bands. Motivated by many applications in which this differentiation is pivotal, we overcome this limitation by learning a block-sparse, frequency-dependent, partial correlation graph, in which layers correspond to different frequency bands, and partial correlations can occur over just a few layers. To this aim, we formulate and solve two nonconvex learning problems: the first has a closed-form solution and is suitable when there is prior knowledge about the number of partial correlations; the second hinges on an iterative solution based on successive convex approximation, and is effective for the general case where no prior knowledge is available. Numerical results on synthetic data show that the proposed methods outperform the current state of the art. Finally, the analysis of financial time series confirms that partial correlations exist only within a few frequency bands, underscoring how our methods enable the gaining of valuable insights that would be undetected without discriminating along the frequency domain.
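
A sketch of the per-frequency quantity such a graph encodes: textbook partial correlations obtained from the inverse cross-spectral density matrix at each frequency bin (the paper learns a block-sparse graph instead of using this plug-in estimator).

```python
import numpy as np
from scipy.signal import csd

rng = np.random.default_rng(0)
n_series, n_samples, fs = 3, 4096, 1.0
X = rng.normal(size=(n_series, n_samples))
X[1] += np.convolve(X[0], np.ones(5) / 5, mode="same")  # induce dependence

# Cross-spectral density matrix S(f) for every frequency bin.
freqs, _ = csd(X[0], X[0], fs=fs, nperseg=256)
S = np.zeros((len(freqs), n_series, n_series), dtype=complex)
for i in range(n_series):
    for j in range(n_series):
        _, S[:, i, j] = csd(X[i], X[j], fs=fs, nperseg=256)

# Partial correlation at each frequency from the inverse spectral matrix.
P = np.zeros_like(S)
for k in range(len(freqs)):
    K = np.linalg.inv(S[k])
    d = np.sqrt(np.real(np.diag(K)))
    P[k] = -K / np.outer(d, d)

print(np.abs(P[:, 0, 1]).mean())  # band-averaged partial coherence, in [0, 1]
```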

Tabular Two-Dimensional Correlation Analysis for Multifaceted Characterization Data

  • paper_url: http://arxiv.org/abs/2311.15703
  • repo_url: None
  • paper_authors: Shun Muroga, Satoshi Yamazaki, Koji Michishio, Hideaki Nakajima, Takahiro Morimoto, Nagayasu Oshima, Kazufumi Kobashi, Toshiya Okazaki
  • for: Extracting features from multifaceted characterization data, essential for understanding material properties, via tabular two-dimensional correlation analysis.
  • methods: Similarities and phase lags in structural parameter changes are visualized as heatmaps, combining hierarchical clustering and asynchronous correlations (a standard 2D-COS sketch follows the abstract).
  • results: Applied to carbon nanotube (CNT) films annealed at various temperatures, the method reveals the complexity of their hierarchical structures (voids, bundles, amorphous carbon) and disentangles the sequence of structural changes among 11 structural parameters derived from 8 characterization methods, illuminating phenomena such as the removal of amorphous carbon and graphitization; the approach works even with limited data.
    Abstract We propose tabular two-dimensional correlation analysis for extracting features from multifaceted characterization data, essential for understanding material properties. This method visualizes similarities and phase lags in structural parameter changes through heatmaps, combining hierarchical clustering and asynchronous correlations. We applied the proposed method to datasets of carbon nanotube (CNTs) films annealed at various temperatures and revealed the complexity of their hierarchical structures, which include elements like voids, bundles, and amorphous carbon. Our analysis addresses the challenge of attempting to understand the sequence of structural changes, especially in multifaceted characterization data where 11 structural parameters derived from 8 characterization methods interact with complex behavior. The results show how phase lags (asynchronous changes from stimuli) and parameter similarities can illuminate the sequence of structural changes in materials, providing insights into phenomena like the removal of amorphous carbon and graphitization in annealed CNTs. This approach is beneficial even with limited data and holds promise for a wide range of material analyses, demonstrating its potential in elucidating complex material behaviors and properties.
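
A numpy sketch of standard generalized two-dimensional correlation analysis via the Hilbert-Noda transformation matrix, the machinery the tabular method adapts; the toy array below mimics 8 annealing temperatures by 11 structural parameters.

```python
import numpy as np

def two_d_correlation(Y: np.ndarray):
    """Y: (m perturbations, n parameters).
    Returns synchronous and asynchronous 2D correlation matrices."""
    m = Y.shape[0]
    Yc = Y - Y.mean(axis=0)
    sync = Yc.T @ Yc / (m - 1)
    # Hilbert-Noda transformation matrix.
    i, j = np.indices((m, m))
    with np.errstate(divide="ignore"):
        N = np.where(i == j, 0.0, 1.0 / (np.pi * (j - i)))
    asyn = Yc.T @ N @ Yc / (m - 1)
    return sync, asyn

# 8 annealing temperatures x 11 structural parameters (toy values).
rng = np.random.default_rng(0)
Y = rng.normal(size=(8, 11)).cumsum(axis=0)
sync, asyn = two_d_correlation(Y)
# Noda's rules: sign(sync * asyn) indicates which parameter changes first.
```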

Automated discovery of trade-off between utility, privacy and fairness in machine learning models

  • paper_url: http://arxiv.org/abs/2311.15691
  • repo_url: None
  • paper_authors: Bogdan Ficiu, Neil D. Lawrence, Andrei Paleyes
  • for: ensuring that machine learning models used in decision making and policy operations make fair decisions, protect user privacy, and comply with government regulations.
  • methods: casts the trade-off between fairness, privacy, and performance of ML models as a multi-objective optimization problem and proposes PFairDP, a pipeline that uses Bayesian optimization to discover Pareto-optimal points of these models.
  • results: experiments show that PFairDP successfully finds balanced points between fairness, privacy, and performance across multiple models and datasets, and can reproduce results previously achieved through manual constraint setting.
    Abstract Machine learning models are deployed as a central component in decision making and policy operations with direct impact on individuals' lives. In order to act ethically and comply with government regulations, these models need to make fair decisions and protect the users' privacy. However, such requirements can come with decrease in models' performance compared to their potentially biased, privacy-leaking counterparts. Thus the trade-off between fairness, privacy and performance of ML models emerges, and practitioners need a way of quantifying this trade-off to enable deployment decisions. In this work we interpret this trade-off as a multi-objective optimization problem, and propose PFairDP, a pipeline that uses Bayesian optimization for discovery of Pareto-optimal points between fairness, privacy and utility of ML models. We show how PFairDP can be used to replicate known results that were achieved through manual constraint setting process. We further demonstrate effectiveness of PFairDP with experiments on multiple models and datasets.
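
PFairDP's internals are not reproduced here, but any such pipeline ends by extracting the non-dominated trials. A minimal sketch of Pareto-front extraction over (utility, fairness, privacy) scores, assuming every objective is oriented so that higher is better:

```python
# Sketch: extract Pareto-optimal configurations from evaluated trials.
# Assumes each objective is oriented so that higher is better (e.g.,
# accuracy, 1 - fairness violation, privacy budget headroom).
# Illustrative only; PFairDP couples this with Bayesian optimization.
import numpy as np

def pareto_mask(scores):
    """scores: (n_trials, n_objectives). True where no other row dominates."""
    n = scores.shape[0]
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        others = np.delete(scores, i, axis=0)
        dominated = np.any(np.all(others >= scores[i], axis=1) &
                           np.any(others > scores[i], axis=1))
        mask[i] = not dominated
    return mask

rng = np.random.default_rng(2)
trials = rng.random((50, 3))         # (utility, fairness, privacy) per trial
front = trials[pareto_mask(trials)]
print(f"{len(front)} Pareto-optimal trials out of {len(trials)}")
```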

The Battleship Approach to the Low Resource Entity Matching Problem

  • paper_url: http://arxiv.org/abs/2311.15685
  • repo_url: https://github.com/bargenossar/the-battleship-approach-to-al-of-em-problem
  • paper_authors: Bar Genossar, Avigdor Gal, Roee Shraga
  • for: solving the low-resource entity matching problem, i.e., matching entities when only a limited amount of labeled data is available.
  • methods: proposes a new active learning approach based on distributed representations of tuple pairs, with a selection mechanism that exploits unique properties of entity matching.
  • results: outperforms state-of-the-art active learning solutions for low-resource entity matching, and despite using fewer samples can be as successful as fully trained state-of-the-art algorithms.
    Abstract Entity matching, a core data integration problem, is the task of deciding whether two data tuples refer to the same real-world entity. Recent advances in deep learning methods, using pre-trained language models, were proposed for resolving entity matching. Although demonstrating unprecedented results, these solutions suffer from a major drawback as they require large amounts of labeled data for training, and, as such, are inadequate to be applied to low resource entity matching problems. To overcome the challenge of obtaining sufficient labeled data we offer a new active learning approach, focusing on a selection mechanism that exploits unique properties of entity matching. We argue that a distributed representation of a tuple pair indicates its informativeness when considered among other pairs. This is used consequently in our approach that iteratively utilizes space-aware considerations. Bringing it all together, we treat the low resource entity matching problem as a Battleship game, hunting indicative samples, focusing on positive ones, through awareness of the latent space along with careful planning of next sampling iterations. An extensive experimental analysis shows that the proposed algorithm outperforms state-of-the-art active learning solutions to low resource entity matching, and although using less samples, can be as successful as state-of-the-art fully trained known algorithms.

Information theoretic study of the neural geometry induced by category learning

  • paper_url: http://arxiv.org/abs/2311.15682
  • repo_url: None
  • paper_authors: Laurent Bonnasse-Gahot, Jean-Pierre Nadal
  • for: investigates the importance of categorization in biological and artificial neural networks, using an information-theoretic approach to assess the efficiency of the representations induced by category learning.
  • methods: decomposes the relevant Bayesian cost into a coding part and a decoding part; minimizing the coding cost amounts to maximizing the mutual information between neural activities and categories, and this mutual information is shown analytically to split into two terms: finding an appropriate representation space, and building a representation with the appropriate metric on that space, based on the neural Fisher information.
  • results: a key consequence is that category learning induces an expansion of neural space near decision boundaries; numerical illustrations show that the Fisher information of the coding neural population aligns with the boundaries between categories.
    Abstract Categorization is an important topic both for biological and artificial neural networks. Here, we take an information theoretic approach to assess the efficiency of the representations induced by category learning. We show that one can decompose the relevant Bayesian cost into two components, one for the coding part and one for the decoding part. Minimizing the coding cost implies maximizing the mutual information between the set of categories and the neural activities. We analytically show that this mutual information can be written as the sum of two terms that can be interpreted as (i) finding an appropriate representation space, and, (ii) building a representation with the appropriate metrics, based on the neural Fisher information on this space. One main consequence is that category learning induces an expansion of neural space near decision boundaries. Finally, we provide numerical illustrations that show how Fisher information of the coding neural population aligns with the boundaries between categories.
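
A small numerical illustration of the paper's main consequence: for a Poisson population with Gaussian tuning curves whose centers are densified near a category boundary, the Fisher information J(x) = sum_i f_i'(x)^2 / f_i(x) peaks at the boundary. The population layout below is invented for illustration.

```python
# Sketch: Fisher information of a Poisson population with Gaussian tuning
# curves. Centers are densified near a category boundary at x = 0, and the
# Fisher information (a local metric on neural space) peaks there.
import numpy as np

def fisher_information(x, centers, width=0.3, gain=10.0):
    """J(x) = sum_i f_i'(x)^2 / f_i(x) for Poisson neurons."""
    f = gain * np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * width ** 2))
    fprime = f * (centers[None, :] - x[:, None]) / width ** 2
    return (fprime ** 2 / (f + 1e-12)).sum(axis=1)

# Cubing equally spaced points clusters the tuning-curve centers near 0
u = np.linspace(-1.2, 1.2, 41)
centers = u ** 3
x = np.linspace(-1.5, 1.5, 7)
J = fisher_information(x, centers)
print(np.round(J / J.max(), 2))  # normalized J(x): largest near x = 0
```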

Accelerating Hierarchical Associative Memory: A Deep Equilibrium Approach

  • paper_url: http://arxiv.org/abs/2311.15673
  • repo_url: https://github.com/cgoemaere/hamdeq
  • paper_authors: Cédric Goemaere, Johannes Deleu, Thomas Demeester
  • for: improving the simulation efficiency of Hierarchical Associative Memory models on digital hardware, to facilitate future research and applications at scale.
  • methods: proposes two strategies to accelerate memory retrieval in these models: casting them as Deep Equilibrium Models, and alternating optimization of the even and odd layers.
  • results: combined, the two techniques allow much faster energy minimization, as shown in proof-of-concept experimental results.
    Abstract Hierarchical Associative Memory models have recently been proposed as a versatile extension of continuous Hopfield networks. In order to facilitate future research on such models, especially at scale, we focus on increasing their simulation efficiency on digital hardware. In particular, we propose two strategies to speed up memory retrieval in these models, which corresponds to their use at inference, but is equally important during training. First, we show how they can be cast as Deep Equilibrium Models, which allows using faster and more stable solvers. Second, inspired by earlier work, we show that alternating optimization of the even and odd layers accelerates memory retrieval by a factor close to two. Combined, these two techniques allow for a much faster energy minimization, as shown in our proof-of-concept experimental results. The code is available at https://github.com/cgoemaere/hamdeq
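
The DEQ view replaces unrolled retrieval dynamics with a direct fixed-point solve z* = f(z*, x). The sketch below uses a damped fixed-point iteration on a toy associative update; it is not the authors' hierarchical energy (see their repo for the real model).

```python
# Sketch: treating retrieval as a fixed-point problem z* = f(z*, x) and
# solving it with a damped iteration, as in Deep Equilibrium Models.
# The update f is a toy softmax-attention map, not the authors' model.
import numpy as np

rng = np.random.default_rng(3)
patterns = rng.choice([-1.0, 1.0], size=(5, 64))   # stored memories

def f(z, x, beta=2.0):
    """Toy attention update over stored patterns (cf. modern Hopfield nets)."""
    logits = beta * patterns @ z
    attn = np.exp(logits - logits.max())
    attn /= attn.sum()
    return np.tanh(patterns.T @ attn + 0.1 * x)

def deq_solve(x, damping=0.5, tol=1e-6, max_iter=200):
    z = np.zeros_like(x)
    for k in range(max_iter):
        z_new = (1 - damping) * z + damping * f(z, x)
        if np.linalg.norm(z_new - z) < tol:
            return z_new, k
        z = z_new
    return z, max_iter

query = patterns[0] + 0.8 * rng.standard_normal(64)   # corrupted memory
z_star, iters = deq_solve(query)
print(iters, np.sign(z_star) @ patterns[0] / 64)       # overlap with target
```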

Universal Event Detection in Time Series

  • paper_url: http://arxiv.org/abs/2311.15654
  • repo_url: https://github.com/menouarazib/eventdetector
  • paper_authors: Menouar Azib, Benjamin Renard, Philippe Garnier, Vincent Génot, Nicolas André
  • for: detecting events in multivariate time series data.
  • methods: a supervised deep learning method using regression instead of binary classification; this simplification avoids point-wise labels across the entire dataset, relying only on ground-truth events defined as time points or intervals.
  • results: proves the method is universal, able to detect any type of event with arbitrary precision under mild continuity assumptions on the time series; such events include change points, frauds, anomalies, physical occurrences, and more, with the theory supported by the universal approximation theorem for feed-forward neural networks (FFN) and confirmed experimentally.
    Abstract In our previously published work, we introduced a supervised deep learning method for event detection in multivariate time series data, employing regression instead of binary classification. This simplification avoids the need for point-wise labels throughout the entire dataset, relying solely on ground truth events defined as time points or intervals. In this paper, we establish mathematically that our method is universal, and capable of detecting any type of event with arbitrary precision under mild continuity assumptions on the time series. These events may encompass change points, frauds, anomalies, physical occurrences, and more. We substantiate our theoretical results using the universal approximation theorem for feed-forward neural networks (FFN). Additionally, we provide empirical validations that confirm our claims, demonstrating that our method, with a limited number of parameters, outperforms other deep learning approaches, particularly for rare events and imbalanced datasets from different domains.
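
The central labeling trick, turning sparse event time stamps into a smooth regression target over sliding windows so that plain MSE training suffices, can be sketched as follows. Window size and target shape are illustrative choices here, not the package defaults.

```python
# Sketch: turning event time stamps into a smooth regression target for
# sliding windows, so event detection becomes regression rather than
# point-wise binary classification.
import numpy as np

def regression_targets(n_steps, event_times, width=5, window=20, stride=1):
    """Per-window target = peak of a Gaussian bump placed on each event."""
    bump = np.zeros(n_steps)
    t = np.arange(n_steps)
    for e in event_times:
        bump = np.maximum(bump, np.exp(-0.5 * ((t - e) / width) ** 2))
    starts = np.arange(0, n_steps - window + 1, stride)
    idx = np.stack([np.arange(s, s + window) for s in starts])
    y = bump[idx].max(axis=1)            # how "eventful" each window is
    return starts, y

starts, y = regression_targets(500, event_times=[120, 340], width=5)
# At inference, the predicted y is thresholded and peaks give event locations:
peaks = starts[y > 0.9]
print(peaks.min(), peaks.max())          # windows covering the two events
```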

Bandits Meet Mechanism Design to Combat Clickbait in Online Recommendation

  • paper_url: http://arxiv.org/abs/2311.15647
  • repo_url: None
  • paper_authors: Thomas Kleine Buening, Aadirupa Saha, Christos Dimitrakakis, Haifeng Xu
  • for: studies a strategic variant of the multi-armed bandit problem, the strategic click-bandit, motivated by online recommendation, where each arm's click-rate is chosen by the arm itself to maximize the number of times it gets clicked.
  • methods: proposes an incentive-aware learning algorithm, UCB-S, that simultaneously (a) incentivizes desirable arm behavior under uncertainty and (b) learns the unknown parameters to minimize regret.
  • results: characterizes all approximate Nash equilibria among arms under UCB-S, showing low regret in every such equilibrium, while incentive-unaware algorithms generally fail to achieve low regret.
    Abstract We study a strategic variant of the multi-armed bandit problem, which we coin the strategic click-bandit. This model is motivated by applications in online recommendation where the choice of recommended items depends on both the click-through rates and the post-click rewards. Like in classical bandits, rewards follow a fixed unknown distribution. However, we assume that the click-rate of each arm is chosen strategically by the arm (e.g., a host on Airbnb) in order to maximize the number of times it gets clicked. The algorithm designer does not know the post-click rewards nor the arms' actions (i.e., strategically chosen click-rates) in advance, and must learn both values over time. To solve this problem, we design an incentive-aware learning algorithm, UCB-S, which achieves two goals simultaneously: (a) incentivizing desirable arm behavior under uncertainty; (b) minimizing regret by learning unknown parameters. We characterize all approximate Nash equilibria among arms under UCB-S and show a $\tilde{\mathcal{O}}(\sqrt{KT})$ regret bound uniformly in every equilibrium. We also show that incentive-unaware algorithms generally fail to achieve low regret in the strategic click-bandit. Finally, we support our theoretical results by simulations of strategic arm behavior which confirm the effectiveness and robustness of our proposed incentive design.
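
For orientation, the sketch below simulates only the learning component: a plain UCB1 learner in the click-bandit setting, where the realized reward is click times post-click reward and each arm's click-rate is fixed by its (strategic) choice. UCB-S additionally screens out arms that deviate from the incentivized behavior; that mechanism is not reproduced here.

```python
# Sketch: a plain UCB1 learner in the click-bandit setting, where the
# realized reward is click * post-click reward. The arms' click-rates
# are their strategic choices; the means below are invented.
import numpy as np

rng = np.random.default_rng(4)
click_rates = np.array([0.9, 0.6, 0.3])    # chosen strategically by arms
post_click = np.array([0.2, 0.5, 0.9])     # unknown reward means

def pull(a):
    clicked = rng.random() < click_rates[a]
    return clicked * (rng.random() < post_click[a])

T, K = 20000, 3
counts, sums = np.zeros(K), np.zeros(K)
for t in range(T):
    if t < K:
        a = t                               # play each arm once
    else:
        ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
        a = int(np.argmax(ucb))
    counts[a] += 1
    sums[a] += pull(a)
print(np.round(counts / T, 3))              # share of pulls per arm
```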

Leveraging Out-of-Domain Data for Domain-Specific Prompt Tuning in Multi-Modal Fake News Detection

  • paper_url: http://arxiv.org/abs/2311.16496
  • repo_url: None
  • paper_authors: Debarshi Brahma, Amartya Bhattacharya, Suraj Nagaje Mahadev, Anmol Asati, Vikas Verma, Soma Biswas
  • for: combating the spread of fake news that uses out-of-context images in today's information overload, even when only limited annotated data is available.
  • methods: proposes DPOD (Domain-specific Prompt-tuning using Out-of-Domain data), which modifies the CLIP vision-language model to achieve label-aware alignment of images and their text captions, plus a domain-specific prompt-learning technique that leverages training samples from all available domains according to their usefulness to the desired domain.
  • results: achieves state-of-the-art performance on the large-scale NewsClippings benchmark dataset, significantly surpassing existing methods.
    Abstract The spread of fake news using out-of-context images has become widespread and is a challenging task in this era of information overload. Since annotating huge amounts of such data requires significant time of domain experts, it is imperative to develop methods which can work in limited annotated data scenarios. In this work, we explore whether out-of-domain data can help to improve out-of-context misinformation detection (termed here as multi-modal fake news detection) of a desired domain, e.g., politics, healthcare, etc. Towards this goal, we propose a novel framework termed DPOD (Domain-specific Prompt-tuning using Out-of-Domain data). First, to compute generalizable features, we modify the Vision-Language Model, CLIP, to extract features that help to align the representations of the images and corresponding text captions of both the in-domain and out-of-domain data in a label-aware manner. Further, we propose a domain-specific prompt learning technique which leverages the training samples of all the available domains based on the extent they can be useful to the desired domain. Extensive experiments on a large-scale benchmark dataset, namely NewsClippings, demonstrate that the proposed framework achieves state-of-the-art performance, significantly surpassing the existing approaches for this challenging task.

VeryFL: A Verify Federated Learning Framework Embedded with Blockchain

  • paper_url: http://arxiv.org/abs/2311.15617
  • repo_url: https://github.com/gtmllab/veryfl
  • paper_authors: Yihao Li, Yanyi Lai, Chuan Chen, Zibin Zheng
  • for: researchers and developers interested in exploring the application of blockchain technology in federated learning; the paper provides a blockchain-based federated learning framework compatible with existing federated learning training tasks.
  • methods: proposes VeryFL, a blockchain-based federated learning framework that embeds the Ethereum network and provides a code practice paradigm for combining federated learning with blockchain; the framework also includes model ownership authentication and watermarking mechanisms to protect intellectual property rights.
  • results: presents the overall structure of the VeryFL framework and demonstrates its feasibility by implementing some blockchain federated learning algorithms on smart contracts; the framework provides a verifiable training, aggregation, and incentive distribution procedure, helping ensure the integrity and security of the federated learning process.
    Abstract Blockchain-empowered federated learning (FL) has provoked extensive research recently. Various blockchain-based federated learning algorithms, architectures and mechanisms have been designed to solve issues like single point of failure and data falsification brought by the centralized FL paradigm. Moreover, it is easier to allocate incentives to nodes with the help of the blockchain. Various centralized federated learning frameworks like FedML have emerged in the community to help boost research on FL. However, a decentralized blockchain-based federated learning framework is still missing, which causes inconvenience for researchers who want to reproduce or verify algorithm performance based on blockchain. Inspired by the above issues, we have designed and developed a blockchain-based federated learning framework by embedding the Ethereum network. This report presents the overall structure of this framework, which proposes a code practice paradigm for the combination of FL with blockchain while remaining compatible with normal FL training tasks. In addition to implementing some blockchain federated learning algorithms on smart contracts to help execute FL training, we also propose a model ownership authentication architecture based on blockchain and model watermarking to protect the intellectual property rights of models. These mechanisms on blockchain show an underlying support of blockchain for federated learning, providing a verifiable training, aggregation and incentive distribution procedure; we thus named this framework VeryFL (A Verify Federated Learning Framework Embedded with Blockchain). The source code is available at https://github.com/GTMLLab/VeryFL.

Bayesian Approach to Linear Bayesian Networks

  • paper_url: http://arxiv.org/abs/2311.15610
  • repo_url: https://github.com/Aryia-Behroziuan/neurons
  • paper_authors: Seyong Hwang, Kyoungjae Lee, Sunmin Oh, Gunwoong Park
  • for: proposes the first Bayesian approach for learning high-dimensional linear Bayesian networks.
  • methods: iteratively estimates each element of the topological ordering from the back, together with its parent set, using the inverse of a partial covariance matrix.
  • results: successfully recovers the underlying structure when Bayesian regularization with unequal shrinkage is applied to the inverse covariance matrix; specifically, $n = \Omega( d_M^2 \log p)$ and $n = \Omega(d_M^2 p^{2/m})$ samples suffice for the proposed algorithm to learn linear Bayesian networks with sub-Gaussian and 4m-th bounded-moment error distributions, respectively, where $p$ is the number of nodes and $d_M$ is the maximum degree of the moralized graph.
    Abstract This study proposes the first Bayesian approach for learning high-dimensional linear Bayesian networks. The proposed approach iteratively estimates each element of the topological ordering from backward and its parent using the inverse of a partial covariance matrix. The proposed method successfully recovers the underlying structure when Bayesian regularization for the inverse covariance matrix with unequal shrinkage is applied. Specifically, it shows that the number of samples $n = \Omega( d_M^2 \log p)$ and $n = \Omega(d_M^2 p^{2/m})$ are sufficient for the proposed algorithm to learn linear Bayesian networks with sub-Gaussian and 4m-th bounded-moment error distributions, respectively, where $p$ is the number of nodes and $d_M$ is the maximum degree of the moralized graph. The theoretical findings are supported by extensive simulation studies including real data analysis. Furthermore the proposed method is demonstrated to outperform state-of-the-art frequentist approaches, such as the BHLSM, LISTEN, and TD algorithms in synthetic data.
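
The backward-iteration idea can be sketched for the equal-variance linear Gaussian case, where a sink node attains the minimal diagonal of the precision matrix and can be peeled off via a Schur complement. This is a simplified frequentist illustration of the ordering step, not the paper's Bayesian estimator with unequal shrinkage.

```python
# Sketch: recovering a topological ordering back-to-front for an
# equal-variance linear Gaussian SEM. A sink node attains the minimal
# diagonal of the precision matrix; after removing it (Schur complement
# of the marginal precision) the argument repeats.
import numpy as np

def backward_ordering(Sigma):
    """Return a topological ordering (sources first) from a covariance."""
    idx = list(range(Sigma.shape[0]))
    Omega = np.linalg.inv(Sigma)
    order = []
    while idx:
        j = int(np.argmin(np.diag(Omega)))
        order.append(idx.pop(j))
        keep = [k for k in range(Omega.shape[0]) if k != j]
        # Marginal precision of the remaining variables
        Omega = (Omega[np.ix_(keep, keep)]
                 - np.outer(Omega[keep, j], Omega[j, keep]) / Omega[j, j])
    return order[::-1]

# Ground-truth DAG: 0 -> 1 -> 2, equal error variances. B[i, j]: j -> i.
rng = np.random.default_rng(5)
B = np.array([[0, 0, 0], [0.8, 0, 0], [0, 0.7, 0]])
A = np.linalg.inv(np.eye(3) - B)
X = rng.standard_normal((50000, 3)) @ A.T
print(backward_ordering(np.cov(X.T)))   # expect [0, 1, 2]
```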

Optimal Clustering of Discrete Mixtures: Binomial, Poisson, Block Models, and Multi-layer Networks

  • paper_url: http://arxiv.org/abs/2311.15598
  • repo_url: None
  • paper_authors: Zhongyuan Lyu, Ting Li, Dong Xia
  • for: studies the fundamental limit of clustering networks when a multi-layer network is present, and proposes a novel two-stage clustering method to achieve the minimax optimal clustering error rate.
  • methods: a tensor-based initialization algorithm involving both node and sample splitting, followed by a refinement procedure using a likelihood-based Lloyd algorithm; handles extreme network sparsity under the mixture multi-layer stochastic block model (MMSBM).
  • results: outperforms existing methods in terms of clustering error rate and can handle count-type weights on edges; the optimal clustering error rates in discrete mixtures can also be achieved by the proposed method.
    Abstract In this paper, we first study the fundamental limit of clustering networks when a multi-layer network is present. Under the mixture multi-layer stochastic block model (MMSBM), we show that the minimax optimal network clustering error rate takes an exponential form and is characterized by the Renyi divergence between the edge probability distributions of the component networks. We propose a novel two-stage network clustering method including a tensor-based initialization algorithm involving both node and sample splitting and a refinement procedure by a likelihood-based Lloyd algorithm. Network clustering must be accompanied by node community detection. Our proposed algorithm achieves the minimax optimal network clustering error rate and allows extreme network sparsity under MMSBM. Numerical simulations and real data experiments both validate that our method outperforms existing methods. Oftentimes, the edges of networks carry count-type weights. We then extend our methodology and analysis framework to study the minimax optimal clustering error rate for mixtures of discrete distributions including Binomial, Poisson, and multi-layer Poisson networks. The minimax optimal clustering error rates in these discrete mixtures all take the same exponential form characterized by the Renyi divergences. These optimal clustering error rates in discrete mixtures can also be achieved by our proposed two-stage clustering algorithm.

Quantum Langevin Dynamics for Optimization

  • paper_url: http://arxiv.org/abs/2311.15587
  • repo_url: None
  • paper_authors: Zherui Chen, Yuchen Lu, Hao Wang, Yizhou Liu, Tongyang Li
  • for: solving optimization problems with non-convex objective functions, which pose substantial obstacles for traditional gradient descent, using Quantum Langevin Dynamics (QLD).
  • methods: couples the system to an infinite heat bath, which induces both random quantum noise and a deterministic damping effect that nudge the system toward a steady state near the global minimum.
  • results: theoretically proves convergence of QLD in convex landscapes, with the average energy approaching zero in the low-temperature limit at an exponential decay rate; numerically demonstrates the energy-dissipation capability of QLD and discusses the impact of each parameter in detail; finally proposes a time-dependent QLD (with time-dependent temperature and $\hbar$) that provably converges better and outperforms a series of state-of-the-art quantum and classical optimization algorithms on many non-convex landscapes.
    Abstract We initiate the study of utilizing Quantum Langevin Dynamics (QLD) to solve optimization problems, particularly those non-convex objective functions that present substantial obstacles for traditional gradient descent algorithms. Specifically, we examine the dynamics of a system coupled with an infinite heat bath. This interaction induces both random quantum noise and a deterministic damping effect to the system, which nudge the system towards a steady state that hovers near the global minimum of objective functions. We theoretically prove the convergence of QLD in convex landscapes, demonstrating that the average energy of the system can approach zero in the low temperature limit with an exponential decay rate correlated with the evolution time. Numerically, we first show the energy dissipation capability of QLD by retracing its origins to spontaneous emission. Furthermore, we conduct detailed discussion of the impact of each parameter. Finally, based on the observations when comparing QLD with classical Fokker-Plank-Smoluchowski equation, we propose a time-dependent QLD by making temperature and $\hbar$ time-dependent parameters, which can be theoretically proven to converge better than the time-independent case and also outperforms a series of state-of-the-art quantum and classical optimization algorithms in many non-convex landscapes.
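
As a point of reference for the classical baseline the paper compares against, the sketch below runs overdamped (classical) Langevin dynamics with an annealed temperature on a tilted double well; the noise term lets the iterate escape the worse basin before the schedule freezes it near the global minimum. This is the classical analogue, not the quantum dynamics.

```python
# Sketch: classical overdamped Langevin dynamics (Euler-Maruyama) with an
# annealed temperature, on f(x) = (x^2 - 1)^2 + 0.3 x. Noise plus damping
# typically carries the iterate from the shallower well to the global one.
import numpy as np

def grad(x):                      # f'(x) for the tilted double well above
    return 4 * x * (x ** 2 - 1) + 0.3

rng = np.random.default_rng(6)
x, dt = 0.9, 1e-3                 # start inside the shallower (worse) well
for t in range(200000):
    T = 1.0 / (1.0 + 1e-3 * t)    # temperature schedule T(t) -> 0
    x += -grad(x) * dt + np.sqrt(2 * T * dt) * rng.standard_normal()
print(round(x, 2))                # typically settles near x ~ -1.04
```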

A Simple Geometric-Aware Indoor Positioning Interpolation Algorithm Based on Manifold Learning

  • paper_url: http://arxiv.org/abs/2311.15583
  • repo_url: None
  • paper_authors: Suorong Yang, Geng Zhang, Jian Zhao, Furao Shen
  • for: improving the accuracy and efficiency of interpolation in indoor positioning systems.
  • methods: exploits the geometric attributes of the local topological manifold using manifold learning principles, so that points are estimated more simply and efficiently on that manifold instead of via complicated mathematical models.
  • results: systematic experiments and performance analyses on both simulated and real-world datasets show the proposed algorithm estimates indoor positions accurately and efficiently, with practical utility for real-time indoor positioning scenarios.
    Abstract Interpolation methodologies have been widely used within the domain of indoor positioning systems. However, existing indoor positioning interpolation algorithms exhibit several inherent limitations, including reliance on complex mathematical models, limited flexibility, and relatively low precision. To enhance the accuracy and efficiency of indoor positioning interpolation techniques, this paper proposes a simple yet powerful geometric-aware interpolation algorithm for indoor positioning tasks. The key to our algorithm is to exploit the geometric attributes of the local topological manifold using manifold learning principles. Therefore, instead of constructing complicated mathematical models, the proposed algorithm facilitates the more precise and efficient estimation of points grounded in the local topological manifold. Moreover, our proposed method can be effortlessly integrated into any indoor positioning system, thereby bolstering its adaptability. Through a systematic array of experiments and comprehensive performance analyses conducted on both simulated and real-world datasets, we demonstrate that the proposed algorithm consistently outperforms the most commonly used and representative interpolation approaches regarding interpolation accuracy and efficiency. Furthermore, the experimental results also underscore the substantial practical utility of our method and its potential applicability in real-time indoor positioning scenarios.
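
One common way to make interpolation respect the local manifold geometry is an LLE-style barycentric scheme: reconstruct the query fingerprint from its k nearest reference fingerprints, then apply the same weights to their positions. The sketch below follows that generic recipe; the anchors, fingerprint model, and parameters are invented, and this is not necessarily the paper's exact formulation.

```python
# Sketch: LLE-style geometry-aware interpolation for fingerprint-based
# positioning. Weights come from a regularized local Gram system.
import numpy as np

def interpolate_position(query, fingerprints, positions, k=4, reg=1e-3):
    d = np.linalg.norm(fingerprints - query, axis=1)
    nn = np.argsort(d)[:k]
    Z = fingerprints[nn] - query                  # neighbors centered on query
    G = Z @ Z.T + reg * np.trace(Z @ Z.T) * np.eye(k)
    w = np.linalg.solve(G, np.ones(k))
    w /= w.sum()                                  # barycentric weights
    return w @ positions[nn]

rng = np.random.default_rng(7)
positions = rng.uniform(0, 10, size=(100, 2))     # reference points (x, y)
anchors = rng.uniform(0, 10, size=(6, 2))         # hypothetical AP locations
def rssi(p):                                      # toy log-distance fingerprint
    return -20 * np.log10(np.linalg.norm(anchors - p, axis=1) + 0.1)
fingerprints = np.array([rssi(p) for p in positions])

truth = np.array([5.0, 5.0])
est = interpolate_position(rssi(truth), fingerprints, positions)
print(np.round(est, 2), "error:", round(float(np.linalg.norm(est - truth)), 2))
```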

Lightly Weighted Automatic Audio Parameter Extraction for the Quality Assessment of Consensus Auditory-Perceptual Evaluation of Voice

  • paper_url: http://arxiv.org/abs/2311.15582
  • repo_url: None
  • paper_authors: Yi-Heng Lin, Wen-Hsuan Tseng, Li-Chin Chen, Ching-Ting Tan, Yu Tsao
  • for: improving the standardization and reproducibility of auditory-perceptual voice evaluation, so as to increase the precision and reliability of diagnosing and assessing voice quality.
  • methods: proposes lightly weighted automatic audio parameter extraction to increase clinical relevance, reduce complexity, and improve the interpretability of voice quality assessment, using age, sex, and five audio parameters: jitter, absolute jitter, shimmer, harmonic-to-noise ratio (HNR), and zero crossing, with a classical machine learning approach for classification.
  • results: performs comparably to state-of-the-art (SOTA) methods and outperforms latent representations obtained from popular pre-trained audio models; audio parameters such as jitter and HNR prove suitable for characterizing voice quality attributes like roughness and strain, whereas pre-trained models show limitations in handling noise-related scorings.
    Abstract The Consensus Auditory-Perceptual Evaluation of Voice is a widely employed tool in clinical voice quality assessment that is significant for streaming communication among clinical professionals and benchmarking for the determination of further treatment. Currently, because the assessment relies on experienced clinicians, it tends to be inconsistent, and thus, difficult to standardize. To address this problem, we propose to leverage lightly weighted automatic audio parameter extraction, to increase the clinical relevance, reduce the complexity, and enhance the interpretability of voice quality assessment. The proposed method utilizes age, sex, and five audio parameters: jitter, absolute jitter, shimmer, harmonic-to-noise ratio (HNR), and zero crossing. A classical machine learning approach is employed. The result reveals that our approach performs similarly to state-of-the-art (SOTA) methods, and outperforms the latent representation obtained by using popular audio pre-trained models. This approach provides insights into the feasibility of different feature extraction approaches for voice evaluation. Audio parameters such as jitter and the HNR are proven to be suitable for characterizing voice quality attributes, such as roughness and strain. Conversely, pre-trained models exhibit limitations in effectively addressing noise-related scorings. This study contributes toward more comprehensive and precise voice quality evaluations, achieved by comprehensively exploring diverse assessment methodologies.
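
Rough versions of three of the five parameters can be computed directly from a waveform, as sketched below with numpy and librosa. Clinical tools such as Praat use more careful period detection, so treat these as simplified approximations rather than the paper's exact feature extractor.

```python
# Sketch: simplified estimates of jitter, HNR, and zero-crossing rate.
# Approximations only; not clinically validated definitions.
import numpy as np
import librosa

sr = 16000
y = librosa.tone(150, sr=sr, duration=1.0)                 # stand-in voice
y = y + 0.02 * np.random.default_rng(8).standard_normal(len(y))

# Jitter: mean absolute period-to-period variation over the mean period
f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)
periods = 1.0 / f0[np.isfinite(f0)]
jitter = np.mean(np.abs(np.diff(periods))) / np.mean(periods)

# HNR: from the peak of the normalized autocorrelation at the pitch lag
ac = np.correlate(y, y, mode='full')[len(y) - 1:]
ac /= ac[0]
lo, hi = int(sr / 400), int(sr / 60)                       # plausible lags
r = ac[lo:hi].max()
hnr_db = 10 * np.log10(r / (1 - r))

zcr = librosa.feature.zero_crossing_rate(y).mean()
print(round(jitter, 4), round(hnr_db, 1), round(float(zcr), 3))
```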

Streaming Lossless Volumetric Compression of Medical Images Using Gated Recurrent Convolutional Neural Network

  • paper_url: http://arxiv.org/abs/2311.16200
  • repo_url: None
  • paper_authors: Qianhao Chen, Jietao Chen
  • for: developing a practical, hardware-friendly lossless compression method for medical volumetric images, with both hardware design and software implementation in mind.
  • methods: a hardware-friendly streaming lossless compression framework using merely one-thousandth of the model weights of other learning-based frameworks; a gated recurrent convolutional neural network combining diverse convolutional structures and fusion gate mechanisms captures inter-slice dependencies in volumetric images, and this contextual information is used to predict the pixel-by-pixel distribution for entropy coding.
  • results: outperforms traditional lossless volumetric compressors and state-of-the-art learning-based lossless compression methods across various medical image benchmarks, with robust generalization and competitive compression speed, including enhanced real-time performance on an FPGA implementation.
    Abstract Deep learning-based lossless compression methods offer substantial advantages in compressing medical volumetric images. Nevertheless, many learning-based algorithms encounter a trade-off between practicality and compression performance. This paper introduces a hardware-friendly streaming lossless volumetric compression framework, utilizing merely one-thousandth of the model weights compared to other learning-based compression frameworks. We propose a gated recurrent convolutional neural network that combines diverse convolutional structures and fusion gate mechanisms to capture the inter-slice dependencies in volumetric images. Based on such contextual information, we can predict the pixel-by-pixel distribution for entropy coding. Guided by hardware/software co-design principles, we implement the proposed framework on Field Programmable Gate Array to achieve enhanced real-time performance. Extensive experimental results indicate that our method outperforms traditional lossless volumetric compressors and state-of-the-art learning-based lossless compression methods across various medical image benchmarks. Additionally, our method exhibits robust generalization ability and competitive compression speed

Experimental Analysis of Large-scale Learnable Vector Storage Compression

  • paper_url: http://arxiv.org/abs/2311.15578
  • repo_url: https://github.com/hugozhl/hetu
  • paper_authors: Hailin Zhang, Penghao Zhao, Xupeng Miao, Yingxia Shao, Zirui Liu, Tong Yang, Bin Cui
  • for: compressing embedding vectors to reduce memory consumption during model training and deployment.
  • methods: organizes 14 embedding compression methods into a new taxonomy and integrates them into a modular benchmarking framework for fair comparison and evaluation.
  • results: the experiments show that some methods compress better than others, but their relative merits had previously been unclear; the study reveals each approach's strengths and weaknesses under different memory budgets, recommends the best method per use case, uncovers limitations of current methods, and suggests directions for future research.
    Abstract Learnable embedding vector is one of the most important applications in machine learning, and is widely used in various database-related domains. However, the high dimensionality of sparse data in recommendation tasks and the huge volume of corpus in retrieval-related tasks lead to a large memory consumption of the embedding table, which poses a great challenge to the training and deployment of models. Recent research has proposed various methods to compress the embeddings at the cost of a slight decrease in model quality or the introduction of other overheads. Nevertheless, the relative performance of these methods remains unclear. Existing experimental comparisons only cover a subset of these methods and focus on limited metrics. In this paper, we perform a comprehensive comparative analysis and experimental evaluation of embedding compression. We introduce a new taxonomy that categorizes these techniques based on their characteristics and methodologies, and further develop a modular benchmarking framework that integrates 14 representative methods. Under a uniform test environment, our benchmark fairly evaluates each approach, presents their strengths and weaknesses under different memory budgets, and recommends the best method based on the use case. In addition to providing useful guidelines, our study also uncovers the limitations of current methods and suggests potential directions for future research.
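
As one representative technique from the family the benchmark compares, the sketch below implements quotient-remainder compositional embeddings, where two small tables indexed by id // m and id % m replace one |V| x d table. It is written in plain numpy and is independent of the paper's framework.

```python
# Sketch: quotient-remainder compositional embeddings, a well-known
# embedding-table compression technique. Illustrative, framework-free.
import numpy as np

class QREmbedding:
    def __init__(self, vocab_size, dim, seed=0):
        self.m = int(np.ceil(np.sqrt(vocab_size)))
        rng = np.random.default_rng(seed)
        self.quotient = rng.normal(0, 0.1, (self.m, dim))
        self.remainder = rng.normal(0, 0.1, (self.m, dim))

    def __call__(self, ids):
        ids = np.asarray(ids)
        # Element-wise combination of the two sub-embeddings
        return self.quotient[ids // self.m] * self.remainder[ids % self.m]

vocab, dim = 1_000_000, 64
emb = QREmbedding(vocab, dim)
full_params = vocab * dim
compressed = 2 * emb.m * dim
print(emb([3, 999_999]).shape, f"compression ~{full_params / compressed:.0f}x")
```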

Symphony: Symmetry-Equivariant Point-Centered Spherical Harmonics for Molecule Generation

  • paper_url: http://arxiv.org/abs/2311.16199
  • repo_url: None
  • paper_authors: Ameya Daigavane, Song Kim, Mario Geiger, Tess Smidt
  • for: an autoregressive generative model for 3D molecular geometries.
  • methods: message passing with higher-degree $E(3)$-equivariant features, using spherical harmonic signals to represent probability distributions and efficiently model the 3D geometry of molecules.
  • results: accurately generates small molecules from the QM9 dataset, outperforming existing autoregressive models and approaching the performance of diffusion models.
    Abstract We present Symphony, an $E(3)$-equivariant autoregressive generative model for 3D molecular geometries that iteratively builds a molecule from molecular fragments. Existing autoregressive models such as G-SchNet and G-SphereNet for molecules utilize rotationally invariant features to respect the 3D symmetries of molecules. In contrast, Symphony uses message-passing with higher-degree $E(3)$-equivariant features. This allows a novel representation of probability distributions via spherical harmonic signals to efficiently model the 3D geometry of molecules. We show that Symphony is able to accurately generate small molecules from the QM9 dataset, outperforming existing autoregressive models and approaching the performance of diffusion models.

Ultra-short-term multi-step wind speed prediction for wind farms based on adaptive noise reduction technology and temporal convolutional network

  • paper_url: http://arxiv.org/abs/2311.16198
  • repo_url: https://github.com/jethrojames/wind-speed-forecast-tcn_gru
  • paper_authors: Haojian Huang
  • for: improving the utilization of wind power to help cope with the energy crisis and environmental pollution.
  • methods: builds a wind speed prediction model from data noise reduction technology, a temporal convolutional network (TCN), and a gated recurrent unit (GRU).
  • results: validated on three wind farms in Shandong Province, the proposed model outperforms traditional models and other TCN-based models, achieving accurate and stable wind speed predictions whose results can support the operation and management of wind farms.
    Abstract As an important clean and renewable kind of energy, wind power plays an important role in coping with energy crisis and environmental pollution. However, the volatility and intermittency of wind speed restrict the development of wind power. To improve the utilization of wind power, this study proposes a new wind speed prediction model based on data noise reduction technology, temporal convolutional network (TCN), and gated recurrent unit (GRU). Firstly, an adaptive data noise reduction algorithm P-SSA is proposed based on singular spectrum analysis (SSA) and Pearson correlation coefficient. The original wind speed is decomposed into multiple subsequences by SSA and then reconstructed. When the Pearson correlation coefficient between the reconstructed sequence and the original sequence is greater than 0.99, other noise subsequences are deleted to complete the data denoising. Then, the receptive field of the samples is expanded through the causal convolution and dilated convolution of TCN, and the characteristics of wind speed change are extracted. Then, the time feature information of the sequence is extracted by GRU, and then the wind speed is predicted to form the wind speed sequence prediction model of P-SSA-TCN-GRU. The proposed model was validated on three wind farms in Shandong Province. The experimental results show that the prediction performance of the proposed model is better than that of the traditional model and other models based on TCN, and the wind speed prediction of wind farms with high precision and strong stability is realized. The wind speed predictions of this model have the potential to become the data that support the operation and management of wind farms. The code is available at link.
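
The P-SSA idea can be sketched as follows: decompose the series via SVD of its trajectory (Hankel) matrix, then keep adding leading reconstructed components until the Pearson correlation with the original series exceeds 0.99, treating the remainder as noise. The window length and test signal below are illustrative choices, not the paper's settings.

```python
# Sketch: SSA-based denoising in the spirit of P-SSA.
import numpy as np

def ssa_components(x, L=50):
    N = len(x)
    K = N - L + 1
    traj = np.column_stack([x[i:i + L] for i in range(K)])   # L x K Hankel
    U, s, Vt = np.linalg.svd(traj, full_matrices=False)
    comps = []
    for i in range(len(s)):
        Xi = s[i] * np.outer(U[:, i], Vt[i])                 # rank-1 piece
        # Anti-diagonal (Hankel) averaging back to a length-N series
        comps.append(np.array([np.mean(Xi[::-1].diagonal(k))
                               for k in range(-L + 1, K)]))
    return comps

def p_ssa_denoise(x, L=50, target=0.99):
    comps = ssa_components(x, L)
    rec = np.zeros_like(x)
    for c in comps:
        rec = rec + c
        if np.corrcoef(rec, x)[0, 1] > target:
            break                      # remaining components treated as noise
    return rec

rng = np.random.default_rng(9)
t = np.linspace(0, 8 * np.pi, 600)
clean = np.sin(t) + 0.5 * np.sin(3.1 * t)
noisy = clean + 0.4 * rng.standard_normal(len(t))
denoised = p_ssa_denoise(noisy)
print(round(np.corrcoef(denoised, clean)[0, 1], 3))   # closer to clean signal
```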

Learning with Complementary Labels Revisited: A Consistent Approach via Negative-Unlabeled Learning

  • paper_url: http://arxiv.org/abs/2311.15502
  • repo_url: None
  • paper_authors: Wei Wang, Takashi Ishida, Yu-Jie Zhang, Gang Niu, Masashi Sugiyama
  • for: a weakly supervised learning problem in which each training example is associated with one or more complementary labels indicating classes to which it does not belong.
  • methods: expresses complementary-label learning as a set of negative-unlabeled binary classification problems via the one-versus-rest strategy, yielding a risk-consistent approach with theoretical guarantees, plus a risk-correction method to address overfitting with complex models.
  • results: proves the statistical consistency and convergence rate of the corrected risk estimator; extensive experiments on both synthetic and real-world benchmark datasets validate the superiority of the proposed approach over state-of-the-art methods.
    Abstract Complementary-label learning is a weakly supervised learning problem in which each training example is associated with one or multiple complementary labels indicating the classes to which it does not belong. Existing consistent approaches have relied on the uniform distribution assumption to model the generation of complementary labels, or on an ordinary-label training set to estimate the transition matrix. However, both conditions may not be satisfied in real-world scenarios. In this paper, we propose a novel complementary-label learning approach that does not rely on these conditions. We find that complementary-label learning can be expressed as a set of negative-unlabeled binary classification problems when using the one-versus-rest strategy. This observation allows us to propose a risk-consistent approach with theoretical guarantees. Furthermore, we introduce a risk correction approach to address overfitting problems when using complex models. We also prove the statistical consistency and convergence rate of the corrected risk estimator. Extensive experimental results on both synthetic and real-world benchmark datasets validate the superiority of our proposed approach over state-of-the-art methods.

eess.IV - 2023-11-27

Joint Deep Image Restoration and Unsupervised Quality Assessment

  • paper_url: http://arxiv.org/abs/2311.16372
  • repo_url: None
  • paper_authors: Hakan Emre Gedik, Abhinau K. Venkataramanan, Alan C. Bovik
  • for: research at the intersection of image restoration and image quality assessment.
  • methods: a novel attention-based convolutional neural network that performs image restoration and quality assessment simultaneously, using "quality attention" maps.
  • results: achieves state-of-the-art JPEG deblocking accuracy and a high correlation between predicted quality and human opinion scores.
    Abstract Deep learning techniques have revolutionized the fields of image restoration and image quality assessment in recent years. While image restoration methods typically utilize synthetically distorted training data for training, deep quality assessment models often require expensive labeled subjective data. However, recent studies have shown that activations of deep neural networks trained for visual modeling tasks can also be used for perceptual quality assessment of images. Following this intuition, we propose a novel attention-based convolutional neural network capable of simultaneously performing both image restoration and quality assessment. We achieve this by training a JPEG deblocking network augmented with "quality attention" maps and demonstrating state-of-the-art deblocking accuracy, achieving a high correlation of predicted quality with human opinion scores.

Observer study-based evaluation of TGAN architecture used to generate oncological PET images

  • paper_url: http://arxiv.org/abs/2311.16047
  • repo_url: None
  • paper_authors: Roberto Fedrigo, Fereshteh Yousefirizi, Ziping Liu, Abhinav K. Jha, Robert V. Bergen, Jean-Francois Rajotte, Raymond T. Ng, Ingrid Bloise, Sara Harsini, Dan J. Kadrmas, Carlos Uribe, Arman Rahmim
  • for: quantitatively evaluating whether human observers can distinguish real oncological PET images from synthetic ones.
  • methods: a two-alternative forced-choice (2AFC) observer study assessing human observers' discrimination ability.
  • results: six of the eight trained observers could not identify the real image with statistical significance, indicating the synthetic dataset is reasonably representative of oncological PET images.
    Abstract The application of computer-vision algorithms in medical imaging has increased rapidly in recent years. However, algorithm training is challenging due to limited sample sizes, lack of labeled samples, as well as privacy concerns regarding data sharing. To address these issues, we previously developed (Bergen et al. 2022) a synthetic PET dataset for Head and Neck (H and N) cancer using the temporal generative adversarial network (TGAN) architecture and evaluated its performance segmenting lesions and identifying radiomics features in synthesized images. In this work, a two-alternative forced-choice (2AFC) observer study was performed to quantitatively evaluate the ability of human observers to distinguish between real and synthesized oncological PET images. In the study eight trained readers, including two board-certified nuclear medicine physicians, read 170 real/synthetic image pairs presented as 2D-transaxial using a dedicated web app. For each image pair, the observer was asked to identify the real image and input their confidence level with a 5-point Likert scale. P-values were computed using the binomial test and Wilcoxon signed-rank test. A heat map was used to compare the response accuracy distribution for the signed-rank test. Response accuracy for all observers ranged from 36.2% [27.9-44.4] to 63.1% [54.8-71.3]. Six out of eight observers did not identify the real image with statistical significance, indicating that the synthetic dataset was reasonably representative of oncological PET images. Overall, this study adds validity to the realism of our simulated H&N cancer dataset, which may be implemented in the future to train AI algorithms while favoring patient confidentiality and privacy protection.
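
The statistical readout of such a 2AFC study is an exact binomial test of each reader's accuracy against chance. A minimal sketch with scipy follows; n = 170 pairs as in the study, but the example accuracies are illustrative, not the reported per-reader data.

```python
# Sketch: exact binomial test of 2AFC reader accuracy against chance (p = 0.5).
from scipy.stats import binomtest

n_pairs = 170
for accuracy in (0.36, 0.50, 0.63):
    k = round(accuracy * n_pairs)                 # correct identifications
    res = binomtest(k, n_pairs, p=0.5, alternative='greater')
    flag = "distinguishable" if res.pvalue < 0.05 else "not significant"
    print(f"acc={accuracy:.2f}  p={res.pvalue:.4f}  -> {flag}")
```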

Machine-to-Machine Transfer Function in Deep Learning-Based Quantitative Ultrasound

  • paper_url: http://arxiv.org/abs/2311.16028
  • repo_url: None
  • paper_authors: Ufuk Soylu, Michael L. Oelze
  • for: reducing data mismatches in deep learning (DL) based quantitative ultrasound (QUS), particularly mismatches introduced at the machine level during data acquisition.
  • methods: extends the Transfer Function approach and introduces a Machine-to-Machine (M2M) Transfer Function that mitigates data mismatches between two scanners over the same frequency band, enabling models to be transferred across machines.
  • results: with data passed through the transfer function, mean classification accuracy rose from about 50% to roughly 90% and AUC from 0.40 to 0.99; the choice of calibration phantom plays an important role, and a robust Wiener-filtering-inspired implementation transfers the domain from one machine to another using just a single calibration view.
    Abstract A Transfer Function approach was recently demonstrated to mitigate data mismatches at the acquisition level for a single ultrasound scanner in deep learning (DL) based quantitative ultrasound (QUS). As a natural progression, we further investigate the transfer function approach and introduce a Machine-to-Machine (M2M) Transfer Function, which possesses the ability to mitigate data mismatches at a machine level, i.e., mismatches between two scanners over the same frequency band. This ability opens the door to unprecedented opportunities for reducing DL model development costs, enabling the combination of data from multiple sources or scanners, or facilitating the transfer of DL models between machines with ease. We tested the proposed method utilizing a SonixOne machine and a Verasonics machine. In the experiments, we used a L9-4 array and conducted two types of acquisitions to obtain calibration data: stable and free-hand, using two different calibration phantoms. Without the proposed calibration method, the mean classification accuracy when applying a model on data acquired from one system to data acquired from another system was approximately 50%, and the mean AUC was about 0.40. With the proposed method, mean accuracy increased to approximately 90%, and the AUC rose to the 0.99. Additional observations include that shifts in statistics for the z-score normalization had a significant impact on performance. Furthermore, the choice of the calibration phantom played an important role in the proposed method. Additionally, robust implementation inspired by Wiener filtering provided an effective method for transferring the domain from one machine to another machine, and it can succeed using just a single calibration view without the need for multiple independent calibration frames.
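
The Wiener-inspired calibration can be sketched as estimating a frequency-domain transfer function from paired, stable acquisitions of the same phantom region on both machines, then applying it to map new data into the reference machine's domain. Signal shapes, noise levels, and the regularization constant below are invented for illustration.

```python
# Sketch: Wiener-style estimation of a machine-to-machine transfer
# function from paired calibration spectra, then applying it to map
# machine-B data into machine-A's domain. All parameters are invented.
import numpy as np

rng = np.random.default_rng(10)
n, m = 512, 64
true_h = np.fft.rfft(np.exp(-np.arange(n) / 8.0))   # unknown system difference

# Paired "stable" calibration acquisitions of the same phantom region
base = rng.standard_normal((m, n))                  # shared tissue response
cal_a = base + 0.05 * rng.standard_normal((m, n))
cal_b = (np.fft.irfft(np.fft.rfft(base) * true_h, n)
         + 0.05 * rng.standard_normal((m, n)))

A, B = np.fft.rfft(cal_a), np.fft.rfft(cal_b)
# Wiener-style estimate of the B -> A transfer function:
# H = <conj(B) A> / (<|B|^2> + regularization)
num = (np.conj(B) * A).mean(axis=0)
den = (np.abs(B) ** 2).mean(axis=0)
H = num / (den + 1e-2 * den.max())

s = rng.standard_normal(n)                          # a new, unseen response
new_b = np.fft.irfft(np.fft.rfft(s) * true_h, n)    # as seen by machine B
mapped = np.fft.irfft(np.fft.rfft(new_b) * H, n)    # transferred to A's domain
print(round(np.corrcoef(mapped, s)[0, 1], 3))       # recovers s up to noise
```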

Modular Customizable ROS-Based Framework for Rapid Development of Social Robots

  • paper_url: http://arxiv.org/abs/2311.15780
  • repo_url: None
  • paper_authors: Mahta Akhyani, Hadi Moradi
  • for: developing socially competent robots with tight integration of robotics, computer vision, speech processing, and web technologies
  • methods: using an open-source framework called Socially-interactive Robot Software platform (SROS) with a modular layered architecture, bridging Robot Operating System (ROS) with web and Android interface layers, and implementing specialized perceptual and interactive skills as ROS services
  • results: successfully validated core technologies including computer vision, speech processing, and GPT2 autocomplete speech, demonstrated modularity through integration of an additional ROS package, and enabled synchronized cross-domain interaction and multimodal behaviors on an example platform
    Abstract Developing socially competent robots requires tight integration of robotics, computer vision, speech processing, and web technologies. We present the Socially-interactive Robot Software platform (SROS), an open-source framework addressing this need through a modular layered architecture. SROS bridges the Robot Operating System (ROS) layer for mobility with web and Android interface layers using standard messaging and APIs. Specialized perceptual and interactive skills are implemented as ROS services for reusable deployment on any robot. This facilitates rapid prototyping of collaborative behaviors that synchronize perception with physical actuation. We experimentally validated core SROS technologies including computer vision, speech processing, and GPT2 autocomplete speech implemented as plug-and-play ROS services. Modularity is demonstrated through the successful integration of an additional ROS package, without changes to hardware or software platforms. The capabilities enabled confirm SROS's effectiveness in developing socially interactive robots through synchronized cross-domain interaction. Through demonstrations showing synchronized multimodal behaviors on an example platform, we illustrate how the SROS architectural approach addresses shortcomings of previous work by lowering barriers for researchers to advance the state-of-the-art in adaptive, collaborative customizable human-robot systems through novel applications integrating perceptual and social abilities.

Model-based reconstructions for quantitative imaging in photoacoustic tomography

  • paper_url: http://arxiv.org/abs/2311.15735
  • repo_url: None
  • paper_authors: Andreas Hauptmann, Tanja Tarvainen
  • for: an overview of the reconstruction task in photoacoustic tomography, including how the measured targets, geometry, and the quantity to be recovered shape the problem.
  • methods: covers modelling of the optical and acoustic phenomena needed for reliable quantitative recovery, and surveys approaches to the tomographic reconstruction problem, from direct, fast analytic methods to computationally involved optimization-based techniques and recent data-driven approaches.
  • results: summarizes established reconstruction techniques in photoacoustic tomography, covering both the quantitative recovery of physical parameters and image reconstruction.
    Abstract The reconstruction task in photoacoustic tomography can vary a lot depending on measured targets, geometry, and especially the quantity we want to recover. Specifically, as the signal is generated due to the coupling of light and sound by the photoacoustic effect, we have the possibility to recover acoustic as well as optical tissue parameters. This is referred to as quantitative imaging, i.e, correct recovery of physical parameters and not just a qualitative image. In this chapter, we aim to give an overview on established reconstruction techniques in photoacoustic tomography. We start with modelling of the optical and acoustic phenomena, necessary for a reliable recovery of quantitative values. Furthermore, we give an overview of approaches for the tomographic reconstruction problem with an emphasis on the recovery of quantitative values, from direct and fast analytic approaches to computationally involved optimisation based techniques and recent data-driven approaches.

eess.SP - 2023-11-27

What Really is `Molecule’ in Molecular Communications? The Quest for Physics of Particle-based Information Carriers

  • paper_url: http://arxiv.org/abs/2311.16356
  • repo_url: None
  • paper_authors: Hanlin Xiao, Kamela Dokaj, Ozgur B. Akan
  • for: examining the nature and properties of the information molecules used in molecular communication, and how they behave across different communication systems and applications.
  • methods: surveys commonly used information molecules, their fundamental physical characteristics, the associated communication systems, and potential applications, focusing on the influence of the molecules' own properties.
  • results: finds that the physical properties of information molecules strongly affect their performance in different communication systems (for example, their mobility and stability differ across propagation media) and constrain the design and implementation of those systems.
    Abstract Molecular communication, as implied by its name, uses molecules as information carriers for communication between objects. It has an advantage over traditional electromagnetic-wave-based communication in that molecule-based systems could be biocompatible, operable in challenging environments, and energetically undemanding. Consequently, they are envisioned to have a broad range of applications, such as in the Internet of Bio-nano Things, targeted drug delivery, and agricultural monitoring. Despite the rapid development of the field, with an increasing number of theoretical models and experimental testbeds established by researchers, a fundamental aspect of the field has often been sidelined, namely, the nature of the molecule in molecular communication. The potential information molecules could exhibit a wide range of properties, making them require drastically different treatments when being modeled and experimented upon. Therefore, in this paper, we delve into the intricacies of commonly used information molecules, examining their fundamental physical characteristics, associated communication systems, and potential applications in a more realistic manner, focusing on the influence of their own properties. Through this comprehensive survey, we aim to offer a novel yet essential perspective on molecular communication, thereby bridging the current gap between theoretical research and real-world applications.
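
As a concrete instance of how a molecule's physical properties enter the channel model, the sketch below evaluates the standard 3D-diffusion result for a point transmitter and a fully absorbing spherical receiver. This is a common baseline in the molecular communication literature rather than this paper's specific model, and the diffusion coefficient is only an illustrative value.

```python
import math

def fraction_absorbed(t, d, a, D):
    """Fraction of released molecules absorbed by time t for a point transmitter
    at distance d from a fully absorbing spherical receiver of radius a, with
    free diffusion coefficient D (standard 3D first-passage result)."""
    return (a / d) * math.erfc((d - a) / math.sqrt(4.0 * D * t))

# Example: receiver radius 1 um, transmitter 5 um away,
# D ~ 1e-9 m^2/s (typical order for a small molecule in water; illustrative).
a, d, D = 1e-6, 5e-6, 1e-9
for t in (1e-3, 1e-2, 1e-1, 1.0):
    print(f"t = {t:6.3f} s  ->  hit fraction = {fraction_absorbed(t, d, a, D):.4f}")
```

The asymptotic hitting probability is a/d, so even the eventual delivery ratio, not just latency, depends on geometry and on how freely the chosen molecule diffuses in the medium.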

Simultaneous Energy Harvesting and Hand Gesture Recognition in Large Area Monolithic Dye-Sensitized Solar Cells

  • paper_url: http://arxiv.org/abs/2311.16284
  • repo_url: None
  • paper_authors: Gethin Thomas, Adam Pockett, Kris Seunarine, Matt Carnie
  • for: To explore how dye-sensitized solar cells (DSSCs) can enable natural human-computer interaction (HCI) with Internet of Things (IoT) devices.
  • methods: An asymmetrically patterned monolithic (single-cell) DSSC, with machine learning algorithms used to recognize hand gestures from its output.
  • results: By monitoring the photocurrent output of the DSSC and applying machine learning, simple hand gestures are recognized with an accuracy of 97.71%. This indicates that DSSCs are an excellent choice for self-powered interactive technologies under ambient light, while retaining the aesthetic qualities that users value.
    Abstract Internet of Things (IoT) devices have become prevalent, embedding intelligence into our environment. It is projected that over 75 billion IoT devices will be connected worldwide by 2025, with the majority being operated indoors. Dye-sensitized solar cells (DSSC) have recently been optimized for ambient light, having the capabilities of providing sufficient energy for self-powered IoT devices. Interaction with digital technologies, termed Human Computer Interaction (HCI), is often achieved via physical mechanisms (e.g. remote controls, cell phones) which can hinder the natural interface between users and IoT devices, a key consideration for HCI. What if the solar cell that is powering the IoT device could also recognize hand gestures, allowing the user to naturally interact with the system? Previous attempts to achieve this have necessarily employed an array of solar cells/photodiodes to detect directionality. In this work, we demonstrate that by monitoring the photocurrent output of an asymmetrically patterned monolithic (i.e., single cell) DSSC, and using machine learning, we can recognize simple hand gestures, achieving a prediction accuracy of 97.71%. This work shows that DSSCs are the perfect choice for self-powered interactive technologies, both in terms of powering IoT devices in ambient light conditions and having aesthetic qualities that are prioritized by users. As well as powering interactive technologies, they can also provide a means of interactive control.
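
The abstract does not detail the ML pipeline, so the following is only a minimal sketch of the general approach it describes: treat windowed photocurrent traces as feature vectors and train a classifier (a random forest here, chosen arbitrarily). The synthetic traces merely stand in for real DSSC measurements.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Hypothetical dataset: 400 photocurrent traces (200 samples each), one of
# 4 gesture classes (e.g. swipe left/right/up/down). Real data would come
# from the DSSC's measured photocurrent; here we fabricate toy traces whose
# shading pattern depends on the class label.
n_traces, n_samples, n_classes = 400, 200, 4
y = rng.integers(0, n_classes, size=n_traces)
t = np.linspace(0, 1, n_samples)
X = np.array([np.sin(2 * np.pi * (k + 1) * t) + 0.3 * rng.standard_normal(n_samples)
              for k in y])

# Simple features: the raw trace plus its first difference (the difference
# captures the direction in which the shadow sweeps across the cell).
feats = np.hstack([X, np.diff(X, axis=1)])

X_tr, X_te, y_tr, y_te = train_test_split(feats, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("toy accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```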

MadRadar: A Black-Box Physical Layer Attack Framework on mmWave Automotive FMCW Radars

  • paper_url: http://arxiv.org/abs/2311.16024
  • repo_url: None
  • paper_authors: David Hunt, Kristen Angell, Zhenzhou Qi, Tingjun Chen, Miroslav Pajic
  • for: To attack the millimeter-wave radar systems of autonomous vehicles at the physical layer, going beyond false-positive spoofing.
  • methods: MadRadar, a black-box attack framework that estimates the victim radar's configuration in real time and then executes attacks based on those estimates.
  • results: An attacker can mount false-positive, false-negative, and translation attacks on the victim radar's point cloud, as demonstrated in real-world case studies.
    Abstract Frequency modulated continuous wave (FMCW) millimeter-wave (mmWave) radars play a critical role in many of the advanced driver assistance systems (ADAS) featured on today's vehicles. While previous works have demonstrated (only) successful false-positive spoofing attacks against these sensors, all but one assumed that an attacker had the runtime knowledge of the victim radar's configuration. In this work, we introduce MadRadar, a general black-box radar attack framework for automotive mmWave FMCW radars capable of estimating the victim radar's configuration in real-time, and then executing an attack based on the estimates. We evaluate the impact of such attacks maliciously manipulating a victim radar's point cloud, and show the novel ability to effectively `add' (i.e., false positive attacks), `remove' (i.e., false negative attacks), or `move' (i.e., translation attacks) object detections from a victim vehicle's scene. Finally, we experimentally demonstrate the feasibility of our attacks on real-world case studies performed using a real-time physical prototype on a software-defined radio platform.
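
To see why estimating the victim's configuration is the crux, the sketch below uses the standard FMCW geometry: once the attacker knows the chirp slope and timing, placing a phantom target reduces to emitting a matching chirp at a computed offset. The numbers are illustrative and this is not MadRadar's actual implementation.

```python
C = 3e8  # speed of light (m/s)

def beat_frequency(range_m, slope_hz_per_s):
    """Beat (IF) frequency produced by an echo at the given range for an FMCW
    chirp of the given slope -- the standard relation f_b = 2*S*R/c."""
    return 2.0 * slope_hz_per_s * range_m / C

def spoof_offset(fake_range_m):
    """Time offset, relative to the victim chirp start, at which the attacker's
    matching-slope chirp must arrive to mimic an echo from fake_range_m."""
    return 2.0 * fake_range_m / C

# Assumed (i.e., black-box-estimated) victim configuration; illustrative only:
slope = 30e12        # 30 MHz/us chirp slope
fake_range = 40.0    # inject a phantom object at 40 m
print(f"attacker chirp offset: {spoof_offset(fake_range)*1e9:.1f} ns")
print(f"resulting beat tone  : {beat_frequency(fake_range, slope)/1e6:.2f} MHz")
```

The fragility is apparent: a small error in the estimated slope or chirp start time shifts the injected beat tone, so the phantom lands in the wrong range bin, which is why MadRadar's real-time configuration estimation matters.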

Value-Based Reinforcement Learning for Digital Twins in Cloud Computing

  • paper_url: http://arxiv.org/abs/2311.15985
  • repo_url: None
  • paper_authors: Van-Phuc Bui, Shashi Raj Pandey, Pedro M. de Sant Ana, Petar Popovski
  • for: Researchers and practitioners working on digital twin (DT) technology, particularly in the context of networked control systems.
  • methods: A reinforcement learning solution combined with a Value of Information-based algorithm for performing optimal control and selecting the most informative sensors to satisfy the prediction accuracy of the DT.
  • results: The proposed method, REVERB, reduces the communication overhead by up to five times while offering satisfactory performance for the DT platform.
    Abstract The setup considered in the paper consists of sensors in a Networked Control System that are used to build a digital twin (DT) model of the system dynamics. The focus is on control, scheduling, and resource allocation for sensory observation to ensure timely delivery to the DT model deployed in the cloud. Low latency and communication timeliness are instrumental in ensuring that the DT model can accurately estimate and predict system states. However, acquiring data for efficient state estimation and control computing poses a non-trivial problem given the limited network resources, partial state vector information, and measurement errors encountered at distributed sensors. We propose the REinforcement learning and Variational Extended Kalman filter with Robust Belief (REVERB), which leverages a reinforcement learning solution combined with a Value of Information-based algorithm for performing optimal control and selecting the most informative sensors to satisfy the prediction accuracy of DT. Numerical results demonstrate that the DT platform can offer satisfactory performance while reducing the communication overhead up to five times.
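
A minimal sketch of the Value-of-Information idea behind sensor selection: at each step, query the sensor whose measurement most shrinks the Kalman posterior covariance. The greedy trace criterion and the toy two-state system are assumptions for illustration, not necessarily REVERB's exact formulation.

```python
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])        # toy system dynamics
Q = 0.01 * np.eye(2)                           # process noise covariance
H = [np.array([[1.0, 0.0]]),                   # sensor 0 observes position
     np.array([[0.0, 1.0]])]                   # sensor 1 observes velocity
R = [np.array([[0.05]]), np.array([[0.02]])]   # per-sensor measurement noise

P = np.eye(2)                                  # initial state covariance
for step in range(5):
    P = A @ P @ A.T + Q                        # Kalman predict
    best, best_trace = None, np.inf
    for i, (Hi, Ri) in enumerate(zip(H, R)):   # evaluate each candidate sensor
        S = Hi @ P @ Hi.T + Ri
        K = P @ Hi.T @ np.linalg.inv(S)
        trace = np.trace((np.eye(2) - K @ Hi) @ P)
        if trace < best_trace:                 # pick the most informative one
            best, best_trace, K_best, H_best = i, trace, K, Hi
    P = (np.eye(2) - K_best @ H_best) @ P      # update with the chosen sensor
    print(f"step {step}: query sensor {best}, posterior trace = {best_trace:.4f}")
```

Querying only one sensor per step is exactly where the communication savings come from: the scheduler spends network resources only when the expected reduction in DT uncertainty justifies it.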

A New Polar-Domain Dictionary Design for the Near-field Region of Extremely Large Aperture Arrays

  • paper_url: http://arxiv.org/abs/2311.15828
  • repo_url: None
  • paper_authors: Özlem Tuğfe Demir, Emil Björnson
  • for: To improve the localization of user equipments (UEs) in the radiative near-field region of a multiple-antenna base station (BS) with an extremely large aperture array.
  • methods: A polar-domain grid design based on a near-field dictionary.
  • results: In simulations, the proposed non-uniform distance sampling achieves lower column coherence and better UE localization than uniform distance sampling.
    Abstract A grid of orthogonal beams with zero column coherence can be easily constructed to cover all prospective user equipments (UEs) in the far-field region of a multiple-antenna base station (BS). However, when the BS is equipped with an extremely large aperture array, the Fraunhofer distance is huge, causing the UEs to be located in the radiative near-field region. This calls for designing a grid of beams based on a near-field dictionary. In the previous work, a polar-domain grid design was proposed to maintain control over the column coherence. A limitation of this approach is identified in this paper, and we propose an enhanced methodology for the design of a polar-domain dictionary specifically tailored for the near-field of an extremely large aperture uniform planar array. Through simulation results, it is demonstrated that the proposed dictionary, employing a non-uniform distance sampling approach, achieves lower column coherence than the benchmark and significantly improves the localization of UEs compared to uniform distance sampling.
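
The quantity the dictionary design targets, column coherence, is easy to evaluate numerically. The sketch below builds spherical-wave (near-field) steering vectors on an angle-distance grid and computes the maximum pairwise coherence; it uses a uniform linear array for brevity, whereas the paper treats a uniform planar array, and the grid values are illustrative rather than the paper's sampling rule.

```python
import numpy as np

fc = 28e9
lam = 3e8 / fc
N = 128                                        # number of antennas
x = (np.arange(N) - (N - 1) / 2) * lam / 2     # ULA element positions

def steer(theta, r):
    """Spherical-wave (near-field) array response for a source at angle theta
    and distance r from the array center (exact per-element path lengths)."""
    rn = np.sqrt(r**2 + x**2 - 2 * r * x * np.sin(theta))
    a = np.exp(-1j * 2 * np.pi * (rn - r) / lam)
    return a / np.sqrt(N)

angles = np.deg2rad(np.linspace(-60, 60, 32))
dists = np.array([3.0, 5.0, 10.0, 20.0])       # non-uniform distance samples
D = np.array([steer(t, r) for t in angles for r in dists]).T  # dictionary

G = np.abs(D.conj().T @ D)                     # Gram matrix of the columns
np.fill_diagonal(G, 0.0)
print("max column coherence:", G.max())        # lower is better for recovery
```

Re-running this with uniformly spaced distances and comparing the resulting maximum coherence is a quick way to reproduce, qualitatively, the advantage the paper claims for non-uniform distance sampling.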

  • paper_url: http://arxiv.org/abs/2311.15752
  • repo_url: None
  • paper_authors: Prerna Singh, Ayush Tripathi, Lalan Kumar, Tapan Kumar Gandhi
  • for: To study how aging affects audio-visual integration (AVI), with particular attention to age-related changes in middle-aged adults.
  • methods: EEG recordings during visual, auditory, and audio-visual tasks, compared across Young, Transition, and Middle age cohorts.
  • results: Middle-aged participants show delayed brain activation, especially for bimodal stimuli, with significant age-related changes in the superior temporal cortex and superior frontal gyrus. The AVI-associated brain regions cluster into five distinct networks under the k-means algorithm, and functional connectivity increases in middle age, particularly across frontal, temporal, and occipital regions.
    Abstract The seamless integration of visual and auditory information is a fundamental aspect of human cognition. Although age-related functional changes in Audio-Visual Integration (AVI) have been extensively explored in the past, thorough studies across various age groups remain insufficient. Previous studies have provided valuable insights into age-related AVI using EEG-based sensor data. However, these studies have been limited in their ability to capture spatial information related to brain source activation and their connectivity. To address these gaps, our study conducted a comprehensive audio-visual integration task with a specific focus on assessing the aging effects in various age groups, particularly middle-aged individuals. We presented visual, auditory, and audio-visual stimuli and recorded EEG data from Young (18-25 years), Transition (26-33 years), and Middle (34-42 years) age cohort healthy participants. We aimed to understand how aging affects brain activation and functional connectivity among hubs during audio-visual tasks. Our findings revealed delayed brain activation in middle-aged individuals, especially for bimodal stimuli. The superior temporal cortex and superior frontal gyrus showed significant changes in neuronal activation with aging. Lower frequency bands (theta and alpha) showed substantial changes with increasing age during AVI. Our findings also revealed that the AVI-associated brain regions can be clustered into five different brain networks using the k-means algorithm. Additionally, we observed increased functional connectivity in middle age, particularly in the frontal, temporal, and occipital regions. These results highlight the compensatory neural mechanisms involved in aging during cognitive tasks.
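
A minimal sketch of the clustering step mentioned in the abstract: group brain regions into five networks by applying k-means to their functional-connectivity profiles. The random connectivity matrix merely stands in for real EEG-derived connectivity, and the features used in the paper may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
n_regions = 64
conn = rng.random((n_regions, n_regions))   # stand-in connectivity matrix
conn = (conn + conn.T) / 2                  # symmetrize: undirected connectivity

# Each region's feature vector is its row of connectivity strengths to all
# other regions; k-means then groups regions with similar profiles.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(conn)
for k in range(5):
    print(f"network {k}: {np.sum(labels == k)} regions")
```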

Error Performance of Coded AFDM Systems in Doubly Selective Channels

  • paper_url: http://arxiv.org/abs/2311.15595
  • repo_url: None
  • paper_authors: Haoran Yin
  • for: To investigate the error performance of coded AFDM systems in doubly selective channels.
  • methods: The conditional pairwise-error probability (PEP) of the AFDM system is studied and its conditional coding gain derived. A fundamental trade-off between the coding gain and the diversity gain is identified: the coding gain declines, at a decreasing rate, as the number of separable paths grows, while the diversity gain increases linearly.
  • results: A near-optimal turbo decoder based on the sum-product algorithm is proposed to improve error performance. Simulations verify the analysis and show AFDM outperforming OFDM and OTFS over high-mobility channels.
    Abstract Affine frequency division multiplexing (AFDM) is a strong candidate for the sixth-generation wireless network thanks to its strong resilience to delay-Doppler spreads. In this letter, we investigate the error performance of coded AFDM systems in doubly selective channels. We first study the conditional pairwise-error probability (PEP) of AFDM system and derive its conditional coding gain. Then, we show that there is a fundamental trade-off between the diversity gain and the coding gain of AFDM system, namely the coding gain declines with a descending speed with respect to the number of separable paths, while the diversity gain increases linearly. Moreover, we propose a near-optimal turbo decoder based on the sum-product algorithm for coded AFDM systems to improve its error performance. Simulation results verify our analyses and the effectiveness of the proposed turbo decoder, showing that AFDM outperforms orthogonal frequency division multiplexing (OFDM) and orthogonal time frequency space (OTFS) in both coded and uncoded cases over high-mobility channels.
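
The trade-off described above can be read off the standard high-SNR form of the pairwise error probability, shown below in generic form (not the letter's exact derivation).

```latex
% Generic high-SNR pairwise error probability (PEP); with average SNR \gamma:
P(\mathbf{x} \to \hat{\mathbf{x}}) \;\lesssim\; \bigl(G_c\,\gamma\bigr)^{-G_d}
% G_d (diversity gain): the slope of the error curve on a log-log plot; it
%     grows linearly with the number P of separable paths, G_d \propto P.
% G_c (coding gain): the horizontal SNR shift of the curve; per the letter,
%     it declines with P, though at a decreasing speed -- hence the trade-off.
```

In other words, adding separable delay-Doppler paths steepens the error curve but pushes it to the right, and the letter quantifies where the net benefit lies for coded AFDM.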

Performance Analysis of MDMA-Based Cooperative MRC Networks with Relays in Dissimilar Rayleigh Fading Channels

  • paper_url: http://arxiv.org/abs/2311.15593
  • repo_url: None
  • paper_authors: Lei Teng, Wannian An, Chen Dong, Xiaoqi Qin, Xiaodong Xu
  • for: To propose a cooperative network based on model division multiple access (MDMA) that improves the spectrum efficiency and feasibility region of next-generation wireless communication systems.
  • methods: An MDMA-based cooperative relay network over dissimilar Rayleigh fading channels, comprising two source nodes, an arbitrary number of decode-and-forward (DF) relay nodes, and one destination node that combines the signals received from the sources and relays using maximal ratio combining (MRC).
  • results: Closed-form analytical expressions for the outage probability and resource utilization efficiency are derived via the state transition matrix (STM) and moment generating function (MGF); theoretical and simulation results agree.
    Abstract Multiple access technology is a key technology in various generations of wireless communication systems. As a potential multiple access technology for the next generation of wireless communication systems, model division multiple access (MDMA) technology improves spectrum efficiency and feasibility regions. This implies that the MDMA scheme can achieve greater performance gains than traditional schemes. Relay-assisted cooperative networks, as an infrastructure of wireless communication, can effectively utilize resources and improve performance when MDMA is applied. In this paper, a communication relay cooperative network based on MDMA in dissimilar Rayleigh fading channels is proposed, which consists of two source nodes, any number of decode-and-forward (DF) relay nodes, and one destination node, and uses maximal ratio combining (MRC) at the destination to combine the signals received from the sources and relays. By applying the state transition matrix (STM) and moment generating function (MGF), closed-form analytical solutions for the outage probability and resource utilization efficiency are derived. Theoretical and simulation results are presented to verify the validity of the theoretical analysis.
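
A minimal Monte-Carlo sketch of the destination-side MRC step under dissimilar Rayleigh fading, where each branch SNR is exponentially distributed with its own mean: it estimates the outage probability that the combined SNR falls below a threshold. This illustrates only the combining model, not the paper's full STM/MGF analysis, and all parameter values are illustrative.

```python
import numpy as np

# Under Rayleigh fading, each branch's instantaneous SNR is exponential;
# "dissimilar" fading means the branch means differ. MRC adds branch SNRs.
rng = np.random.default_rng(2)
mean_snrs = np.array([4.0, 2.0, 1.0])   # per-branch average SNR (linear scale)
gamma_th = 3.0                          # outage threshold (linear SNR)
n_trials = 1_000_000

branch = rng.exponential(mean_snrs, size=(n_trials, len(mean_snrs)))
outage = np.mean(branch.sum(axis=1) < gamma_th)
print(f"simulated outage probability: {outage:.4f}")
```

A simulation like this is the usual sanity check for closed-form outage expressions of the kind the paper derives: the Monte-Carlo estimate should converge to the analytical value as the trial count grows.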