cs.SD - 2023-10-07

SA-Paraformer: Non-autoregressive End-to-End Speaker-Attributed ASR

  • paper_url: http://arxiv.org/abs/2310.04863
  • repo_url: None
  • paper_authors: Yangze Li, Fan Yu, Yuhao Liang, Pengcheng Guo, Mohan Shi, Zhihao Du, Shiliang Zhang, Lei Xie
  • for: This work aims to improve joint modeling of automatic speech recognition (ASR) and speaker diarization (SD) to raise the accuracy of meeting transcription.
  • methods: The study adopts the recently proposed non-autoregressive model Paraformer as the acoustic model and introduces a speaker-filling strategy and an inter-CTC strategy to improve performance.
  • results: Experiments show that on the AliMeeting corpus the model reduces the speaker-dependent character error rate (SD-CER) by 6.1% relative over a cascaded SA-ASR model, and matches the SOTA joint AR SA-ASR model's SD-CER of 34.8% with only 1/10 of the RTF.
    Abstract Joint modeling of multi-speaker ASR and speaker diarization has recently shown promising results in speaker-attributed automatic speech recognition (SA-ASR). Although being able to obtain state-of-the-art (SOTA) performance, most of the studies are based on an autoregressive (AR) decoder which generates tokens one-by-one and results in a large real-time factor (RTF). To speed up inference, we introduce a recently proposed non-autoregressive model Paraformer as an acoustic model in the SA-ASR model. Paraformer uses a single-step decoder to enable parallel generation, obtaining comparable performance to the SOTA AR transformer models. Besides, we propose a speaker-filling strategy to reduce speaker identification errors and adopt an inter-CTC strategy to enhance the encoder's ability in acoustic modeling. Experiments on the AliMeeting corpus show that our model outperforms the cascaded SA-ASR model by a 6.1% relative speaker-dependent character error rate (SD-CER) reduction on the test set. Moreover, our model achieves a comparable SD-CER of 34.8% with only 1/10 RTF compared with the SOTA joint AR SA-ASR model.
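
The inter-CTC strategy attaches an auxiliary CTC loss to an intermediate encoder layer so that the encoder itself learns stronger acoustic representations. The PyTorch sketch below illustrates this general idea only; the layer index, loss weights, vocabulary size, shared projection, and Transformer-layer settings are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class EncoderWithInterCTC(nn.Module):
    def __init__(self, dim=256, vocab=4233, n_layers=12, inter_layer=6):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(n_layers))
        self.inter_layer = inter_layer
        self.ctc_proj = nn.Linear(dim, vocab)   # projection shared by both CTC branches
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def ctc_loss(self, x, targets, in_lens, tgt_lens):
        logp = self.ctc_proj(x).log_softmax(-1).transpose(0, 1)  # (T, B, V)
        return self.ctc(logp, targets, in_lens, tgt_lens)

    def forward(self, x, targets, in_lens, tgt_lens):
        inter_loss = torch.tensor(0.0)
        for i, layer in enumerate(self.layers, start=1):
            x = layer(x)
            if i == self.inter_layer:                       # auxiliary CTC on a middle layer
                inter_loss = self.ctc_loss(x, targets, in_lens, tgt_lens)
        final_loss = self.ctc_loss(x, targets, in_lens, tgt_lens)
        return 0.7 * final_loss + 0.3 * inter_loss, x       # weights are illustrative
```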

FM Tone Transfer with Envelope Learning

  • paper_url: http://arxiv.org/abs/2310.04811
  • repo_url: https://github.com/fcaspe/fmtransfer
  • paper_authors: Franco Caspe, Andrew McPherson, Mark Sandler
  • for: This work aims at controlling synthesized audio with musical instruments, improving articulation and phrasing in real-time performance.
  • methods: The paper proposes Envelope Learning, a new approach to designing Tone Transfer models that maps musical events using a training objective at the synthesis-parameter level.
  • results: The method renders note beginnings and endings accurately for a variety of sounds, improving articulation, phrasing, and sound diversity in real-time performance; a VST plugin is implemented for live use.
    Abstract Tone Transfer is a novel deep-learning technique for interfacing a sound source with a synthesizer, transforming the timbre of audio excerpts while keeping their musical form content. Due to its good audio quality results and continuous controllability, it has been recently applied in several audio processing tools. Nevertheless, it still presents several shortcomings related to poor sound diversity, and limited transient and dynamic rendering, which we believe hinder its possibilities of articulation and phrasing in a real-time performance context. In this work, we present a discussion on current Tone Transfer architectures for the task of controlling synthetic audio with musical instruments and discuss their challenges in allowing expressive performances. Next, we introduce Envelope Learning, a novel method for designing Tone Transfer architectures that map musical events using a training objective at the synthesis parameter level. Our technique can render note beginnings and endings accurately and for a variety of sounds; these are essential steps for improving musical articulation, phrasing, and sound diversity with Tone Transfer. Finally, we implement a VST plugin for real-time live use and discuss possibilities for improvement.

Multi-objective Progressive Clustering for Semi-supervised Domain Adaptation in Speaker Verification

  • paper_url: http://arxiv.org/abs/2310.04760
  • repo_url: None
  • paper_authors: Ze Li, Yuke Lin, Ning Jiang, Xiaoyi Qin, Guoqing Zhao, Haiying Wu, Ming Li
  • for: This paper proposes a new semi-supervised domain adaptation method for speaker verification.
  • methods: Limited labeled data from the target domain is used to derive domain-specific descriptors based on multiple objectives; the Infomap algorithm is adopted for embedding clustering, and the descriptors refine the target domain's pseudo-labels. Subcenter-purification and progressive-merging strategies further improve pseudo-label quality.
  • results: The proposed Multi-objective Progressive Clustering (MoPC) method achieves 4.95% EER and ranks 1st on the VoxSRC 2023 track 3 evaluation set; additional experiments on the FFSVC dataset also yield promising results.
    Abstract Utilizing the pseudo-labeling algorithm with large-scale unlabeled data becomes crucial for semi-supervised domain adaptation in speaker verification tasks. In this paper, we propose a novel pseudo-labeling method named Multi-objective Progressive Clustering (MoPC), specifically designed for semi-supervised domain adaptation. Firstly, we utilize limited labeled data from the target domain to derive domain-specific descriptors based on multiple distinct objectives, namely within-graph denoising, intra-class denoising and inter-class denoising. Then, the Infomap algorithm is adopted for embedding clustering, and the descriptors are leveraged to further refine the target domain's pseudo-labels. Moreover, to further improve the quality of pseudo labels, we introduce the subcenter-purification and progressive-merging strategy for label denoising. Our proposed MoPC method achieves 4.95% EER and ranked the 1$^{st}$ place on the evaluation set of VoxSRC 2023 track 3. We also conduct additional experiments on the FFSVC dataset and yield promising results.

An Exploration of Task-decoupling on Two-stage Neural Post Filter for Real-time Personalized Acoustic Echo Cancellation

  • paper_url: http://arxiv.org/abs/2310.04715
  • repo_url: None
  • paper_authors: Zihan Zhang, Jiayao Sun, Xianjun Xia, Ziqian Wang, Xiaopeng Yan, Yijian Xiao, Lei Xie
  • for: This paper explores task-decoupling strategies for personalized acoustic echo cancellation (PAEC) and how multi-scale local-global speaker representations can improve speaker extraction.
  • methods: The paper proposes a two-stage task-decoupling post-filter (TDPF) for PAEC and applies a multi-scale local-global speaker representation to improve speaker extraction.
  • results: Experiments show that the task-decoupled model outperforms a single joint network; the best design decouples echo cancellation from noise and interfering-speech suppression, and optimal training strategies for the two-stage model further improve performance.
    Abstract Deep learning based techniques have been popularly adopted in acoustic echo cancellation (AEC). Utilization of speaker representation has extended the frontier of AEC, thus attracting many researchers' interest in personalized acoustic echo cancellation (PAEC). Meanwhile, task-decoupling strategies are widely adopted in speech enhancement. To further explore the task-decoupling approach, we propose to use a two-stage task-decoupling post-filter (TDPF) in PAEC. Furthermore, a multi-scale local-global speaker representation is applied to improve speaker extraction in PAEC. Experimental results indicate that the task-decoupling model can yield better performance than a single joint network. The optimal approach is to decouple the echo cancellation from noise and interference speech suppression. Based on the task-decoupling sequence, optimal training strategies for the two-stage model are explored afterwards.

Spike-Triggered Contextual Biasing for End-to-End Mandarin Speech Recognition

  • paper_url: http://arxiv.org/abs/2310.04657
  • repo_url: None
  • paper_authors: Kaixun Huang, Ao Zhang, Binbin Zhang, Tianyi Xu, Xingchen Song, Lei Xie
  • for: Improving the recognition of contextual phrases in end-to-end automatic speech recognition (ASR) systems.
  • methods: An attention-based deep contextual biasing method triggered by CTC spikes, supporting both explicit and implicit bias, together with a context sampling enhancement strategy and an improved contextual-phrase filtering algorithm.
  • results: A 32.0% relative CER reduction overall and a 68.6% relative CER reduction on contextual phrases; the method can also be cascaded with shallow fusion for better results.
    Abstract The attention-based deep contextual biasing method has been demonstrated to effectively improve the recognition performance of end-to-end automatic speech recognition (ASR) systems on given contextual phrases. However, unlike shallow fusion methods that directly bias the posterior of the ASR model, deep biasing methods implicitly integrate contextual information, making it challenging to control the degree of bias. In this study, we introduce a spike-triggered deep biasing method that simultaneously supports both explicit and implicit bias. Moreover, both bias approaches exhibit significant improvements and can be cascaded with shallow fusion methods for better results. Furthermore, we propose a context sampling enhancement strategy and improve the contextual phrase filtering algorithm. Experiments on the public WenetSpeech Mandarin biased-word dataset show a 32.0% relative CER reduction compared to the baseline model, with an impressively 68.6% relative CER reduction on contextual phrases.
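
At a high level, spike-triggered biasing lets frames where the CTC output "spikes" on a non-blank token attend to a list of contextual-phrase embeddings. The PyTorch sketch below only illustrates that mechanism; the spike criterion, module shapes, and how the bias is injected into the model are assumptions rather than the paper's actual design.

```python
import torch
import torch.nn as nn

def spike_mask(ctc_logprobs, thresh=-1.0):
    """Frames where some non-blank token (blank assumed at index 0) has high log-prob."""
    return ctc_logprobs[:, :, 1:].max(dim=-1).values > thresh   # (B, T) bool

class ContextBias(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, enc, spikes, phrase_emb):
        """enc: (B,T,D) encoder states; spikes: (B,T) bool; phrase_emb: (B,P,D)."""
        bias, _ = self.attn(enc, phrase_emb, phrase_emb)   # attend to contextual phrases
        return enc + bias * spikes.unsqueeze(-1).float()   # inject bias only at spike frames
```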

Neural2Speech: A Transfer Learning Framework for Neural-Driven Speech Reconstruction

  • paper_url: http://arxiv.org/abs/2310.04644
  • repo_url: https://github.com/cctn-bci/neural2speech
  • paper_authors: Jiawei Li, Chunxu Guo, Li Fu, Lu Fan, Edward F. Chang, Yuanning Li
  • for: Enabling direct communication via brain-computer interfaces by reconstructing speech from neural activity, a key goal of the field.
  • methods: The paper proposes Neural2Speech, a transfer learning framework with two training phases: a speech autoencoder is first pre-trained on readily available speech corpora to decode waveforms from encoded speech representations, and a lightweight adaptor is then trained on small-scale neural recordings to align neural activity with the speech representation.
  • results: Neural2Speech reconstructs speech from only 20 minutes of intracranial data and significantly outperforms baseline methods in speech fidelity and intelligibility.
    Abstract Reconstructing natural speech from neural activity is vital for enabling direct communication via brain-computer interfaces. Previous efforts have explored the conversion of neural recordings into speech using complex deep neural network (DNN) models trained on extensive neural recording data, which is resource-intensive under regular clinical constraints. However, achieving satisfactory performance in reconstructing speech from limited-scale neural recordings has been challenging, mainly due to the complexity of speech representations and the neural data constraints. To overcome these challenges, we propose a novel transfer learning framework for neural-driven speech reconstruction, called Neural2Speech, which consists of two distinct training phases. First, a speech autoencoder is pre-trained on readily available speech corpora to decode speech waveforms from the encoded speech representations. Second, a lightweight adaptor is trained on the small-scale neural recordings to align the neural activity and the speech representation for decoding. Remarkably, our proposed Neural2Speech demonstrates the feasibility of neural-driven speech reconstruction even with only 20 minutes of intracranial data, which significantly outperforms existing baseline methods in terms of speech fidelity and intelligibility.
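
The two-phase recipe can be summarized as: pre-train a speech autoencoder on audio alone, then fit a small adaptor that maps neural recordings into the frozen latent space using the limited paired data. The PyTorch sketch below is a toy illustration of that split; the architectures, feature choices, and loss are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SpeechAutoencoder(nn.Module):            # phase 1: trained on speech corpora only
    def __init__(self, n_mels=80, latent=128):
        super().__init__()
        self.enc = nn.GRU(n_mels, latent, batch_first=True)
        self.dec = nn.GRU(latent, n_mels, batch_first=True)
    def encode(self, mel):
        return self.enc(mel)[0]                # (B, T, latent)
    def decode(self, z):
        return self.dec(z)[0]                  # (B, T, n_mels)

class Adaptor(nn.Module):                      # phase 2: trained on the small paired set
    def __init__(self, n_channels=128, latent=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_channels, 256), nn.ReLU(),
                                 nn.Linear(256, latent))
    def forward(self, neural):
        return self.net(neural)

def adaptor_step(adaptor, frozen_ae, neural, mel, opt):
    with torch.no_grad():
        target_z = frozen_ae.encode(mel)       # latent of the time-aligned speech
    loss = nn.functional.mse_loss(adaptor(neural), target_z)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```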

cs.CV - 2023-10-07

DISCOVER: Making Vision Networks Interpretable via Competition and Dissection

  • paper_url: http://arxiv.org/abs/2310.04929
  • repo_url: None
  • paper_authors: Konstantinos P. Panousis, Sotirios Chatzis
  • for: This paper targets post-hoc interpretability of deep networks, specifically network dissection: the goal is a framework that makes it easier to discover the individual functionality of each neuron in a network trained on a vision task.
  • methods: The approach leverages recent multimodal vision-language models and network layers built on the novel concept of stochastic local competition between linear units. Only a small subset of layer neurons are activated for a given input, yielding extremely sparse activations (as low as $\approx 4\%$); the proposed method infers these (sparse) activation patterns so that neurons activate/specialize to inputs with specific characteristics.
  • results: Experiments show that (i) the method retains or improves the classification performance of vision networks, and (ii) it realizes a principled framework for text-based description and examination of the generated neuronal representations.
    Abstract Modern deep networks are highly complex and their inferential outcome very hard to interpret. This is a serious obstacle to their transparent deployment in safety-critical or bias-aware applications. This work contributes to post-hoc interpretability, and specifically Network Dissection. Our goal is to present a framework that makes it easier to discover the individual functionality of each neuron in a network trained on a vision task; discovery is performed in terms of textual description generation. To achieve this objective, we leverage: (i) recent advances in multimodal vision-text models and (ii) network layers founded upon the novel concept of stochastic local competition between linear units. In this setting, only a small subset of layer neurons are activated for a given input, leading to extremely high activation sparsity (as low as only $\approx 4\%$). Crucially, our proposed method infers (sparse) neuron activation patterns that enables the neurons to activate/specialize to inputs with specific characteristics, diversifying their individual functionality. This capacity of our method supercharges the potential of dissection processes: human understandable descriptions are generated only for the very few active neurons, thus facilitating the direct investigation of the network's decision process. As we experimentally show, our approach: (i) yields Vision Networks that retain or improve classification performance, and (ii) realizes a principled framework for text-based description and examination of the generated neuronal representations.
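
The sparse activations come from local competition: linear units are grouped into small blocks and, for each input, only the winning unit in a block stays active. The sketch below shows a deterministic (hard arg-max) simplification of this idea in PyTorch; the paper's layers are stochastic and trained differently, so treat this purely as an illustration of why activations become sparse.

```python
import torch
import torch.nn as nn

class LocalWinnerTakeAll(nn.Module):
    def __init__(self, in_dim, n_blocks, units_per_block):
        super().__init__()
        self.linear = nn.Linear(in_dim, n_blocks * units_per_block)
        self.n_blocks, self.U = n_blocks, units_per_block

    def forward(self, x):
        h = self.linear(x).view(x.shape[0], self.n_blocks, self.U)
        winners = h.argmax(dim=-1, keepdim=True)            # one winner per block
        mask = torch.zeros_like(h).scatter_(-1, winners, 1.0)
        return (h * mask).view(x.shape[0], -1)               # all losing units are zeroed

x = torch.randn(4, 512)
layer = LocalWinnerTakeAll(512, n_blocks=64, units_per_block=4)
y = layer(x)
print((y != 0).float().mean())   # roughly 1/U of the units are active per input
```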

DynamicBEV: Leveraging Dynamic Queries and Temporal Context for 3D Object Detection

  • paper_url: http://arxiv.org/abs/2310.05989
  • repo_url: None
  • paper_authors: Jiawei Yao, Yingxin Lai
  • for: This work aims to improve the accuracy and efficiency of 3D object detection from BEV (Bird's Eye View) images while adapting to complex spatial-temporal relationships in the scene.
  • methods: The paper proposes DynamicBEV, which replaces static queries with dynamic queries. It combines K-means clustering and Top-K Attention to aggregate information from both local and distant features, adds a Lightweight Temporal Fusion Module (LTFM) for efficient temporal context integration with reduced computation, and uses a custom Diversity Loss to balance feature representation across scenarios.
  • results: Experiments on the nuScenes dataset validate the effectiveness of DynamicBEV and establish a new state of the art for query-based BEV object detection.
    Abstract 3D object detection is crucial for applications like autonomous driving and robotics. While query-based 3D object detection for BEV (Bird's Eye View) images has seen significant advancements, most existing methods follows the paradigm of static query. Such paradigm is incapable of adapting to complex spatial-temporal relationships in the scene. To solve this problem, we introduce a new paradigm in DynamicBEV, a novel approach that employs dynamic queries for BEV-based 3D object detection. In contrast to static queries, the proposed dynamic queries exploit K-means clustering and Top-K Attention in a creative way to aggregate information more effectively from both local and distant feature, which enable DynamicBEV to adapt iteratively to complex scenes. To further boost efficiency, DynamicBEV incorporates a Lightweight Temporal Fusion Module (LTFM), designed for efficient temporal context integration with a significant computation reduction. Additionally, a custom-designed Diversity Loss ensures a balanced feature representation across scenarios. Extensive experiments on the nuScenes dataset validate the effectiveness of DynamicBEV, establishing a new state-of-the-art and heralding a paradigm-level breakthrough in query-based BEV object detection.
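
Top-K Attention restricts each query to its K highest-scoring keys, which keeps aggregation cheap while still reaching distant features. Below is a minimal PyTorch sketch of that mechanism alone; the paper's full module additionally groups features with K-means and updates the queries dynamically, which is not reproduced here.

```python
import torch

def topk_attention(q, k, v, K=8):
    """q: (B, Nq, D); k, v: (B, Nk, D). Each query attends to its K best keys only."""
    scores = q @ k.transpose(1, 2) / q.shape[-1] ** 0.5          # (B, Nq, Nk)
    topv, topi = scores.topk(K, dim=-1)
    attn = torch.softmax(topv, dim=-1)                           # softmax over the K kept keys
    gathered = torch.gather(
        v.unsqueeze(1).expand(-1, q.shape[1], -1, -1), 2,
        topi.unsqueeze(-1).expand(-1, -1, -1, v.shape[-1]))      # (B, Nq, K, D)
    return (attn.unsqueeze(-1) * gathered).sum(dim=2)            # (B, Nq, D)
```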

$H$-RANSAC, an algorithmic variant for Homography image transform from featureless point sets: application to video-based football analytics

  • paper_url: http://arxiv.org/abs/2310.04912
  • repo_url: https://github.com/gnousias/h-ransac
  • paper_authors: George Nousias, Konstantinos Delibasis, Ilias Maglogiannis
  • for: This paper addresses image matching, specifically estimating the homography transform between two images.
  • methods: The method is a generalized RANSAC variant with additional plausibility checks tailored to featureless point sets, improving accuracy and robustness.
  • results: Tested on a large dataset of images captured by 12 cameras during real football matches, the method outperforms state-of-the-art implementations in accuracy and in the number of successfully processed image pairs.
    Abstract Estimating homography matrix between two images has various applications like image stitching or image mosaicing and spatial information retrieval from multiple camera views, but has been proved to be a complicated problem, especially in cases of radically different camera poses and zoom factors. Many relevant approaches have been proposed, utilizing direct feature based, or deep learning methodologies. In this paper, we propose a generalized RANSAC algorithm, H-RANSAC, to retrieve homography image transformations from sets of points without descriptive local feature vectors and point pairing. We allow the points to be optionally labelled in two classes. We propose a robust criterion that rejects implausible point selection before each iteration of RANSAC, based on the type of the quadrilaterals formed by random point pair selection (convex or concave and (non)-self-intersecting). A similar post-hoc criterion rejects implausible homography transformations is included at the end of each iteration. The expected maximum iterations of $H$-RANSAC are derived for different probabilities of success, according to the number of points per image and per class, and the percentage of outliers. The proposed methodology is tested on a large dataset of images acquired by 12 cameras during real football matches, where radically different views at each timestamp are to be matched. Comparisons with state-of-the-art implementations of RANSAC combined with classic and deep learning image salient point detection indicates the superiority of the proposed $H$-RANSAC, in terms of average reprojection error and number of successfully processed pairs of frames, rendering it the method of choice in cases of image homography alignment with few tens of points, while local features are not available, or not descriptive enough. The implementation of $H$-RANSAC is available in https://github.com/gnousias/H-RANSAC
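
The core idea is a standard homography RANSAC loop with an extra plausibility check on each random 4-point sample: if the sampled quadrilateral is concave or self-intersecting, the sample is rejected before fitting. The NumPy sketch below illustrates that loop; the convexity test, thresholds, and the omission of the paper's optional class labels and post-hoc checks are simplifying assumptions.

```python
import numpy as np

def dlt_homography(src, dst):
    """Fit a 3x3 homography to 4+ point pairs with the DLT algorithm."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

def is_plausible_quad(pts):
    """True if the 4 points, in sampled order, form a convex, non-self-intersecting quad."""
    signs = []
    for i in range(4):
        a, b, c = pts[i], pts[(i + 1) % 4], pts[(i + 2) % 4]
        cross = (b[0] - a[0]) * (c[1] - b[1]) - (b[1] - a[1]) * (c[0] - b[0])
        signs.append(np.sign(cross))
    return abs(sum(signs)) == 4   # all turns have the same orientation

def h_ransac(src, dst, iters=2000, thresh=3.0, seed=0):
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    rng = np.random.default_rng(seed)
    src_h = np.hstack([src, np.ones((len(src), 1))])
    best_H, best_inliers = None, 0
    for _ in range(iters):
        idx = rng.choice(len(src), 4, replace=False)
        # pre-iteration check: reject implausible 4-point samples in either image
        if not (is_plausible_quad(src[idx]) and is_plausible_quad(dst[idx])):
            continue
        H = dlt_homography(src[idx], dst[idx])
        proj = src_h @ H.T
        proj = proj[:, :2] / proj[:, 2:3]
        inliers = int((np.linalg.norm(proj - dst, axis=1) < thresh).sum())
        if inliers > best_inliers:
            best_H, best_inliers = H, inliers
    return best_H, best_inliers
```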

WAIT: Feature Warping for Animation to Illustration video Translation using GANs

  • paper_url: http://arxiv.org/abs/2310.04901
  • repo_url: https://github.com/giddyyupp/wait
  • paper_authors: Samet Hicsonmez, Nermin Samet, Fidan Samet, Oguz Bakir, Emre Akbas, Pinar Duygulu
  • for: This work explores a new video-to-video translation setting: translating animation movies into the style of the original book illustrations.
  • methods: The paper formulates a new video translation problem in which an unordered set of images is used to stylize the input video. This is challenging because there is no temporal consistency to exploit and obtaining consistent styles from multiple images is harder than from a single image.
  • results: The paper proposes a new generator network, built on an image-to-image translation model, with feature warping layers that ensure temporal coherence. Effectiveness is shown qualitatively and quantitatively on three datasets; code and pretrained models are available on GitHub.
    Abstract In this paper, we explore a new domain for video-to-video translation. Motivated by the availability of animation movies that are adopted from illustrated books for children, we aim to stylize these videos with the style of the original illustrations. Current state-of-the-art video-to-video translation models rely on having a video sequence or a single style image to stylize an input video. We introduce a new problem for video stylizing where an unordered set of images are used. This is a challenging task for two reasons: i) we do not have the advantage of temporal consistency as in video sequences; ii) it is more difficult to obtain consistent styles for video frames from a set of unordered images compared to using a single image. Most of the video-to-video translation methods are built on an image-to-image translation model, and integrate additional networks such as optical flow, or temporal predictors to capture temporal relations. These additional networks make the model training and inference complicated and slow down the process. To ensure temporal coherency in video-to-video style transfer, we propose a new generator network with feature warping layers which overcomes the limitations of the previous methods. We show the effectiveness of our method on three datasets both qualitatively and quantitatively. Code and pretrained models are available at https://github.com/giddyyupp/wait.

HowToCaption: Prompting LLMs to Transform Video Annotations at Scale

  • paper_url: http://arxiv.org/abs/2310.04900
  • repo_url: https://github.com/ninatu/howtocaption
  • paper_authors: Nina Shvetsova, Anna Kukleva, Xudong Hong, Christian Rupprecht, Bernt Schiele, Hilde Kuehne
  • for: This paper is designed to improve the performance of text-video models.
  • methods: A large language model (LLM) is prompted to generate video descriptions from ASR subtitles, reducing the noise of raw narration without human annotation; the prompting takes longer subtitle context into account and produces timestamps to align captions with the video.
  • results: The method yields large-scale human-style video captions without human supervision (the HowToCaption dataset built from HowTo100M), improving text-video retrieval performance across benchmarks and disentangling textual narration from the audio.
    Abstract Instructional videos are an excellent source for learning multimodal representations by leveraging video-subtitle pairs extracted with automatic speech recognition systems (ASR) from the audio signal in the videos. However, in contrast to human-annotated captions, both speech and subtitles naturally differ from the visual content of the videos and thus provide only noisy supervision for multimodal learning. As a result, large-scale annotation-free web video training data remains sub-optimal for training text-video models. In this work, we propose to leverage the capability of large language models (LLMs) to obtain fine-grained video descriptions aligned with videos. Specifically, we prompt an LLM to create plausible video descriptions based on ASR narrations of the video for a large-scale instructional video dataset. To this end, we introduce a prompting method that is able to take into account a longer text of subtitles, allowing us to capture context beyond a single sentence. To align the captions to the video temporally, we prompt the LLM to generate timestamps for each produced caption based on the subtitles. In this way, we obtain human-style video captions at scale without human supervision. We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption. Our evaluation shows that the resulting captions not only significantly improve the performance over many different benchmark datasets for text-video retrieval but also lead to a disentangling of textual narration from the audio, boosting performance in text-video-audio tasks.
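
The pipeline boils down to: feed a block of timestamped ASR subtitles to an LLM, ask it for visual captions together with the timestamps they refer to, and parse the result. The sketch below shows one possible prompt construction and parsing step; the prompt wording, output format, and the `llm` callable are placeholders and not the prompt used in the paper.

```python
def build_caption_prompt(subtitles):
    """subtitles: list of (start_sec, end_sec, text) tuples from ASR."""
    lines = [f"[{s:.1f}-{e:.1f}] {t}" for s, e, t in subtitles]
    return (
        "The following are timestamped speech transcripts from an instructional video.\n"
        + "\n".join(lines)
        + "\nWrite short visual captions describing what is shown, one per line, "
          "each prefixed with [start-end] giving the seconds it refers to."
    )

def caption_block(subtitles, llm):
    """`llm` is any text-in/text-out callable (e.g., a wrapper around an API client)."""
    raw = llm(build_caption_prompt(subtitles))
    captions = []
    for line in raw.splitlines():
        if line.startswith("[") and "]" in line:
            ts, text = line[1:].split("]", 1)
            captions.append((float(ts.split("-")[0]), text.strip()))
    return captions   # list of (timestamp_sec, caption)
```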

Machine Learning for Automated Mitral Regurgitation Detection from Cardiac Imaging

  • paper_url: http://arxiv.org/abs/2310.04871
  • repo_url: None
  • paper_authors: Ke Xiao, Erik Learned-Miller, Evangelos Kalogerakis, James Priest, Madalina Fiterau
  • for: Mitral regurgitation (MR) detection from cardiac imaging.
  • methods: A semi-supervised model, CUSSP, operating on cardiac imaging slices of the 4-chamber view. It combines standard computer vision techniques and contrastive models to learn from large amounts of unlabeled data, together with specialized classifiers, to build the first automated MR classification system.
  • results: On the test set, CUSSP attains an F1 score of 0.69 and a ROC-AUC score of 0.88, setting the first benchmark result for this new task.
    Abstract Mitral regurgitation (MR) is a heart valve disease with potentially fatal consequences that can only be forestalled through timely diagnosis and treatment. Traditional diagnosis methods are expensive, labor-intensive and require clinical expertise, posing a barrier to screening for MR. To overcome this impediment, we propose a new semi-supervised model for MR classification called CUSSP. CUSSP operates on cardiac imaging slices of the 4-chamber view of the heart. It uses standard computer vision techniques and contrastive models to learn from large amounts of unlabeled data, in conjunction with specialized classifiers to establish the first ever automated MR classification system. Evaluated on a test set of 179 labeled -- 154 non-MR and 25 MR -- sequences, CUSSP attains an F1 score of 0.69 and a ROC-AUC score of 0.88, setting the first benchmark result for this new task.

Exploiting Facial Relationships and Feature Aggregation for Multi-Face Forgery Detection

  • paper_url: http://arxiv.org/abs/2310.04845
  • repo_url: None
  • paper_authors: Chenhao Lin, Fangbin Yi, Hang Wang, Qian Li, Deng Jingyi, Chao Shen
  • for: Detecting multi-face forgeries.
  • methods: A facial relationships learning module that generates distinguishable local features for each face, and a global feature aggregation module that leverages mutual constraints between global and local information.
  • results: State-of-the-art multi-face forgery detection performance on two publicly available multi-face forgery datasets.
    Abstract Face forgery techniques have emerged as a forefront concern, and numerous detection approaches have been proposed to address this challenge. However, existing methods predominantly concentrate on single-face manipulation detection, leaving the more intricate and realistic realm of multi-face forgeries relatively unexplored. This paper proposes a novel framework explicitly tailored for multi-face forgery detection, filling a critical gap in the current research. The framework mainly involves two modules: (i) a facial relationships learning module, which generates distinguishable local features for each face within images, and (ii) a global feature aggregation module that leverages the mutual constraints between global and local information to enhance forgery detection accuracy. Our experimental results on two publicly available multi-face forgery datasets demonstrate that the proposed approach achieves state-of-the-art performance in multi-face forgery detection scenarios.

Extract-Transform-Load for Video Streams

  • paper_url: http://arxiv.org/abs/2310.04830
  • repo_url: https://github.com/ferdiko/vetl
  • paper_authors: Ferdinand Kossmann, Ziniu Wu, Eugenie Lai, Nesime Tatbul, Lei Cao, Tim Kraska, Samuel Madden
  • for: This paper addresses the cost of storing and querying video for large-scale video analytics.
  • methods: The paper proposes Skyscraper, a system for adaptive Video Extract-Transform-Load (V-ETL) that reduces storage and processing cost. Skyscraper executes arbitrary video ingestion pipelines and adaptively tunes them, e.g., by adjusting sampling rates and resolutions to the ingested content, and uses cheap on-premises compute with buffering and cloud bursting to handle workload peaks.
  • results: In experiments, Skyscraper significantly reduces the cost of V-ETL ingestion compared to adaptations of current state-of-the-art systems while providing throughput guarantees that those systems lack.
    Abstract Social media, self-driving cars, and traffic cameras produce video streams at large scales and cheap cost. However, storing and querying video at such scales is prohibitively expensive. We propose to treat large-scale video analytics as a data warehousing problem: Video is a format that is easy to produce but needs to be transformed into an application-specific format that is easy to query. Analogously, we define the problem of Video Extract-Transform-Load (V-ETL). V-ETL systems need to reduce the cost of running a user-defined V-ETL job while also giving throughput guarantees to keep up with the rate at which data is produced. We find that no current system sufficiently fulfills both needs and therefore propose Skyscraper, a system tailored to V-ETL. Skyscraper can execute arbitrary video ingestion pipelines and adaptively tunes them to reduce cost at minimal or no quality degradation, e.g., by adjusting sampling rates and resolutions to the ingested content. Skyscraper can hereby be provisioned with cheap on-premises compute and uses a combination of buffering and cloud bursting to deal with peaks in workload caused by expensive processing configurations. In our experiments, we find that Skyscraper significantly reduces the cost of V-ETL ingestion compared to adaptions of current SOTA systems, while at the same time giving robustness guarantees that these systems are lacking.

How to effectively train an ensemble of Faster R-CNN object detectors to quantify uncertainty

  • paper_url: http://arxiv.org/abs/2310.04829
  • repo_url: https://github.com/akola-mbey-denis/efficientensemble
  • paper_authors: Denis Mbey Akola, Gianni Franchi
  • for: This paper proposes a new method for training two-stage object detection ensembles, specifically Faster R-CNN models, to estimate uncertainty.
  • methods: The authors propose training one Region Proposal Network (RPN) with multiple Fast R-CNN prediction heads to build a robust deep ensemble network for estimating uncertainty in object detection.
  • results: The approach is shown experimentally to be much faster than the naive method of fully training all $n$ models in the ensemble. Uncertainty is estimated via the ensemble model's Expected Calibration Error (ECE), and performance is compared with Gaussian YOLOv3.
    Abstract This paper presents a new approach for training two-stage object detection ensemble models, more specifically, Faster R-CNN models to estimate uncertainty. We propose that training one Region Proposal Network (RPN)~\cite{https://doi.org/10.48550/arxiv.1506.01497} and multiple Fast R-CNN prediction heads is all you need to build a robust deep ensemble network for estimating uncertainty in object detection. We present this approach and provide experiments to show that this approach is much faster than the naive method of fully training all $n$ models in an ensemble. We also estimate the uncertainty by measuring this ensemble model's Expected Calibration Error (ECE). We then further compare the performance of this model with that of Gaussian YOLOv3, a variant of YOLOv3 that models uncertainty using predicted bounding box coordinates. The source code is released at \url{https://github.com/Akola-Mbey-Denis/EfficientEnsemble}
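
Expected Calibration Error (ECE) is the uncertainty metric the report measures on the ensemble: predictions are binned by confidence, and the gap between each bin's accuracy and its average confidence is weighted by the bin's size. A minimal NumPy sketch follows (the equal-width binning scheme is an assumption):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: (N,) max softmax scores; correct: (N,) 0/1 hit indicators."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()        # empirical accuracy in the bin
            conf = confidences[mask].mean()   # average confidence in the bin
            ece += mask.mean() * abs(acc - conf)
    return ece
```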

Comparative study of multi-person tracking methods

  • paper_url: http://arxiv.org/abs/2310.04825
  • repo_url: None
  • paper_authors: Denis Mbey Akola
  • for: This paper studies two tracking algorithms (SORT and Tracktor++) that ranked in the first positions on the MOT Challenge leaderboard (https://motchallenge.net).
  • methods: The authors adopt the popular tracking-by-detection approach, train their own pedestrian detection model on the MOT17Det dataset (https://motchallenge.net/data/MOT17Det/), and use a re-identification model trained on the MOT17 dataset (https://motchallenge.net/data/MOT17/) to reduce false re-identification alarms in Tracktor++.
  • results: Experimental results show that Tracktor++ is a better multi-person tracking algorithm than SORT. Ablation studies quantify the contribution of the re-identification (RE-ID) network and motion to Tracktor++'s results, and recommendations for future research are provided.
    Abstract This paper presents a study of two tracking algorithms (SORT~\cite{7533003} and Tracktor++~\cite{2019}) that ranked in the first positions on the MOT Challenge leaderboard (The MOTChallenge web page: https://motchallenge.net). The purpose of this study is to discover the techniques used and to provide useful insights about these algorithms in the tracking pipeline that could improve the performance of MOT tracking algorithms. To this end, we adopted the popular tracking-by-detection approach. We trained our own Pedestrian Detection model using the MOT17Det dataset (MOT17Det: https://motchallenge.net/data/MOT17Det/). We also used a re-identification model trained on the MOT17 dataset (MOT17: https://motchallenge.net/data/MOT17/) for Tracktor++ to reduce the false re-identification alarms. We then present experimental results which show that Tracktor++ is a better multi-person tracking algorithm than SORT. We also performed ablation studies to discover the contribution of the re-identification (RE-ID) network and motion to the results of Tracktor++. We finally conclude by providing some recommendations for future research.

Combining UPerNet and ConvNeXt for Contrails Identification to reduce Global Warming

  • paper_url: http://arxiv.org/abs/2310.04808
  • repo_url: https://github.com/biluko/2023gric
  • paper_authors: Zhenkuan Wang
  • for: This study focuses on aircraft contrail detection in global satellite images to improve contrail models and mitigate their impact on climate change.
  • methods: An innovative data preprocessing technique for NOAA GOES-16 satellite images is developed, using brightness temperature data from the infrared channel to create false-color images, enhancing model perception. The model is based on the UPerNet architecture, implemented with the MMsegmentation library, integrating two ConvNeXt configurations for improved performance.
  • results: The approach achieves exceptional results, with a high Dice coefficient score that places it in the top 5% of participating teams.
    Abstract Semantic segmentation is a critical tool in computer vision, applied in various domains like autonomous driving and medical imaging. This study focuses on aircraft contrail detection in global satellite images to improve contrail models and mitigate their impact on climate change. An innovative data preprocessing technique for NOAA GOES-16 satellite images is developed, using brightness temperature data from the infrared channel to create false-color images, enhancing model perception. To tackle class imbalance, the training dataset exclusively includes images with positive contrail labels. The model selection is based on the UPerNet architecture, implemented using the MMsegmentation library, with the integration of two ConvNeXt configurations for improved performance. Cross-entropy loss with positive class weights enhances contrail recognition. Fine-tuning employs the AdamW optimizer with a learning rate of $2.5 \times 10^{-4}$. During inference, a multi-model prediction fusion strategy and a contrail determination threshold of 0.75 yield a binary prediction mask. RLE encoding is used for efficient prediction result organization. The approach achieves exceptional results, boasting a high Dice coefficient score, placing it in the top 5\% of participating teams. This underscores the innovative nature of the segmentation model and its potential for enhanced contrail recognition in satellite imagery. For further exploration, the code and models are available on GitHub: \url{https://github.com/biluko/2023GRIC.git}.
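
Run-length encoding (RLE) of the binary prediction mask is mentioned for organizing results. Below is a minimal NumPy sketch in the common Kaggle-style convention (column-major flattening, 1-indexed starts), which may differ in detail from the competition's exact format:

```python
import numpy as np

def rle_encode(mask):
    """mask: 2-D binary array -> RLE string of 'start length' pairs (1-indexed)."""
    flat = mask.flatten(order="F").astype(np.uint8)        # column-major flatten
    padded = np.concatenate([[0], flat, [0]])
    changes = np.where(padded[1:] != padded[:-1])[0] + 1   # boundaries of the runs of 1s
    starts, ends = changes[::2], changes[1::2]
    return " ".join(f"{s} {e - s}" for s, e in zip(starts, ends))

pred = np.zeros((4, 4), dtype=np.uint8)
pred[1:3, 1:3] = 1
print(rle_encode(pred))   # "6 2 10 2"
```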

Fully Sparse Long Range 3D Object Detection Using Range Experts and Multimodal Virtual Points

  • paper_url: http://arxiv.org/abs/2310.04800
  • repo_url: None
  • paper_authors: Ajinkya Khoche, Laura Pereira Sánchez, Nazre Batool, Sina Sharif Mansouri, Patric Jensfelt
  • for: Improving the safety and efficiency of autonomous vehicles by accurately detecting and reacting to objects, obstacles, and potential hazards at long range.
  • methods: Combining two LiDAR-based 3D detection networks, one specializing in near to mid-range objects and one in long-range 3D detection, with Multimodal Virtual Points (MVP) to enrich the sparse data with image-based virtual points; the loss is weighted by each labeled object's distance from the ego vehicle to cope with scarce long-range labels.
  • results: Achieves state-of-the-art performance on the Argoverse2 (AV2) dataset, with improvements at long range.
    Abstract 3D object detection at long-range is crucial for ensuring the safety and efficiency of self-driving cars, allowing them to accurately perceive and react to objects, obstacles, and potential hazards from a distance. But most current state-of-the-art LiDAR based methods are limited by the sparsity of range sensors, which generates a form of domain gap between points closer to and farther away from the ego vehicle. Another related problem is the label imbalance for faraway objects, which inhibits the performance of Deep Neural Networks at long-range. Although image features could be beneficial for long-range detections, and some recently proposed multimodal methods incorporate image features, they do not scale well computationally at long ranges or are limited by depth estimation accuracy. To address the above limitations, we propose to combine two LiDAR based 3D detection networks, one specializing at near to mid-range objects, and one at long-range 3D detection. To train a detector at long range under a scarce label regime, we further propose to weigh the loss according to the labelled objects' distance from ego vehicle. To mitigate the LiDAR sparsity issue, we leverage Multimodal Virtual Points (MVP), an image based depth completion algorithm, to enrich our data with virtual points. Our method, combining two range experts trained with MVP, which we refer to as RangeFSD, achieves state-of-the-art performance on the Argoverse2 (AV2) dataset, with improvements at long range. The code will be released soon.
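
Weighting the loss by each labeled object's distance from the ego vehicle up-weights the scarce long-range labels. The PyTorch sketch below shows one simple linear weighting; the actual weighting function is not specified here, so this form is an assumption.

```python
import torch

def distance_weighted_loss(per_object_loss, object_xy, ref_range=50.0):
    """per_object_loss: (N,) losses; object_xy: (N, 2) box centers in the ego frame."""
    dist = object_xy.norm(dim=1)                  # distance from the ego vehicle
    weights = 1.0 + dist / ref_range              # up-weight far (label-scarce) objects
    return (weights * per_object_loss).sum() / weights.sum()
```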

HI-SLAM: Monocular Real-time Dense Mapping with Hybrid Implicit Fields

  • paper_url: http://arxiv.org/abs/2310.04787
  • repo_url: None
  • paper_authors: Wei Zhang, Tiecheng Sun, Sen Wang, Qing Cheng, Norbert Haala
  • for: This paper presents a neural-field-based real-time monocular dense mapping framework for accurate and dense simultaneous localization and mapping (SLAM).
  • methods: The approach integrates dense SLAM, which runs parallel tracking and global optimization, with neural implicit fields: a neural-field map is constructed incrementally from the latest SLAM estimates using multi-resolution grid encoding and a signed distance function (SDF) representation. This keeps the map up-to-date and lets it adapt instantly to global updates via loop closing; an efficient Sim(3)-based pose graph bundle adjustment (PGBA) performs online loop closing, and a joint depth and scale adjustment (JDSA) module incorporates learned monocular depth priors while resolving their scale ambiguity.
  • results: Evaluations on synthetic and real-world datasets show that the approach outperforms existing methods in accuracy and map completeness while preserving real-time performance.
    Abstract In this letter, we present a neural field-based real-time monocular mapping framework for accurate and dense Simultaneous Localization and Mapping (SLAM). Recent neural mapping frameworks show promising results, but rely on RGB-D or pose inputs, or cannot run in real-time. To address these limitations, our approach integrates dense-SLAM with neural implicit fields. Specifically, our dense SLAM approach runs parallel tracking and global optimization, while a neural field-based map is constructed incrementally based on the latest SLAM estimates. For the efficient construction of neural fields, we employ multi-resolution grid encoding and signed distance function (SDF) representation. This allows us to keep the map always up-to-date and adapt instantly to global updates via loop closing. For global consistency, we propose an efficient Sim(3)-based pose graph bundle adjustment (PGBA) approach to run online loop closing and mitigate the pose and scale drift. To enhance depth accuracy further, we incorporate learned monocular depth priors. We propose a novel joint depth and scale adjustment (JDSA) module to solve the scale ambiguity inherent in depth priors. Extensive evaluations across synthetic and real-world datasets validate that our approach outperforms existing methods in accuracy and map completeness while preserving real-time performance.

IPMix: Label-Preserving Data Augmentation Method for Training Robust Classifiers

  • paper_url: http://arxiv.org/abs/2310.04780
  • repo_url: https://github.com/hzlsaber/IPMix
  • paper_authors: Zhenglin Huang, Xianan Bao, Na Zhang, Qingqi Zhang, Xiaomei Tu, Biao Wu, Xi Yang
  • for: Improving the trade-off between robustness and clean accuracy of convolutional neural networks.
  • methods: IPMix integrates three levels of data augmentation (image-level, patch-level, and pixel-level) into a coherent, label-preserving technique that increases training-data diversity with limited computational overhead; it introduces structural complexity at different levels to generate more diverse images and adopts random mixing for multi-scale information fusion.
  • results: IPMix outperforms state-of-the-art corruption robustness on CIFAR-C and ImageNet-C and significantly improves other safety measures, including robustness to adversarial perturbations, calibration, prediction consistency, and anomaly detection, achieving state-of-the-art or comparable results on several benchmarks, including ImageNet-R, ImageNet-A, and ImageNet-O.
    Abstract Data augmentation has been proven effective for training high-accuracy convolutional neural network classifiers by preventing overfitting. However, building deep neural networks in real-world scenarios requires not only high accuracy on clean data but also robustness when data distributions shift. While prior methods have proposed that there is a trade-off between accuracy and robustness, we propose IPMix, a simple data augmentation approach to improve robustness without hurting clean accuracy. IPMix integrates three levels of data augmentation (image-level, patch-level, and pixel-level) into a coherent and label-preserving technique to increase the diversity of training data with limited computational overhead. To further improve the robustness, IPMix introduces structural complexity at different levels to generate more diverse images and adopts the random mixing method for multi-scale information fusion. Experiments demonstrate that IPMix outperforms state-of-the-art corruption robustness on CIFAR-C and ImageNet-C. In addition, we show that IPMix also significantly improves the other safety measures, including robustness to adversarial perturbations, calibration, prediction consistency, and anomaly detection, achieving state-of-the-art or comparable results on several benchmarks, including ImageNet-R, ImageNet-A, and ImageNet-O.
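
To make the three mixing granularities concrete, the NumPy sketch below blends an image with an augmented view of itself at the image, patch, and pixel level, which keeps the label unchanged (here `augment` is any label-preserving transform, e.g., a random rotation). The real IPMix pipeline uses different mixing sources, operators, and schedules, so this is only an illustration of the label-preserving multi-level idea.

```python
import numpy as np

rng = np.random.default_rng(0)

def image_level_mix(x, y, lam=0.7):
    # blend the two images as a whole
    return lam * x + (1.0 - lam) * y

def patch_level_mix(x, y, patch=8, p=0.3):
    # replace random square patches of x with the corresponding patches of y
    out = x.copy()
    H, W = x.shape[:2]
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            if rng.random() < p:
                out[i:i+patch, j:j+patch] = y[i:i+patch, j:j+patch]
    return out

def pixel_level_mix(x, y, p=0.3):
    # replace individual pixels of x with pixels of y
    mask = rng.random(x.shape[:2]) < p
    return np.where(mask[..., None], y, x)

def ipmix_like(x, augment):
    # label-preserving: the partner is an augmented view of the same image,
    # so the label of the mixed result stays the label of x
    y = augment(x)
    ops = [image_level_mix, patch_level_mix, pixel_level_mix]
    return ops[rng.integers(len(ops))](x, y)
```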

TransCC: Transformer Network for Coronary Artery CCTA Segmentation

  • paper_url: http://arxiv.org/abs/2310.04779
  • repo_url: None
  • paper_authors: Chenchu Xu, Meng Li, Xue Wu
  • for: This study aims to improve the accurate segmentation of coronary computed tomography angiography (CCTA) images for early detection and treatment of coronary heart disease (CHD).
  • methods: The paper combines the Transformer's self-attention mechanism with convolutional neural networks to address two challenges of coronary segmentation: damage to local target structures caused by fixed-size image patch embedding, and the need to exploit both global and local features. A Feature Interaction Extraction (FIE) module captures the characteristics of image patches to avoid losing semantic information, and a Multilayer Enhanced Perceptron (MEP) augments attention to local spatial information as a complement to self-attention.
  • results: Experimental results show that TransCC outperforms existing methods, with an average Dice coefficient of 0.730 and an average Intersection over Union (IoU) of 0.582, demonstrating its effectiveness for CCTA image segmentation.
    Abstract The accurate segmentation of Coronary Computed Tomography Angiography (CCTA) images holds substantial clinical value for the early detection and treatment of Coronary Heart Disease (CHD). The Transformer, utilizing a self-attention mechanism, has demonstrated commendable performance in the realm of medical image processing. However, challenges persist in coronary segmentation tasks due to (1) the damage to target local structures caused by fixed-size image patch embedding, and (2) the critical role of both global and local features in medical image segmentation tasks. To address these challenges, we propose a deep learning framework, TransCC, that effectively amalgamates the Transformer and convolutional neural networks for CCTA segmentation. Firstly, we introduce a Feature Interaction Extraction (FIE) module designed to capture the characteristics of image patches, thereby circumventing the loss of semantic information inherent in the original method. Secondly, we devise a Multilayer Enhanced Perceptron (MEP) to augment attention to local information within spatial dimensions, serving as a complement to the self-attention mechanism. Experimental results indicate that TransCC outperforms existing methods in segmentation performance, boasting an average Dice coefficient of 0.730 and an average Intersection over Union (IoU) of 0.582. These results underscore the effectiveness of TransCC in CCTA image segmentation.

1st Place Solution of Egocentric 3D Hand Pose Estimation Challenge 2023 Technical Report:A Concise Pipeline for Egocentric Hand Pose Reconstruction

  • paper_url: http://arxiv.org/abs/2310.04769
  • repo_url: None
  • paper_authors: Zhishan Zhou, Zhi Lv, Shihao Zhou, Minqiang Zou, Tong Wu, Mochen Yu, Yao Tang, Jiajun Liang
  • for: Egocentric 3D Hand Pose Estimation challenge
  • methods: ViT backbones and simple regressor for 3D keypoints prediction, non-model method for merging multi-view results
  • results: 12.21mm MPJPE on test dataset, first place in challenge
    Abstract This report introduces our work for the Egocentric 3D Hand Pose Estimation workshop. Using AssemblyHands, this challenge focuses on egocentric 3D hand pose estimation from a single-view image. In the competition, we adopt ViT-based backbones and a simple regressor for 3D keypoint prediction, which provides strong model baselines. We noticed that hand-object occlusions and self-occlusions lead to performance degradation, and thus proposed a non-model method to merge multi-view results in the post-processing stage. Moreover, we utilized test-time augmentation and model ensembling to make further improvements. We also found that public datasets and rational preprocessing are beneficial. Our method achieved 12.21mm MPJPE on the test dataset, achieving first place in the Egocentric 3D Hand Pose Estimation challenge.

CAD Models to Real-World Images: A Practical Approach to Unsupervised Domain Adaptation in Industrial Object Classification

  • paper_url: http://arxiv.org/abs/2310.04757
  • repo_url: https://github.com/dritter-bht/synthnet-transfer-learning
  • paper_authors: Dennis Ritter, Mike Hemberger, Marc Hönig, Volker Stopp, Erik Rodner, Kristian Hildebrand
  • for: This study systematically analyzes unsupervised domain adaptation pipelines for object classification in a challenging industrial setting, where only category-labeled CAD models are available but classification must be performed on real-world images.
  • methods: The proposed domain adaptation pipeline achieves SoTA performance on the VisDA benchmark and drastically improves recognition performance on a new open industrial dataset comprising 102 mechanical parts.
  • results: The study distills a set of guidelines for practitioners who need to apply state-of-the-art unsupervised domain adaptation in practice.
    Abstract In this paper, we systematically analyze unsupervised domain adaptation pipelines for object classification in a challenging industrial setting. In contrast to standard natural object benchmarks existing in the field, our results highlight the most important design choices when only category-labeled CAD models are available but classification needs to be done with real-world images. Our domain adaptation pipeline achieves SoTA performance on the VisDA benchmark, but more importantly, drastically improves recognition performance on our new open industrial dataset comprised of 102 mechanical parts. We conclude with a set of guidelines that are relevant for practitioners needing to apply state-of-the-art unsupervised domain adaptation in practice. Our code is available at https://github.com/dritter-bht/synthnet-transfer-learning.

Balancing stability and plasticity in continual learning: the readout-decomposition of activation change (RDAC) framework

  • paper_url: http://arxiv.org/abs/2310.04741
  • repo_url: None
  • paper_authors: Daniel Anthes, Sushrut Thorat, Peter König, Tim C. Kietzmann
  • for: This work aims to explain the stability-plasticity trade-off in continual learning (CL) algorithms and to offer insights that help address it.
  • methods: The Readout-Decomposition of Activation Change (RDAC) framework relates learning-induced activation changes in the range of prior readouts to the degree of stability, and changes in the null space to the degree of plasticity. It is used to analyze the stability-plasticity trade-offs of popular regularization algorithms (Synaptic Intelligence, Elastic Weight Consolidation, Learning without Forgetting) and replay-based algorithms (Gradient Episodic Memory, data replay) in deep non-linear networks tackling split-CIFAR-110 tasks.
  • results: GEM and data replay preserved both stability and plasticity, whereas SI, EWC, and LwF traded plasticity for stability; the regularization algorithms' loss of plasticity was linked to their restricting activation change in the null space of the prior readout. For one-hidden-layer linear networks, a derived gradient-decomposition algorithm that restricts activation change to the range of the prior readouts maintained high stability without significant further loss of plasticity.
    Abstract Continual learning (CL) algorithms strive to acquire new knowledge while preserving prior information. However, this stability-plasticity trade-off remains a central challenge. This paper introduces a framework that dissects this trade-off, offering valuable insights into CL algorithms. The Readout-Decomposition of Activation Change (RDAC) framework first addresses the stability-plasticity dilemma and its relation to catastrophic forgetting. It relates learning-induced activation changes in the range of prior readouts to the degree of stability and changes in the null space to the degree of plasticity. In deep non-linear networks tackling split-CIFAR-110 tasks, the framework clarifies the stability-plasticity trade-offs of the popular regularization algorithms Synaptic intelligence (SI), Elastic-weight consolidation (EWC), and learning without Forgetting (LwF), and replay-based algorithms Gradient episodic memory (GEM), and data replay. GEM and data replay preserved stability and plasticity, while SI, EWC, and LwF traded off plasticity for stability. The inability of the regularization algorithms to maintain plasticity was linked to them restricting the change of activations in the null space of the prior readout. Additionally, for one-hidden-layer linear neural networks, we derived a gradient decomposition algorithm to restrict activation change only in the range of the prior readouts, to maintain high stability while not further sacrificing plasticity. Results demonstrate that the algorithm maintained stability without significant plasticity loss. The RDAC framework informs the behavior of existing CL algorithms and paves the way for novel CL approaches. Finally, it sheds light on the connection between learning-induced activation/representation changes and the stability-plasticity dilemma, also offering insights into representational drift in biological systems.
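
The decomposition at the heart of RDAC is linear algebra: an activation change is split into the component lying in the row space (range) of the prior readout weights, which the prior readouts can "see", and the component in their null space, which they cannot. A minimal NumPy sketch with toy shapes (the names and dimensions are illustrative only):

```python
import numpy as np

def decompose_activation_change(delta_h, W):
    """delta_h: (d,) activation change; W: (k, d) prior linear readout weights."""
    P_range = np.linalg.pinv(W) @ W   # orthogonal projector onto the row space of W
    in_range = P_range @ delta_h      # seen by prior readouts -> relates to stability
    in_null = delta_h - in_range      # invisible to prior readouts -> relates to plasticity
    return in_range, in_null

rng = np.random.default_rng(0)
W = rng.standard_normal((10, 256))    # e.g. a 10-way readout on 256-d features
delta_h = rng.standard_normal(256)    # some learning-induced activation change
r, n = decompose_activation_change(delta_h, W)
print(np.linalg.norm(r), np.linalg.norm(n))   # how much change each subspace carries
```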

Activate and Reject: Towards Safe Domain Generalization under Category Shift

  • paper_url: http://arxiv.org/abs/2310.04724
  • repo_url: None
  • paper_authors: Chaoqi Chen, Luyao Tang, Leitian Tao, Hong-Yu Zhou, Yue Huang, Xiaoguang Han, Yizhou Yu
  • for: This work addresses the difficulty deep neural networks face in achieving satisfactory accuracy in the open world, where novel domains and object classes occur, by simultaneously detecting unknown-class samples and classifying known-class samples in the target domains.
  • methods: The authors propose an Activate and Reject (ART) framework: during training, the unknown-class probability is optimized and the overall output is smoothed to mitigate overconfidence; at test time, a step-wise online adaptation method predicts labels using cross-domain nearest neighbors and class-prototype information, without updating network parameters or relying on threshold-based mechanisms.
  • results: Experiments show ART improves the generalization of deep networks across different vision tasks. For image classification, it improves the H-score by 6.1% on average over the previous best method; for object detection and semantic segmentation, new benchmarks are established and competitive performance is achieved.
    Abstract Albeit the notable performance on in-domain test points, it is non-trivial for deep neural networks to attain satisfactory accuracy when deploying in the open world, where novel domains and object classes often occur. In this paper, we study a practical problem of Domain Generalization under Category Shift (DGCS), which aims to simultaneously detect unknown-class samples and classify known-class samples in the target domains. Compared to prior DG works, we face two new challenges: 1) how to learn the concept of ``unknown'' during training with only source known-class samples, and 2) how to adapt the source-trained model to unseen environments for safe model deployment. To this end, we propose a novel Activate and Reject (ART) framework to reshape the model's decision boundary to accommodate unknown classes and conduct post hoc modification to further discriminate known and unknown classes using unlabeled test data. Specifically, during training, we promote the response to the unknown by optimizing the unknown probability and then smoothing the overall output to mitigate the overconfidence issue. At test time, we introduce a step-wise online adaptation method that predicts the label by virtue of the cross-domain nearest neighbor and class prototype information without updating the network's parameters or using threshold-based mechanisms. Experiments reveal that ART consistently improves the generalization capability of deep networks on different vision tasks. For image classification, ART improves the H-score by 6.1% on average compared to the previous best method. For object detection and semantic segmentation, we establish new benchmarks and achieve competitive performance.
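    The test-time prediction step described above (cross-domain nearest neighbor plus class prototypes, no parameter updates) can be illustrated with a hedged sketch. The function below is not the paper's implementation; the mixing weight `alpha`, the cosine-similarity scoring, and the memory-bank interface are assumptions.

```python
import torch
import torch.nn.functional as F

def prototype_nn_predict(feat, prototypes, memory_feats, memory_labels, alpha=0.5):
    """Predict a label for one test feature by mixing class-prototype similarity
    with nearest-neighbor evidence from previously seen test features."""
    feat = F.normalize(feat, dim=-1)
    proto_sim = feat @ F.normalize(prototypes, dim=-1).T            # (num_classes,)
    if memory_feats.numel() > 0:
        nn_idx = (feat @ F.normalize(memory_feats, dim=-1).T).argmax()
        nn_onehot = F.one_hot(memory_labels[nn_idx], prototypes.size(0)).float()
    else:
        nn_onehot = torch.zeros(prototypes.size(0))
    scores = alpha * proto_sim.softmax(-1) + (1 - alpha) * nn_onehot
    return scores.argmax().item()

# toy usage with 5 known classes and a small feature memory bank
protos = F.normalize(torch.randn(5, 128), dim=-1)
mem_f, mem_y = torch.randn(20, 128), torch.randint(0, 5, (20,))
print(prototype_nn_predict(torch.randn(128), protos, mem_f, mem_y))
```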

Memory-Constrained Semantic Segmentation for Ultra-High Resolution UAV Imagery

  • paper_url: http://arxiv.org/abs/2310.04721
  • repo_url: None
  • paper_authors: Qi Li, Jiaxin Cai, Yuanlong Yu, Jason Gu, Jia Pan, Wenxi Liu
  • for: This paper addresses semantic segmentation of ultra-high resolution UAV imagery, in particular how to segment efficiently on computational devices with limited GPU memory.
  • methods: The authors propose a GPU memory-efficient framework that performs local inference without accessing context beyond local patches. A spatial-guided high-resolution query module predicts pixel-wise segmentation by querying nearest latent embeddings under the guidance of high-resolution information, and an efficient memory-based interaction scheme corrects potential semantic bias by associating cross-image contextual semantics.
  • results: Experiments on public benchmarks show superior performance under both small and large GPU memory usage limits.
    Abstract Amidst the swift advancements in photography and sensor technologies, high-definition cameras have become commonplace in the deployment of Unmanned Aerial Vehicles (UAVs) for diverse operational purposes. Within the domain of UAV imagery analysis, the segmentation of ultra-high resolution images emerges as a substantial and intricate challenge, especially when grappling with the constraints imposed by GPU memory-restricted computational devices. This paper delves into the intricate problem of achieving efficient and effective segmentation of ultra-high resolution UAV imagery, while operating under stringent GPU memory limitation. The strategy of existing approaches is to downscale the images to achieve computationally efficient segmentation. However, this strategy tends to overlook smaller, thinner, and curvilinear regions. To address this problem, we propose a GPU memory-efficient and effective framework for local inference without accessing the context beyond local patches. In particular, we introduce a novel spatial-guided high-resolution query module, which predicts pixel-wise segmentation results with high quality only by querying nearest latent embeddings with the guidance of high-resolution information. Additionally, we present an efficient memory-based interaction scheme to correct potential semantic bias of the underlying high-resolution information by associating cross-image contextual semantics. For evaluation of our approach, we perform comprehensive experiments over public benchmarks and achieve superior performance under both conditions of small and large GPU memory usage limitations. We will release the model and codes in the future.

A Comprehensive Survey on Deep Neural Image Deblurring

  • paper_url: http://arxiv.org/abs/2310.04719
  • repo_url: None
  • paper_authors: Sajjad Amrollahi Biyouki, Hoon Hwangbo
  • for: Image deblurring: removing the degradation that causes blurriness to improve image quality and the visualization of textures and objects.
  • methods: A comprehensive review of deep neural networks for both blind and non-blind image deblurring.
  • results: Deep neural networks have brought a major breakthrough in image deblurring, improving performance metrics and broadening the datasets in use; the survey also discusses remaining challenges and research gaps that future work may focus on.
    Abstract Image deblurring tries to eliminate degradation elements of an image causing blurriness and improve the quality of an image for better texture and object visualization. Traditionally, prior-based optimization approaches predominated in image deblurring, but deep neural networks recently brought a major breakthrough in the field. In this paper, we comprehensively review the recent progress of the deep neural architectures in both blind and non-blind image deblurring. We outline the most popular deep neural network structures used in deblurring applications, describe their strengths and novelties, summarize performance metrics, and introduce broadly used datasets. In addition, we discuss the current challenges and research gaps in this domain and suggest potential research directions for future works.

Reinforced UI Instruction Grounding: Towards a Generic UI Task Automation API

  • paper_url: http://arxiv.org/abs/2310.04716
  • repo_url: None
  • paper_authors: Zhizheng Zhang, Wenxuan Xie, Xiaoyi Zhang, Yan Lu
  • for: automate numerous AI tasks by connecting Large Language Models (LLMs) to various domain-specific models or APIs
  • methods: build a multimodal model to ground natural language instructions in given UI screenshots as a generic UI task automation executor, using a visual encoder and a language decoder, and an innovative Reinforcement Learning (RL) based algorithm
  • results: outperforms the state-of-the-art methods by a clear margin, showing the potential as a generic UI task automation API
    Abstract Recent popularity of Large Language Models (LLMs) has opened countless possibilities in automating numerous AI tasks by connecting LLMs to various domain-specific models or APIs, where LLMs serve as dispatchers while domain-specific models or APIs are action executors. Despite the vast numbers of domain-specific models/APIs, they still struggle to comprehensively cover super diverse automation demands in the interaction between human and User Interfaces (UIs). In this work, we build a multimodal model to ground natural language instructions in given UI screenshots as a generic UI task automation executor. This metadata-free grounding model, consisting of a visual encoder and a language decoder, is first pretrained on well studied document understanding tasks and then learns to decode spatial information from UI screenshots in a promptable way. To facilitate the exploitation of image-to-text pretrained knowledge, we follow the pixel-to-sequence paradigm to predict geometric coordinates in a sequence of tokens using a language decoder. We further propose an innovative Reinforcement Learning (RL) based algorithm to supervise the tokens in such sequence jointly with visually semantic metrics, which effectively strengthens the spatial decoding capability of the pixel-to-sequence paradigm. Extensive experiments demonstrate our proposed reinforced UI instruction grounding model outperforms the state-of-the-art methods by a clear margin and shows the potential as a generic UI task automation API.

Generalized Robust Test-Time Adaptation in Continuous Dynamic Scenarios

  • paper_url: http://arxiv.org/abs/2310.04714
  • repo_url: https://github.com/bit-da/rotta
  • paper_authors: Shuang Li, Longhui Yuan, Binhui Xie, Tao Yang
  • for: This work addresses the practical setting where covariate shift and label shift occur simultaneously and continually in the test stream, i.e., data and label distributions change concurrently over time during inference.
  • methods: The proposed GRoTTA method combines Robust Parameter Adaptation, recalibration of batch normalization, source-knowledge regularization with a teacher-student model, and Bias-Guided Output Adaptation, allowing the model to adapt steadily to a continually changing test stream.
  • results: Experiments show GRoTTA outperforms existing competitors by a large margin under the PTTA setting, indicating it is well suited for real-world deployment.
    Abstract Test-time adaptation (TTA) adapts the pre-trained models to test distributions during the inference phase exclusively employing unlabeled test data streams, which holds great value for the deployment of models in real-world applications. Numerous studies have achieved promising performance on simplistic test streams, characterized by independently and uniformly sampled test data originating from a fixed target data distribution. However, these methods frequently prove ineffective in practical scenarios, where both continual covariate shift and continual label shift occur simultaneously, i.e., data and label distributions change concurrently and continually over time. In this study, a more challenging Practical Test-Time Adaptation (PTTA) setup is introduced, which takes into account the concurrent presence of continual covariate shift and continual label shift, and we propose a Generalized Robust Test-Time Adaptation (GRoTTA) method to effectively address the difficult problem. We start by steadily adapting the model through Robust Parameter Adaptation to make balanced predictions for test samples. To be specific, firstly, the effects of continual label shift are eliminated by enforcing the model to learn from a uniform label distribution and introducing recalibration of batch normalization to ensure stability. Secondly, the continual covariate shift is alleviated by employing a source knowledge regularization with the teacher-student model to update parameters. Considering the potential information in the test stream, we further refine the balanced predictions by Bias-Guided Output Adaptation, which exploits latent structure in the feature space and is adaptive to the imbalanced label distribution. Extensive experiments demonstrate GRoTTA outperforms the existing competitors by a large margin under PTTA setting, rendering it highly conducive for adoption in real-world applications.
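    One ingredient mentioned above, recalibrating batch-normalization statistics from the unlabeled test stream, can be sketched as follows. This is a simplified illustration, not the paper's exact procedure; the momentum value and the choice to update only BN buffers are assumptions.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def recalibrate_bn(model, test_batch, momentum=0.05):
    """Update BatchNorm running statistics from an unlabeled test batch,
    keeping all learnable parameters frozen."""
    was_training = model.training
    model.eval()
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d)):
            m.train()              # use batch stats and update running buffers
            m.momentum = momentum  # slow EMA for stability under continual shift
    model(test_batch)              # forward pass only; no gradients, no parameter update
    model.train(was_training)
```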

UFD-PRiME: Unsupervised Joint Learning of Optical Flow and Stereo Depth through Pixel-Level Rigid Motion Estimation

  • paper_url: http://arxiv.org/abs/2310.04712
  • repo_url: None
  • paper_authors: Shuai Yuan, Carlo Tomasi
  • for: This paper proposes joint training of optical flow and stereo disparity models to improve the accuracy and detail of optical flow.
  • methods: A two-network architecture: the first network estimates flow and disparity jointly; the second, trained with optical flow from the first as pseudo-labels, takes the first network's disparities, estimates pixel-wise 3D rigid motion, and reconstructs optical flow; a final stage fuses the outputs of the two networks.
  • results: The method achieves 7.36% optical flow error on the KITTI-2015 benchmark, a wide margin over the previous state of the art (9.38%), with visibly more detailed occlusions and object boundaries, and slightly better or comparable stereo depth results.
    Abstract Both optical flow and stereo disparities are image matches and can therefore benefit from joint training. Depth and 3D motion provide geometric rather than photometric information and can further improve optical flow. Accordingly, we design a first network that estimates flow and disparity jointly and is trained without supervision. A second network, trained with optical flow from the first as pseudo-labels, takes disparities from the first network, estimates 3D rigid motion at every pixel, and reconstructs optical flow again. A final stage fuses the outputs from the two networks. In contrast with previous methods that only consider camera motion, our method also estimates the rigid motions of dynamic objects, which are of key interest in applications. This leads to better optical flow with visibly more detailed occlusions and object boundaries as a result. Our unsupervised pipeline achieves 7.36% optical flow error on the KITTI-2015 benchmark and outperforms the previous state-of-the-art 9.38% by a wide margin. It also achieves slightly better or comparable stereo depth results. Code will be made available.

Multi-scale MRI reconstruction via dilated ensemble networks

  • paper_url: http://arxiv.org/abs/2310.04705
  • repo_url: None
  • paper_authors: Wendi Ma, Marlon Bran Lorenzana, Wei Dai, Hongfu Sun, Shekhar S. Chandra
  • for: This paper proposes an efficient multi-scale reconstruction network to improve MRI reconstruction quality.
  • methods: The network uses dilated convolutions with parallel branches and dense residual connections to preserve resolution while increasing scale; a complex-valued variant uses complex convolutions to exploit phase information.
  • results: The real-valued version outperforms common reconstruction architectures and a state-of-the-art multi-scale network while being three times more efficient; the complex-valued version yields better qualitative results when more phase information is present.
    Abstract As aliasing artefacts are highly structural and non-local, many MRI reconstruction networks use pooling to enlarge filter coverage and incorporate global context. However, this inadvertently impedes fine detail recovery as downsampling creates a resolution bottleneck. Moreover, real and imaginary features are commonly split into separate channels, discarding phase information particularly important to high frequency textures. In this work, we introduce an efficient multi-scale reconstruction network using dilated convolutions to preserve resolution and experiment with a complex-valued version using complex convolutions. Inspired by parallel dilated filters, multiple receptive fields are processed simultaneously with branches that see both large structural artefacts and fine local features. We also adopt dense residual connections for feature aggregation to efficiently increase scale and the deep cascade global architecture to reduce overfitting. The real-valued version of this model outperformed common reconstruction architectures as well as a state-of-the-art multi-scale network whilst being three times more efficient. The complex-valued network yielded better qualitative results when more phase information was present.
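    A parallel dilated-convolution block with a dense residual connection, the building pattern the abstract describes, might look like the following PyTorch sketch. The channel counts, dilation rates, and 1x1 fusion layer are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DilatedEnsembleBlock(nn.Module):
    """Parallel dilated-convolution branches that keep full resolution while
    enlarging the receptive field, fused and added back as a residual."""
    def __init__(self, channels, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        ])
        self.fuse = nn.Conv2d(channels * len(dilations), channels, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        multi_scale = torch.cat([self.act(b(x)) for b in self.branches], dim=1)
        return x + self.fuse(multi_scale)      # dense residual connection

block = DilatedEnsembleBlock(32)
print(block(torch.randn(1, 32, 64, 64)).shape)   # torch.Size([1, 32, 64, 64])
```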

Tree-GPT: Modular Large Language Model Expert System for Forest Remote Sensing Image Understanding and Interactive Analysis

  • paper_url: http://arxiv.org/abs/2310.04698
  • repo_url: None
  • paper_authors: Siqi Du, Shengjun Tang, Weixi Wang, Xiaoming Li, Renzhong Guo
  • for: This paper aims to improve the efficiency of forestry remote sensing data analysis by integrating Large Language Models (LLMs) into the forestry remote sensing workflow.
  • methods: A modular LLM expert system, Tree-GPT, integrates an image understanding module, a domain knowledge base, and toolchains into LLMs, enabling them to comprehend images, retrieve accurate knowledge, generate code, and perform data analysis in a local environment.
  • results: Tests on search, visualization, and machine learning analysis tasks produced good results, demonstrating that LLMs can be used dynamically in forestry research and environmental sciences.
    Abstract This paper introduces a novel framework, Tree-GPT, which incorporates Large Language Models (LLMs) into the forestry remote sensing data workflow, thereby enhancing the efficiency of data analysis. Currently, LLMs are unable to extract or comprehend information from images and may generate inaccurate text due to a lack of domain knowledge, limiting their use in forestry data analysis. To address this issue, we propose a modular LLM expert system, Tree-GPT, that integrates image understanding modules, domain knowledge bases, and toolchains. This empowers LLMs with the ability to comprehend images, acquire accurate knowledge, generate code, and perform data analysis in a local environment. Specifically, the image understanding module extracts structured information from forest remote sensing images by utilizing automatic or interactive generation of prompts to guide the Segment Anything Model (SAM) in generating and selecting optimal tree segmentation results. The system then calculates tree structural parameters based on these results and stores them in a database. Upon receiving a specific natural language instruction, the LLM generates code based on a thought chain to accomplish the analysis task. The code is then executed by an LLM agent in a local environment. For ecological parameter calculations, the system retrieves the corresponding knowledge from the knowledge base and inputs it into the LLM to guide the generation of accurate code. We tested this system on several tasks, including Search, Visualization, and Machine Learning Analysis. The prototype system performed well, demonstrating the potential for dynamic usage of LLMs in forestry research and environmental sciences.

SeeDS: Semantic Separable Diffusion Synthesizer for Zero-shot Food Detection

  • paper_url: http://arxiv.org/abs/2310.04689
  • repo_url: https://github.com/lancezpf/seeds
  • paper_authors: Pengfei Zhou, Weiqing Min, Yang Zhang, Jiajun Song, Ying Jin, Shuqiang Jiang
  • for: zero-shot food detection (ZSFD)
  • methods: semantic separable diffusion synthesizer (SeeDS) framework, including semantic separable synthesizing module (S$^3$M) and region feature denoising diffusion model (RFDDM)
  • results: state-of-the-art ZSFD performance on two food datasets (ZSFooD and UECFOOD-256), and effectiveness on general ZSD datasets (PASCAL VOC and MS COCO)
    Abstract Food detection is becoming a fundamental task in food computing that supports various multimedia applications, including food recommendation and dietary monitoring. To deal with real-world scenarios, food detection needs to localize and recognize novel food objects that are not seen during training, demanding Zero-Shot Detection (ZSD). However, the complexity of semantic attributes and intra-class feature diversity poses challenges for ZSD methods in distinguishing fine-grained food classes. To tackle this, we propose the Semantic Separable Diffusion Synthesizer (SeeDS) framework for Zero-Shot Food Detection (ZSFD). SeeDS consists of two modules: a Semantic Separable Synthesizing Module (S$^3$M) and a Region Feature Denoising Diffusion Model (RFDDM). The S$^3$M learns the disentangled semantic representation for complex food attributes from ingredients and cuisines, and synthesizes discriminative food features via enhanced semantic information. The RFDDM utilizes a novel diffusion model to generate diversified region features and enhances ZSFD via fine-grained synthesized features. Extensive experiments show the state-of-the-art ZSFD performance of our proposed method on two food datasets, ZSFooD and UECFOOD-256. Moreover, SeeDS also maintains effectiveness on general ZSD datasets, PASCAL VOC and MS COCO. The code and dataset can be found at https://github.com/LanceZPF/SeeDS.

PatchProto Networks for Few-shot Visual Anomaly Classification

  • paper_url: http://arxiv.org/abs/2310.04688
  • repo_url: None
  • paper_authors: Jian Wang, Yue Zhuo
  • for: This work targets the practical problem that anomaly samples are extremely scarce, i.e., few-shot learning (FSL).
  • methods: The proposed PatchProto networks extract CNN features only from the defective regions of interest, which serve as prototypes for few-shot learning.
  • results: On the MVTec-AD dataset, PatchProto networks significantly improve few-shot anomaly classification accuracy compared with a basic few-shot classifier.
    Abstract The visual anomaly diagnosis can automatically analyze the defective products, which has been widely applied in industrial quality inspection. The anomaly classification can classify the defective products into different categories. However, the anomaly samples are hard to access in practice, which impedes the training of canonical machine learning models. This paper studies a practical issue that anomaly samples for training are extremely scarce, i.e., few-shot learning (FSL). Utilizing the sufficient normal samples, we propose PatchProto networks for few-shot anomaly classification. Different from classical FSL methods, PatchProto networks only extract CNN features of defective regions of interest, which serves as the prototypes for few-shot learning. Compared with basic few-shot classifier, the experiment results on MVTec-AD dataset show PatchProto networks significantly improve the few-shot anomaly classification accuracy.
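    The prototype idea can be made concrete with a short sketch: average the CNN features of the defective regions of interest per class to form prototypes, then classify queries by nearest prototype. The feature dimensions and the distance-to-probability mapping below are assumptions, not the paper's exact design.

```python
import torch

def build_prototypes(support_feats, support_labels, num_classes):
    """Average the CNN features of defective regions of interest per class
    to form one prototype per defect category (few-shot support set)."""
    return torch.stack([support_feats[support_labels == c].mean(0)
                        for c in range(num_classes)])

def classify(query_feat, prototypes):
    """Assign the query to the nearest prototype (softmax over negative distances)."""
    dists = torch.cdist(query_feat.unsqueeze(0), prototypes).squeeze(0)
    return (-dists).softmax(-1)

# toy example: 3 defect classes, 5 support patches each, 256-d ROI features
feats = torch.randn(15, 256)
labels = torch.arange(3).repeat_interleave(5)
protos = build_prototypes(feats, labels, num_classes=3)
print(classify(torch.randn(256), protos))
```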

High Visual-Fidelity Learned Video Compression

  • paper_url: http://arxiv.org/abs/2310.04679
  • repo_url: None
  • paper_authors: Meng Li, Yibo Shi, Jing Wang, Yunqi Huang
  • for: To improve the perceptual quality of video applications, the paper proposes a High Visual-Fidelity Learned Video Compression framework (HVFVC).
  • methods: A novel confidence-based feature reconstruction method addresses poor reconstruction in newly-emerged regions, and a periodic compensation loss mitigates the checkerboard artifacts caused by deconvolution and optimization.
  • results: HVFVC achieves excellent perceptual quality, outperforming the VVC standard while requiring only 50% of the bitrate.
    Abstract With the growing demand for video applications, many advanced learned video compression methods have been developed, outperforming traditional methods in terms of objective quality metrics such as PSNR. Existing methods primarily focus on objective quality but tend to overlook perceptual quality. Directly incorporating perceptual loss into a learned video compression framework is nontrivial and raises several perceptual quality issues that need to be addressed. In this paper, we investigated these issues in learned video compression and propose a novel High Visual-Fidelity Learned Video Compression framework (HVFVC). Specifically, we design a novel confidence-based feature reconstruction method to address the issue of poor reconstruction in newly-emerged regions, which significantly improves the visual quality of the reconstruction. Furthermore, we present a periodic compensation loss to mitigate the checkerboard artifacts related to deconvolution operation and optimization. Extensive experiments have shown that the proposed HVFVC achieves excellent perceptual quality, outperforming the latest VVC standard with only 50% required bitrate.

AG-CRC: Anatomy-Guided Colorectal Cancer Segmentation in CT with Imperfect Anatomical Knowledge

  • paper_url: http://arxiv.org/abs/2310.04677
  • repo_url: https://github.com/rongzhao-zhang/ag-crc
  • paper_authors: Rongzhao Zhang, Zhian Bai, Ruoying Yu, Wenrao Pang, Lingyun Wang, Lifeng Zhu, Xiaofan Zhang, Huan Zhang, Weiguo Hu
  • for: This work aims to improve colorectal cancer (CRC) tumor segmentation from computed tomography (CT) by exploiting multi-organ segmentation (MOS) masks generated by existing deep learning models together with tailored learning strategies.
  • methods: The proposed Anatomy-Guided segmentation framework obtains MOS masks, derives a more robust organ-of-interest (OOI) mask, and introduces an anatomy-guided training patch sampling strategy along with a self-supervised learning scheme inspired by the topology of tubular organs.
  • results: On two CRC segmentation datasets, the method improves the Dice score by 5% to 9% over state-of-the-art medical image segmentation models, and ablation studies confirm the effectiveness of each proposed component.
    Abstract When delineating lesions from medical images, a human expert can always keep in mind the anatomical structure behind the voxels. However, although high-quality (though not perfect) anatomical information can be retrieved from computed tomography (CT) scans with modern deep learning algorithms, it is still an open problem how these automatically generated organ masks can assist in addressing challenging lesion segmentation tasks, such as the segmentation of colorectal cancer (CRC). In this paper, we develop a novel Anatomy-Guided segmentation framework to exploit the auto-generated organ masks to aid CRC segmentation from CT, namely AG-CRC. First, we obtain multi-organ segmentation (MOS) masks with existing MOS models (e.g., TotalSegmentor) and further derive a more robust organ of interest (OOI) mask that may cover most of the colon-rectum and CRC voxels. Then, we propose an anatomy-guided training patch sampling strategy by optimizing a heuristic gain function that considers both the proximity of important regions (e.g., the tumor or organs of interest) and sample diversity. Third, we design a novel self-supervised learning scheme inspired by the topology of tubular organs like the colon to boost the model performance further. Finally, we employ a masked loss scheme to guide the model to focus solely on the essential learning region. We extensively evaluate the proposed method on two CRC segmentation datasets, where substantial performance improvement (5% to 9% in Dice) is achieved over current state-of-the-art medical image segmentation models, and the ablation studies further evidence the efficacy of every proposed component.
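    The anatomy-guided patch sampling step can be approximated with a simple proximity-weighted sampler. The sketch below is only illustrative: the exponential gain function and temperature are assumptions, and the paper's heuristic additionally accounts for sample diversity, which is only loosely captured here by sampling without replacement.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def sample_patch_centers(ooi_mask, tumor_mask, n_patches, temp=20.0, rng=None):
    """Sample training-patch centers with probability that favors voxels close
    to the organ-of-interest / tumor region."""
    rng = rng or np.random.default_rng()
    important = (ooi_mask > 0) | (tumor_mask > 0)
    dist = distance_transform_edt(~important)    # distance to nearest important voxel
    gain = np.exp(-dist / temp)                  # heuristic gain: proximity term
    prob = gain.ravel() / gain.sum()
    idx = rng.choice(gain.size, size=n_patches, replace=False, p=prob)
    return np.stack(np.unravel_index(idx, gain.shape), axis=1)  # (n, ndim) voxel coords
```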

EasyPhoto: Your Smart AI Photo Generator

  • paper_url: http://arxiv.org/abs/2310.04672
  • repo_url: https://github.com/aigc-apps/sd-webui-EasyPhoto
  • paper_authors: Ziheng Wu, Jiaqi Xu, Xinyi Zou, Kunzhe Huang, Xing Shi, Jun Huang
  • for: This paper introduces EasyPhoto, a WebUI plugin that lets users generate AI portraits by providing 5 to 20 related images.
  • methods: The plugin is built on the Gradio library and Stable Diffusion, and trains a LoRA model on the user's images for identity feature extraction and AI photo generation.
  • results: The plugin can generate a variety of AI photos from the user's images, supporting custom templates and the stronger SDXL model for more diverse and satisfactory results.
    Abstract Stable Diffusion web UI (SD-WebUI) is a comprehensive project that provides a browser interface based on Gradio library for Stable Diffusion models. In this paper, We propose a novel WebUI plugin called EasyPhoto, which enables the generation of AI portraits. By training a digital doppelganger of a specific user ID using 5 to 20 relevant images, the finetuned model (according to the trained LoRA model) allows for the generation of AI photos using arbitrary templates. Our current implementation supports the modification of multiple persons and different photo styles. Furthermore, we allow users to generate fantastic template image with the strong SDXL model, enhancing EasyPhoto's capabilities to deliver more diverse and satisfactory results. The source code for EasyPhoto is available at: https://github.com/aigc-apps/sd-webui-EasyPhoto. We also support a webui-free version by using diffusers: https://github.com/aigc-apps/EasyPhoto. We are continuously enhancing our efforts to expand the EasyPhoto pipeline, making it suitable for any identification (not limited to just the face), and we enthusiastically welcome any intriguing ideas or suggestions.
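    A hedged sketch of the generation step, using the standard diffusers API rather than the plugin's own code (the base model id, LoRA path, and prompt are placeholders): load a Stable Diffusion pipeline, attach the user-specific LoRA trained on 5-20 photos, and sample a portrait from a template prompt.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a base Stable Diffusion model, then attach a user-specific LoRA that was
# fine-tuned on 5-20 photos of the target identity (paths are placeholders).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("path/to/user_id_lora")   # the trained digital doppelganger

image = pipe(
    prompt="portrait photo of <user>, studio lighting, high detail",
    negative_prompt="blurry, deformed",
    num_inference_steps=30,
    guidance_scale=7.0,
).images[0]
image.save("ai_portrait.png")
```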

Visual Abductive Reasoning Meets Driving Hazard Prediction: Problem Formulation and Dataset

  • paper_url: http://arxiv.org/abs/2310.04671
  • repo_url: https://github.com/dhpr-dataset/dhpr-dataset
  • paper_authors: Korawat Charoenpitaks, Van-Quang Nguyen, Masanori Suganuma, Masahiro Takahashi, Ryoma Niihara, Takayuki Okatani
  • for: This work aims to predict hazards that drivers may encounter while driving.
  • methods: Prediction is made from a single input image captured by a car dashcam. Unlike prior driving hazard prediction methods based on computational simulation or video anomaly detection, this approach performs high-level visual abductive reasoning to anticipate future events from a static image.
  • results: The work creates the DHPR (Driving Hazard Prediction and Reasoning) dataset of 15K dashcam images of street scenes, each annotated with car speed, a hypothesized hazard description, and visible scene entities; annotators identify risky scenes and describe accidents that could occur a few seconds later. Baseline methods perform modestly, indicating substantial room for further research.
    Abstract This paper addresses the problem of predicting hazards that drivers may encounter while driving a car. We formulate it as a task of anticipating impending accidents using a single input image captured by car dashcams. Unlike existing approaches to driving hazard prediction that rely on computational simulations or anomaly detection from videos, this study focuses on high-level inference from static images. The problem needs predicting and reasoning about future events based on uncertain observations, which falls under visual abductive reasoning. To enable research in this understudied area, a new dataset named the DHPR (Driving Hazard Prediction and Reasoning) dataset is created. The dataset consists of 15K dashcam images of street scenes, and each image is associated with a tuple containing car speed, a hypothesized hazard description, and visual entities present in the scene. These are annotated by human annotators, who identify risky scenes and provide descriptions of potential accidents that could occur a few seconds later. We present several baseline methods and evaluate their performance on our dataset, identifying remaining issues and discussing future directions. This study contributes to the field by introducing a novel problem formulation and dataset, enabling researchers to explore the potential of multi-modal AI for driving hazard prediction.

Learning to Rank Onset-Occurring-Offset Representations for Micro-Expression Recognition

  • paper_url: http://arxiv.org/abs/2310.04664
  • repo_url: None
  • paper_authors: Jie Zhu, Yuan Zong, Jingang Shi, Cheng Lu, Hongli Chang, Wenming Zheng
  • for: This work focuses on micro-expression recognition (MER) and proposes a flexible and reliable deep learning method, learning to rank onset-occurring-offset representations (LTR3O).
  • methods: LTR3O uses a dynamic, reduced-size sequence structure called 3O, consisting of onset, occurring, and offset frames, to represent micro-expressions (MEs); the occurring frame is randomly extracted from the original ME sequence, avoiding the need for accurate frame spotting. Multiple 3O representation candidates are generated per ME sample, and dedicated modules measure and calibrate their emotional expressiveness so that their distribution aligns with that of macro-expressions (MaMs).
  • results: Experiments on three widely used ME databases (CASME II, SMIC, and SAMM) show that LTR3O is more flexible and reliable than recent state-of-the-art MER methods.
    Abstract This paper focuses on the research of micro-expression recognition (MER) and proposes a flexible and reliable deep learning method called learning to rank onset-occurring-offset representations (LTR3O). The LTR3O method introduces a dynamic and reduced-size sequence structure known as 3O, which consists of onset, occurring, and offset frames, for representing micro-expressions (MEs). This structure facilitates the subsequent learning of ME-discriminative features. A noteworthy advantage of the 3O structure is its flexibility, as the occurring frame is randomly extracted from the original ME sequence without the need for accurate frame spotting methods. Based on the 3O structures, LTR3O generates multiple 3O representation candidates for each ME sample and incorporates well-designed modules to measure and calibrate their emotional expressiveness. This calibration process ensures that the distribution of these candidates aligns with that of macro-expressions (MaMs) over time. Consequently, the visibility of MEs can be implicitly enhanced, facilitating the reliable learning of more discriminative features for MER. Extensive experiments were conducted to evaluate the performance of LTR3O using three widely-used ME databases: CASME II, SMIC, and SAMM. The experimental results demonstrate the effectiveness and superior performance of LTR3O, particularly in terms of its flexibility and reliability, when compared to recent state-of-the-art MER methods.

VLAttack: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models

  • paper_url: http://arxiv.org/abs/2310.04655
  • repo_url: https://github.com/ericyinyzy/VLAttack
  • paper_authors: Ziyi Yin, Muchao Ye, Tianrong Zhang, Tianyu Du, Jinguo Zhu, Han Liu, Jinghui Chen, Ting Wang, Fenglong Ma
  • for: This paper studies the adversarial robustness of pre-trained vision-language (VL) models on multimodal tasks.
  • methods: A new practical task is proposed: generating adversarial image and text perturbations with pre-trained VL models to attack black-box fine-tuned models on downstream tasks. At the image level, a block-wise similarity attack (BSA) strategy learns image perturbations; at the text level, an existing text attack strategy is applied independently of the image-modal attack; at the multimodal level, an iterative cross-search attack (ICSA) periodically updates the adversarial image-text pairs.
  • results: Attacking three widely used pre-trained VL models on six tasks across eight datasets, the method achieves the highest attack success rates on all tasks compared with existing baselines, revealing a significant blind spot in the deployment of pre-trained VL models.
    Abstract Vision-Language (VL) pre-trained models have shown their superiority on many multimodal tasks. However, the adversarial robustness of such models has not been fully explored. Existing approaches mainly focus on exploring the adversarial robustness under the white-box setting, which is unrealistic. In this paper, we aim to investigate a new yet practical task to craft image and text perturbations using pre-trained VL models to attack black-box fine-tuned models on different downstream tasks. Towards this end, we propose VLAttack to generate adversarial samples by fusing perturbations of images and texts from both single-modal and multimodal levels. At the single-modal level, we propose a new block-wise similarity attack (BSA) strategy to learn image perturbations for disrupting universal representations. Besides, we adopt an existing text attack strategy to generate text perturbations independent of the image-modal attack. At the multimodal level, we design a novel iterative cross-search attack (ICSA) method to update adversarial image-text pairs periodically, starting with the outputs from the single-modal level. We conduct extensive experiments to attack three widely-used VL pretrained models for six tasks on eight datasets. Experimental results show that the proposed VLAttack framework achieves the highest attack success rates on all tasks compared with state-of-the-art baselines, which reveals a significant blind spot in the deployment of pre-trained VL models. Codes will be released soon.
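    The image-level idea can be illustrated with a simplified, global (not block-wise) feature-similarity perturbation in the spirit of BSA. The sketch below is an assumption-laden stand-in, not the paper's algorithm: the PGD-style update, the cosine-distance objective, and the epsilon/step-size values are all illustrative.

```python
import torch
import torch.nn.functional as F

def similarity_attack(encoder, image, steps=10, eps=8 / 255, alpha=2 / 255):
    """PGD-style image perturbation that pushes the perturbed image's features
    away from the clean image's features (simplified, global variant of BSA)."""
    clean_feat = encoder(image).detach()
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        adv_feat = encoder(image + delta)
        loss = 1 - F.cosine_similarity(adv_feat.flatten(1),
                                       clean_feat.flatten(1)).mean()
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()              # maximize feature distance
            delta.clamp_(-eps, eps)
            delta.copy_((image + delta).clamp(0, 1) - image)  # keep valid pixel range
        delta.grad.zero_()
    return (image + delta).detach()
```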

X-Transfer: A Transfer Learning-Based Framework for Robust GAN-Generated Fake Image Detection

  • paper_url: http://arxiv.org/abs/2310.04639
  • repo_url: None
  • paper_authors: Lei Zhang, Hao Chen, Shu Hu, Bin Zhu, Xi Wu, Jinrong Hu, Xin Wang
  • for: This work proposes a new detection algorithm for GAN-generated fake images, addressing the problem of deceptive images across many domains.
  • methods: The method uses two sibling neural networks with interleaved parallel gradient transmission to strengthen transfer learning, and combines an AUC loss term with a cross-entropy loss to improve performance.
  • results: Extensive experiments on multiple facial image datasets show the model outperforms the general transfer-learning approach, with a best accuracy of 99.04%, and it also performs well on non-face datasets, demonstrating broader applicability.
    Abstract Generative adversarial networks (GANs) have remarkably advanced in diverse domains, especially image generation and editing. However, the misuse of GANs for generating deceptive images raises significant security concerns, including face replacement and fake accounts, which have gained widespread attention. Consequently, there is an urgent need for effective detection methods to distinguish between real and fake images. Some of the current research centers around the application of transfer learning. Nevertheless, it encounters challenges such as knowledge forgetting from the original dataset and inadequate performance when dealing with imbalanced data during training. To alleviate the above issues, this paper introduces a novel GAN-generated image detection algorithm called X-Transfer. This model enhances transfer learning by utilizing two sibling neural networks that employ interleaved parallel gradient transmission. This approach also effectively mitigates the problem of excessive knowledge forgetting. In addition, we combine AUC loss term and cross-entropy loss to enhance the model's performance comprehensively. The AUC loss approximates the AUC metric using WMW statistics, ensuring differentiability and improving the performance of traditional AUC evaluation. We carry out comprehensive experiments on multiple facial image datasets. The results show that our model outperforms the general transferring approach, and the best accuracy achieves 99.04%, which is increased by approximately 10%. Furthermore, we demonstrate excellent performance on non-face datasets, validating its generality and broader application prospects.
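    The AUC loss term can be written as the standard differentiable Wilcoxon-Mann-Whitney surrogate over positive-negative score pairs; the margin `gamma` and exponent `p` below are conventional defaults, not values taken from the paper.

```python
import torch

def wmw_auc_loss(scores, labels, gamma=0.2, p=2):
    """Differentiable AUC surrogate based on the Wilcoxon-Mann-Whitney statistic:
    penalize positive-negative score pairs whose margin falls below gamma."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos.unsqueeze(1) - neg.unsqueeze(0)      # all positive-negative pairs
    return torch.clamp(gamma - diff, min=0).pow(p).mean()

# toy usage: higher scores should be assigned to fake (label 1) images
scores = torch.tensor([0.9, 0.4, 0.7, 0.2], requires_grad=True)
labels = torch.tensor([1, 0, 1, 0])
loss = wmw_auc_loss(scores, labels)
loss.backward()
```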

Metadata-Conditioned Generative Models to Synthesize Anatomically-Plausible 3D Brain MRIs

  • paper_url: http://arxiv.org/abs/2310.04630
  • repo_url: None
  • paper_authors: Wei Peng, Tomas Bosschieter, Jiahong Ouyang, Robert Paul, Ehsan Adeli, Qingyu Zhao, Kilian M. Pohl
  • for: This paper aims to increase data diversity in neuroimaging research by using generative AI models to synthesize age- and sex-conditioned MRI data, supporting the study of changes in brain structure and function.
  • methods: The paper proposes a new generative model, BrainSynth, which generates metadata-conditioned synthetic MRIs, together with a novel procedure to evaluate their anatomical plausibility.
  • results: More than half of the brain regions in the synthetic MRIs are anatomically accurate, and the synthetic data significantly improve the training of a convolutional neural network for identifying accelerated aging effects.
    Abstract Generative AI models hold great potential in creating synthetic brain MRIs that advance neuroimaging studies by, for example, enriching data diversity. However, the mainstay of AI research only focuses on optimizing the visual quality (such as signal-to-noise ratio) of the synthetic MRIs while lacking insights into their relevance to neuroscience. To gain these insights with respect to T1-weighted MRIs, we first propose a new generative model, BrainSynth, to synthesize metadata-conditioned (e.g., age- and sex-specific) MRIs that achieve state-of-the-art visual quality. We then extend our evaluation with a novel procedure to quantify anatomical plausibility, i.e., how well the synthetic MRIs capture macrostructural properties of brain regions, and how accurately they encode the effects of age and sex. Results indicate that more than half of the brain regions in our synthetic MRIs are anatomically accurate, i.e., with a small effect size between real and synthetic MRIs. Moreover, the anatomical plausibility varies across cortical regions according to their geometric complexity. As is, our synthetic MRIs can significantly improve the training of a Convolutional Neural Network to identify accelerated aging effects in an independent study. These results highlight the opportunities of using generative AI to aid neuroimaging research and point to areas for further improvement.

cs.AI - 2023-10-07

Balancing Specialized and General Skills in LLMs: The Impact of Modern Tuning and Data Strategy

  • paper_url: http://arxiv.org/abs/2310.04945
  • repo_url: None
  • paper_authors: Zheng Zhang, Chen Zheng, Da Tang, Ke Sun, Yukun Ma, Yingtong Bu, Xun Zhou, Liang Zhao
  • for: This paper presents a methodology for fine-tuning and evaluating large language models (LLMs) for specialized monetization tasks.
  • methods: The methodology has three components: 1) carefully blending in-domain and general-purpose data during fine-tuning to achieve an optimal balance between general and specialized capabilities; 2) designing a comprehensive evaluation framework of 45 questions assessing functionally relevant dimensions such as reliability, consistency, and business impact; 3) analyzing how model size and continual training influence metrics to guide efficient resource allocation.
  • results: The results show that these strategies balance general language proficiency with specialized skills and provide actionable guidance for businesses and researchers adapting LLMs to specialized tasks.
    Abstract This paper introduces a multifaceted methodology for fine-tuning and evaluating large language models (LLMs) for specialized monetization tasks. The goal is to balance general language proficiency with domain-specific skills. The methodology has three main components: 1) Carefully blending in-domain and general-purpose data during fine-tuning to achieve an optimal balance between general and specialized capabilities; 2) Designing a comprehensive evaluation framework with 45 questions tailored to assess performance on functionally relevant dimensions like reliability, consistency, and business impact; 3) Analyzing how model size and continual training influence metrics to guide efficient resource allocation during fine-tuning. The paper details the design, data collection, analytical techniques, and results validating the proposed frameworks. It aims to provide businesses and researchers with actionable insights on effectively adapting LLMs for specialized contexts. We also intend to make public the comprehensive evaluation framework, which includes the 45 tailored questions and their respective scoring guidelines, to foster transparency and collaboration in adapting LLMs for specialized tasks.
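    The first component, blending in-domain and general-purpose data at a chosen ratio, reduces to a simple mixing routine such as the sketch below (the 30% in-domain ratio is an illustrative assumption, not the paper's recommended value).

```python
import random

def blend_datasets(in_domain, general, domain_ratio=0.3, seed=0):
    """Build a fine-tuning mixture with a fixed fraction of in-domain examples,
    sampling the general-purpose corpus to fill the remainder."""
    rng = random.Random(seed)
    n_general = int(len(in_domain) * (1 - domain_ratio) / domain_ratio)
    mixture = list(in_domain) + rng.sample(list(general), min(n_general, len(general)))
    rng.shuffle(mixture)
    return mixture

# e.g. 30% in-domain, 70% general-purpose examples in the fine-tuning set
train_set = blend_datasets(["d1", "d2", "d3"], [f"g{i}" for i in range(100)])
```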

Reliable Test-Time Adaptation via Agreement-on-the-Line

  • paper_url: http://arxiv.org/abs/2310.04941
  • repo_url: None
  • paper_authors: Eungyeup Kim, Mingjie Sun, Aditi Raghunathan, Zico Kolter
  • for: This work addresses the reliability of test-time adaptation (TTA) methods, in particular how to evaluate adapted models, calibrate them, and tune their hyperparameters under distribution shift.
  • methods: The study conducts extensive experiments across a wide range of TTA methods, model architectures, and adaptation hyperparameters to assess their reliability under many distribution shifts.
  • results: TTAed models strongly exhibit the agreement-on-the-line phenomenon across a wide range of distribution shifts. Building on this, the paper proposes methods to (i) estimate OOD accuracy without labels to determine when TTA helps or hurts, (ii) calibrate TTAed models without label information, and (iii) reliably tune TTA hyperparameters without labeled validation data. Extensive experiments show these unsupervised estimates come close to results obtained with ground-truth labels, in terms of both OOD accuracy and calibration error.
    Abstract Test-time adaptation (TTA) methods aim to improve robustness to distribution shifts by adapting models using unlabeled data from the shifted test distribution. However, there remain unresolved challenges that undermine the reliability of TTA, which include difficulties in evaluating TTA performance, miscalibration after TTA, and unreliable hyperparameter tuning for adaptation. In this work, we make a notable and surprising observation that TTAed models strongly show the agreement-on-the-line phenomenon (Baek et al., 2022) across a wide range of distribution shifts. We find such linear trends occur consistently in a wide range of models adapted with various hyperparameters, and persist in distributions where the phenomenon fails to hold in vanilla models (i.e., before adaptation). We leverage these observations to make TTA methods more reliable in three perspectives: (i) estimating OOD accuracy (without labeled data) to determine when TTA helps and when it hurts, (ii) calibrating TTAed models without label information, and (iii) reliably determining hyperparameters for TTA without any labeled validation data. Through extensive experiments, we demonstrate that various TTA methods can be precisely evaluated, both in terms of their improvements and degradations. Moreover, our proposed methods on unsupervised calibration and hyperparameters tuning for TTA achieve results close to the ones assuming access to ground-truth labels, in terms of both OOD accuracy and calibration error.
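    The label-free accuracy estimation step can be sketched as follows: measure agreement between pairs of adapted models on unlabeled shifted data and read off accuracy from a line fitted on held-out in-distribution points. This is a simplified illustration on the raw scale; the underlying agreement-on-the-line method works on probit-transformed values, which is omitted here.

```python
import numpy as np

def agreement(preds_a, preds_b):
    """Fraction of unlabeled test inputs on which two adapted models agree."""
    return float(np.mean(np.asarray(preds_a) == np.asarray(preds_b)))

def fit_acc_from_agreement(id_accuracies, id_agreements):
    """Fit the linear trend accuracy ~ a * agreement + b on labeled ID data."""
    a, b = np.polyfit(id_agreements, id_accuracies, deg=1)
    return a, b

def estimate_ood_accuracy(a, b, ood_agreement):
    """Predict OOD accuracy from OOD agreement via the fitted line (no labels)."""
    return a * ood_agreement + b
```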

Diff-Transfer: Model-based Robotic Manipulation Skill Transfer via Differentiable Physics Simulation

  • paper_url: http://arxiv.org/abs/2310.04930
  • repo_url: None
  • paper_authors: Yuqi Xiang, Feitong Chen, Qinsi Wang, Yang Gang, Xiang Zhang, Xinghao Zhu, Xingyu Liu, Lin Shao
  • for: This paper proposes a skill-transfer framework based on differentiable physics simulation, enabling robots to quickly transfer mastered skills to similar yet novel manipulation tasks.
  • methods: The proposed framework, Diff-Transfer, discovers a feasible path in task space from the source task to the target task; at each pair of adjacent sub-tasks along the path, it adapts known actions from one sub-task to solve the other, guided by gradient information from differentiable physics simulation. A path-planning method based on Q-learning with a task-level state and reward generates the sub-tasks.
  • results: In simulation, Diff-Transfer successfully executes four challenging robotic manipulation transfer tasks; details and videos are at https://sites.google.com/view/difftransfer.
    Abstract The capability to transfer mastered skills to accomplish a range of similar yet novel tasks is crucial for intelligent robots. In this work, we introduce $\textit{Diff-Transfer}$, a novel framework leveraging differentiable physics simulation to efficiently transfer robotic skills. Specifically, $\textit{Diff-Transfer}$ discovers a feasible path within the task space that brings the source task to the target task. At each pair of adjacent points along this task path, which is two sub-tasks, $\textit{Diff-Transfer}$ adapts known actions from one sub-task to tackle the other sub-task successfully. The adaptation is guided by the gradient information from differentiable physics simulations. We propose a novel path-planning method to generate sub-tasks, leveraging $Q$-learning with a task-level state and reward. We implement our framework in simulation experiments and execute four challenging transfer tasks on robotic manipulation, demonstrating the efficacy of $\textit{Diff-Transfer}$ through comprehensive experiments. Supplementary and Videos are on the website https://sites.google.com/view/difftransfer

Crystal: Introspective Reasoners Reinforced with Self-Feedback

  • paper_url: http://arxiv.org/abs/2310.04921
  • repo_url: https://github.com/liujch1998/crystal
  • paper_authors: Jiacheng Liu, Ramakanth Pasunuru, Hannaneh Hajishirzi, Yejin Choi, Asli Celikyilmaz
  • for: The paper aims to improve the performance and interpretability of commonsense reasoning using knowledge-augmented reasoning methods.
  • methods: The proposed method, Crystal, introspects for knowledge statements related to a given question and then makes an informed prediction grounded in the introspected knowledge; the knowledge introspection and knowledge-grounded reasoning modes are tuned via reinforcement learning to mutually adapt.
  • results: Crystal significantly outperforms both standard supervised fine-tuning and chain-of-thought distilled methods, and enhances the transparency of the commonsense reasoning process.
    Abstract Extensive work has shown that the performance and interpretability of commonsense reasoning can be improved via knowledge-augmented reasoning methods, where the knowledge that underpins the reasoning process is explicitly verbalized and utilized. However, existing implementations, including "chain-of-thought" and its variants, fall short in capturing the introspective nature of knowledge required in commonsense reasoning, and in accounting for the mutual adaptation between the generation and utilization of knowledge. We propose a novel method to develop an introspective commonsense reasoner, Crystal. To tackle commonsense problems, it first introspects for knowledge statements related to the given question, and subsequently makes an informed prediction that is grounded in the previously introspected knowledge. The knowledge introspection and knowledge-grounded reasoning modes of the model are tuned via reinforcement learning to mutually adapt, where the reward derives from the feedback given by the model itself. Experiments show that Crystal significantly outperforms both the standard supervised finetuning and chain-of-thought distilled methods, and enhances the transparency of the commonsense reasoning process. Our work ultimately validates the feasibility and potential of reinforcing a neural model with self-feedback.

Robust Network Pruning With Sparse Entropic Wasserstein Regression

  • paper_url: http://arxiv.org/abs/2310.04918
  • repo_url: None
  • paper_authors: Lei You, Hei Victor Cheng
  • for: This work presents an efficient neural network pruning technique that explicitly addresses noisy gradients encountered when computing the empirical Fisher Information Matrix (FIM).
  • methods: An Entropic Wasserstein Regression (EWR) formulation exploits the geometry of the optimal transport (OT) problem and mitigates noise by interpolating over neighborhoods of data points.
  • results: Extensive experiments on various network models show performance comparable to state-of-the-art (SoTA) pruning algorithms, with larger gains when the network size or target sparsity is large and when gradients are noisy (e.g., from noisy data, analog memory, or adversarial attacks); for MobileNetV1 with less than a quarter of the parameters remaining, the method gains about 6% in accuracy and 8% in test loss.
    Abstract This study unveils a cutting-edge technique for neural network pruning that judiciously addresses noisy gradients during the computation of the empirical Fisher Information Matrix (FIM). We introduce an entropic Wasserstein regression (EWR) formulation, capitalizing on the geometric attributes of the optimal transport (OT) problem. This is analytically showcased to excel in noise mitigation by adopting neighborhood interpolation across data points. The unique strength of the Wasserstein distance is its intrinsic ability to strike a balance between noise reduction and covariance information preservation. Extensive experiments performed on various networks show comparable performance of the proposed method with state-of-the-art (SoTA) network pruning algorithms. Our proposed method outperforms the SoTA when the network size or the target sparsity is large, the gain is even larger with the existence of noisy gradients, possibly from noisy data, analog memory, or adversarial attacks. Notably, our proposed method achieves a gain of 6% improvement in accuracy and 8% improvement in testing loss for MobileNetV1 with less than one-fourth of the network parameters remaining.

On Accelerating Diffusion-based Molecular Conformation Generation in SE(3)-invariant Space

  • paper_url: http://arxiv.org/abs/2310.04915
  • repo_url: None
  • paper_authors: Zihan Zhou, Ruiying Liu, Tianshu Yu
  • for: This work aims to accelerate diffusion-based molecular conformation generation in SE(3)-invariant space to make it more efficient for real-world applications.
  • methods: The study analyzes the approximation errors induced by existing methods to understand the diffusion mechanism in SE(3)-invariant space, and based on this develops more precise approximations in the context of projected differential equations along with a novel acceleration scheme.
  • results: Experiments show the acceleration scheme generates high-quality conformations with a 50x to 100x speedup over existing methods.
    Abstract Diffusion-based generative models in SE(3)-invariant space have demonstrated promising performance in molecular conformation generation, but typically require solving stochastic differential equations (SDEs) with thousands of update steps. Till now, it remains unclear how to effectively accelerate this procedure explicitly in SE(3)-invariant space, which greatly hinders its wide application in the real world. In this paper, we systematically study the diffusion mechanism in SE(3)-invariant space via the lens of approximation errors induced by existing methods. Thereby, we develop more precise approximations in SE(3) in the context of projected differential equations. Theoretical analysis is further provided as well as empirical proof relating hyper-parameters with such errors. Altogether, we propose a novel acceleration scheme for generating molecular conformations in SE(3)-invariant space. Experimentally, our scheme can generate high-quality conformations with 50x--100x speedup compared to existing methods.
    摘要 Diffusion-based生成模型在SE(3)-不变空间中表现出了优秀的表现,但通常需要解决数千步隐藏微分方程(SDEs)。直到现在,未知如何明确地加速这个过程,这大大阻碍了它在实际世界中的广泛应用。在这篇论文中,我们系统地研究了SE(3)-不变空间中的扩散机制,通过对现有方法的误差引入的精度来评估。此外,我们还提供了相关的理论分析和实验证明,关于参数与这些误差之间的关系。总之,我们提出了一种新的加速方案,可以在SE(3)-不变空间中高速生成分子结构。实验表明,我们的方案可以比现有方法快速50-100倍,生成高质量的分子结构。

Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks

  • paper_url: http://arxiv.org/abs/2310.04914
  • repo_url: None
  • paper_authors: Avinash Madasu, Anahita Bhiwandiwalla, Vasudev Lal
  • for: 这些论文的目的是研究基础的多模态模型是否可以适应视频任务,以及这种方法的效果。
  • methods: 这些论文使用的方法是将图文模型适应视频任务,并对这种方法的性能进行评估。
  • results: 研究发现,基础的图文模型在视频理解任务上表现出色,特别是在视频识别、视频检索和视频多选任务上。然而,它们在视频问答和视频描述任务上表现较差。这些发现反映了将图文模型适应视频任务的效果。
    Abstract Foundational multimodal models pre-trained on large scale image-text pairs or video-text pairs or both have shown strong generalization abilities on downstream tasks. However unlike image-text models, pretraining video-text models is always not feasible due to the difficulty in collecting large-scale clean and aligned data, and exponential computational costs involved in the pretraining phase. Therefore, the pertinent question to ask is: Can image-text models be adapted to video tasks and is there any benefit to using these models over pretraining directly on videos? In this work, we focus on this question by proposing a detailed study on the generalization abilities of image-text models when evaluated on video understanding tasks in a zero-shot setting. We investigate 9 foundational image-text models on a diverse set of video tasks that include video action recognition (video AR), video retrieval (video RT), video question answering (video QA), video multiple choice (video MC) and video captioning (video CP). Our experiments show that image-text models exhibit impressive performance on video AR, video RT and video MC. Furthermore, they perform moderately on video captioning and poorly on video QA. These findings shed a light on the benefits of adapting foundational image-text models to an array of video tasks while avoiding the costly pretraining step.
    摘要 基础多Modal模型在大规模图片文本对或视频文本对上预训练后显示出强大的通用能力。然而,与图片文本模型不同,不可能在预训练视频文本模型,因为收集大规模干净对齐数据的困难,以及预训练阶段的计算成本呈指数增长。因此,关键的问题是:图片文本模型能否适应视频任务,是否有任何优势使用这些模型而不是直接预训练在视频上?在这项工作中,我们关注这个问题,通过对Foundational image-text模型在视频理解任务上的总体能力进行详细的研究。我们对9种基础图片文本模型在多种视频任务上进行了多样化的测试,包括视频动作识别(视频AR)、视频检索(视频RT)、视频问答(视频QA)、视频多选(视频MC)和视频描述(视频CP)。我们的实验结果表明,图片文本模型在视频AR、视频RT和视频MC方面表现出色,而在视频描述和视频QA方面表现较差。这些发现反映了适应多种视频任务的图片文本模型,而不需要费时的预训练步骤。
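A minimal sketch of the standard zero-shot recipe implied here: encode uniformly sampled frames with a frozen image-text model, mean-pool them into a video embedding, and score it against text prompts. The `encode_image`/`encode_text`/tokenizer interface is an assumed CLIP-style API, not the paper's exact evaluation protocol.

```python
import torch

@torch.no_grad()
def zero_shot_video_scores(model, frames, class_prompts, tokenizer):
    """Score a video against text prompts with a frozen image-text model.

    Assumes a CLIP-style interface: model.encode_image(pixels) and
    model.encode_text(tokens) return embeddings. `frames` holds uniformly
    sampled frames with shape (num_frames, 3, H, W).
    """
    frame_emb = model.encode_image(frames)                   # (T, d)
    video_emb = frame_emb.mean(dim=0, keepdim=True)          # temporal mean-pool
    video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)

    tokens = tokenizer(class_prompts)
    text_emb = model.encode_text(tokens)                     # (C, d)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    return (video_emb @ text_emb.T).squeeze(0)               # cosine similarity per class

# hypothetical usage:
# scores = zero_shot_video_scores(clip_model, frames,
#     ["a video of someone swimming", "a video of someone cooking"], clip_tokenizer)
# probs = scores.softmax(-1)
```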

Faithful Knowledge Graph Explanations for Commonsense Reasoning

  • paper_url: http://arxiv.org/abs/2310.04910
  • repo_url: None
  • paper_authors: Weihe Zhai, Arkaitz Zubiaga, Bingquan Liu
  • for: 本研究旨在提高知识图(KG)基于解释的准确性和可靠性。
  • methods: 本研究提出了两项主要贡献:首先,我们提出了两种量化指标——图共识度和图准确度——来衡量知识图基于解释的准确性。其次,我们引入了一种新的培训方法,即具有一定的一致性规范的Consistent GNN(CGNN),以提高解释的准确性。
  • results: 我们的分析表明,使用原始模型预测的方法可能会导致知识图中的预测结果与原始模型预测结果不同。而我们提出的CGNN方法能够提高图共识度和图准确度,这表明了它在生成更准确的解释方面的潜力。
    Abstract While fusing language models (LMs) and knowledge graphs (KGs) has become common in commonsense question answering research, enabling faithful chain-of-thought explanations in these models remains an open problem. One major weakness of current KG-based explanation techniques is that they overlook the faithfulness of generated explanations during evaluation. To address this gap, we make two main contributions: (1) We propose and validate two quantitative metrics - graph consistency and graph fidelity - to measure the faithfulness of KG-based explanations. (2) We introduce Consistent GNN (CGNN), a novel training method that adds a consistency regularization term to improve explanation faithfulness. Our analysis shows that predictions from KG often diverge from original model predictions. The proposed CGNN approach boosts consistency and fidelity, demonstrating its potential for producing more faithful explanations. Our work emphasises the importance of explicitly evaluating explanation faithfulness and suggests a path forward for developing architectures for faithful graph-based explanations.
    摘要 当拓展语言模型(LM)和知识图(KG)的研究成为常见的现象,使得 faithful chain-of-thought 解释在这些模型中保持开放问题。现有的 KG 解释技术的一个主要弱点是忽略生成的解释的忠实程度 durante la evaluación。为了解决这个漏洞,我们作出了两个主要贡献:1. 我们提出了两个量化指标 - 图共识性和图准确性 - 来衡量 KG 解释的忠实度。2. 我们介绍了一种新的训练方法,称为 Consistent GNN(CGNN),该方法添加了一个准确性规则来提高解释的忠实度。我们的分析表明,KG 的预测结果与原始模型的预测结果经常存在差异。CGNN 方法可以提高准确性和忠实度,这表明其在生成更 faithful 的解释方面具有潜在的优势。我们的工作强调了评估 KG 解释的忠实度的重要性,并建议一种发展 faithful graph-based 解释体系的可能之路。
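The abstract does not spell out the consistency regularization term, so the following is only one plausible instantiation: a KL penalty pulling the KG-grounded prediction toward the original model's prediction, added to the task loss. The names and the weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def cgnn_loss(fused_logits, lm_logits, labels, lam=0.5):
    """Task loss plus a consistency regularizer between the KG-fused prediction
    and the original LM-only prediction (one plausible form, not necessarily
    the paper's exact formulation)."""
    task = F.cross_entropy(fused_logits, labels)
    consistency = F.kl_div(
        F.log_softmax(fused_logits, dim=-1),
        F.softmax(lm_logits.detach(), dim=-1),
        reduction="batchmean",
    )
    return task + lam * consistency

# toy usage
fused = torch.randn(8, 5, requires_grad=True)
lm = torch.randn(8, 5)
y = torch.randint(0, 5, (8,))
loss = cgnn_loss(fused, lm, y)
loss.backward()
```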

Generative AI May Prefer to Present National-level Characteristics of Cities Based on Stereotypical Geographic Impressions at the Continental Level

  • paper_url: http://arxiv.org/abs/2310.04897
  • repo_url: None
  • paper_authors: Shan Ye
  • for: 测试中文基于生成人工智能平台“文心一个”的图像渲染能力,以及该平台是否会描绘出不同国家城市景观的多样性。
  • methods: 通过使用“文心一个”平台生成不同国家城市街景图像,然后对这些图像进行分析和评估,以了解该平台的图像渲染能力和可能存在的偏见。
  • results: 研究发现,“文心一个”平台生成的图像可能带有大洲水平的偏见,表现出不同国家城市的经济发展水平和现代化程度。此外,这些生成图像不能充分表达不同国家城市的多样性。使用这些图像进行地理教育或宣传活动可能会巩固人们对各国的偏见。
    Abstract A simple experiment was conducted to test the ability of the Chinese-based generative artificial intelligence (AI) platform, Wenxin Yige, to render images of urban street views of different countries. The study found that images generated by this AI platform may contain continental-level stereotypes in terms of showing the level of economic development and modernization. Street view images generated from Wenxin Yige do not adequately represent the diverse range of urban landscapes found across different nations. Using these generated images for geography education or outreach initiatives could inadvertently strengthen people's existing stereotypical views about individual countries.
    摘要 一项简单的实验测试了基于中文的生成式人工智能平台“文心易歌”的图像生成能力,以测试它是否能够生成不同国家城市视图的图像。研究发现,由这个AI平台生成的图像可能带有大洲水平的刻板印象,表现出不同国家的经济发展和现代化水平。这些生成的图像无法准确表达不同国家城市的多样化风貌,使用这些图像进行地理教育或宣传活动可能会巩固人们对各国的刻板印象。

Cell Tracking-by-detection using Elliptical Bounding Boxes

  • paper_url: http://arxiv.org/abs/2310.04895
  • repo_url: https://github.com/LucasKirsten/Deep-Cell-Tracking-EBB
  • paper_authors: Lucas N. Kirsten, Cláudio R. Jung
  • for: The purpose of this paper is to propose a new approach based on the classical tracking-by-detection paradigm for cell detection and tracking, which alleviates the need for extensive annotated data.
  • methods: The method approximates cell shapes as oriented ellipses and uses generic-purpose oriented object detectors to identify cells in each frame. A global data association algorithm explores temporal cell similarity using probability distance metrics.
  • results: The method achieves detection and tracking results competitive with state-of-the-art techniques that require considerably more extensive data annotation.
    Abstract Cell detection and tracking are paramount for bio-analysis. Recent approaches rely on the tracking-by-model evolution paradigm, which usually consists of training end-to-end deep learning models to detect and track the cells on the frames with promising results. However, such methods require extensive amounts of annotated data, which is time-consuming to obtain and often requires specialized annotators. This work proposes a new approach based on the classical tracking-by-detection paradigm that alleviates the requirement of annotated data. More precisely, it approximates the cell shapes as oriented ellipses and then uses generic-purpose oriented object detectors to identify the cells in each frame. We then rely on a global data association algorithm that explores temporal cell similarity using probability distance metrics, considering that the ellipses relate to two-dimensional Gaussian distributions. Our results show that our method can achieve detection and tracking results competitively with state-of-the-art techniques that require considerably more extensive data annotation. Our code is available at: https://github.com/LucasKirsten/Deep-Cell-Tracking-EBB.
    摘要 维度分析中的细胞检测和跟踪是非常重要的。现有的方法大多基于跟踪-by-模型演化 paradigm,通常是通过训练端到终的深度学习模型来检测和跟踪细胞在帧中的承诺果。然而,这些方法需要庞大量的注解数据,它们是时间消耗的和特殊的注解员。本工作提出了一种新的方法,基于经典的跟踪-by-检测 paradigm,可以减少注解数据的需求。更准确地说,我们将细胞形状 aproximated为方向几何体,然后使用通用的方向对象检测器来在每帧中识别细胞。我们然后采用了全球数据协调算法,通过考虑细胞形状相似性的时间序列距离度量,来实现细胞跟踪。我们的结果表明,我们的方法可以与state-of-the-art技术相比,实现检测和跟踪结果,而不需要庞大量的注解数据。我们的代码可以在:https://github.com/LucasKirsten/Deep-Cell-Tracking-EBB中找到。
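Since an oriented ellipse corresponds to a two-dimensional Gaussian, one natural probability distance for the temporal association step is the Bhattacharyya distance; the sketch below shows that computation (the paper's exact metric may differ, and the ellipse values are illustrative).

```python
import numpy as np

def ellipse_to_gaussian(cx, cy, a, b, theta):
    """Treat an oriented ellipse (center, semi-axes a >= b, angle theta) as a
    2D Gaussian: mean at the center, covariance aligned with the axes."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    cov = R @ np.diag([a ** 2, b ** 2]) @ R.T
    return np.array([cx, cy]), cov

def bhattacharyya(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two Gaussians, used here as a temporal
    similarity score between detections in consecutive frames."""
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term1 + term2

mu_a, cov_a = ellipse_to_gaussian(10, 12, 6, 3, 0.30)
mu_b, cov_b = ellipse_to_gaussian(11, 13, 6, 3, 0.35)
print(bhattacharyya(mu_a, cov_a, mu_b, cov_b))
```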

Question-focused Summarization by Decomposing Articles into Facts and Opinions and Retrieving Entities

  • paper_url: http://arxiv.org/abs/2310.04880
  • repo_url: None
  • paper_authors: Krutika Sarode, Shashidhar Reddy Javaji, Vishal Kalakonnavar
  • for: 这个研究旨在利用自然语言处理技术预测股票价格波动,具体来说是早期发现经济、政治、社会和技术变革,以便捕捉市场机会。
  • methods: 该方法包括从新闻文章中提取突出的事实,然后将这些事实与实体组合成 tuples,并使用这些 tuples 获取市场变化的摘要,最后将所有摘要合并为整个文章的摘要总结。
  • results: 研究希望通过分析Wikipedia数据和经济学人报道来建立公司和实体之间的关系,并使用大语言模型GPT 3.5来获取摘要和形成最终摘要。研究的最终目标是开发一个全面的系统,为金融分析师和投资者提供更加有知见的决策工具,以便早期发现市场趋势和事件。
    Abstract This research focuses on utilizing natural language processing techniques to predict stock price fluctuations, with a specific interest in early detection of economic, political, social, and technological changes that can be leveraged for capturing market opportunities. The proposed approach includes the identification of salient facts and events from news articles, then use these facts to form tuples with entities which can be used to get summaries of market changes for particular entity and then finally combining all the summaries to form a final abstract summary of the whole article. The research aims to establish relationships between companies and entities through the analysis of Wikipedia data and articles from the Economist. Large Language Model GPT 3.5 is used for getting the summaries and also forming the final summary. The ultimate goal of this research is to develop a comprehensive system that can provide financial analysts and investors with more informed decision-making tools by enabling early detection of market trends and events.
    摘要 这项研究探讨了使用自然语言处理技术预测股票价格波动,具体来说是早期检测经济、政治、社会和技术变化,以便捕捉市场机会。提出的方法包括从新闻文章中提取重要的事实,然后将这些事实与实体组合成Tuple,并使用这些Tuple获取市场变化的摘要。最终,将所有摘要合并为总摘要。研究的目标是通过分析Wikipedia数据和经济学人报道来建立公司和实体之间的关系。使用大语言模型GPT 3.5来获取摘要和组合总摘要。该研究的最终目标是开发一个全面的系统,为金融分析师和投资者提供更多的 Informed Decision-making 工具,以便早期检测市场趋势和事件。

Hybrid Recommendation System using Graph Neural Network and BERT Embeddings

  • paper_url: http://arxiv.org/abs/2310.04878
  • repo_url: None
  • paper_authors: Shashidhar Reddy Javaji, Krutika Sarode
  • for: 这种模型是为了提供个性化的动画推荐,以满足不同用户的兴趣和需求。
  • methods: 该模型使用图神经网络(GNN)和句子转换器嵌入来预测不同用户的动画推荐,同时考虑了动画的特征和用户对不同动画的交互。
  • results: 该模型不仅可以为用户提供个性化的动画推荐,还可以预测特定用户对某部动画的评分。
    Abstract Recommender systems have emerged as a crucial component of the modern web ecosystem. The effectiveness and accuracy of such systems are critical for providing users with personalized recommendations that meet their specific interests and needs. In this paper, we introduce a novel model that utilizes a Graph Neural Network (GNN) in conjunction with sentence transformer embeddings to predict anime recommendations for different users. Our model employs the task of link prediction to create a recommendation system that considers both the features of anime and user interactions with different anime. The hybridization of the GNN and transformer embeddings enables us to capture both inter-level and intra-level features of anime data.Our model not only recommends anime to users but also predicts the rating a specific user would give to an anime. We utilize the GraphSAGE network for model building and weighted root mean square error (RMSE) to evaluate the performance of the model. Our approach has the potential to significantly enhance the accuracy and effectiveness of anime recommendation systems and can be extended to other domains that require personalized recommendations.
    摘要 现代网络生态系统中,推荐系统已成为一种重要的组成部分。推荐系统的有效性和准确性对于为用户提供个性化推荐是非常重要的。在这篇论文中,我们介绍了一种新的模型,该模型利用图神经网络(GNN)和句子转换器嵌入来预测不同用户的动画推荐。我们的模型通过链接预测任务来创建一个考虑用户和动画特征的推荐系统。我们将GNN和句子转换器嵌入结合使用,以便捕捉动画数据中的内部和外部特征。我们的模型不仅为用户推荐动画,还预测特定用户对于某个动画的评分。我们使用GraphSAGE网络进行模型建立,并使用Weighted Root Mean Square Error(RMSE)来评估模型的性能。我们的方法有可能在动画推荐系统的准确性和有效性方面带来显著改进,并可以扩展到其他需要个性化推荐的领域。
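The exact weighting behind the reported weighted RMSE is not specified, so the sketch below assumes a generic per-sample weight (for instance, per-user interaction counts):

```python
import numpy as np

def weighted_rmse(y_true, y_pred, weights):
    """Weighted root mean square error; `weights` could e.g. reflect how many
    interactions each user has (the exact weighting is an assumption here)."""
    weights = np.asarray(weights, dtype=float)
    sq_err = (np.asarray(y_true) - np.asarray(y_pred)) ** 2
    return np.sqrt(np.sum(weights * sq_err) / np.sum(weights))

print(weighted_rmse([8.0, 6.5, 9.0], [7.5, 7.0, 8.0], [1.0, 2.0, 0.5]))
```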

AirIMU: Learning Uncertainty Propagation for Inertial Odometry

  • paper_url: http://arxiv.org/abs/2310.04874
  • repo_url: None
  • paper_authors: Yuheng Qiu, Chen Wang, Xunfei Zhou, Youjie Xia, Sebastian Scherer
  • for: 增强各种传感器系统的优化融合,例如视觉或LiDAR激光测距仪。
  • methods: 学习基于方法,考虑感知器的非线性特性和各种传感器模型。
  • results: 在多个公开 benchmark 和一个大规模直升机数据集上,将惯性里程计的漂移率降低了 2.2 到 4 倍。
    Abstract Accurate uncertainty estimation for inertial odometry is the foundation to achieve optimal fusion in multi-sensor systems, such as visual or LiDAR inertial odometry. Prior studies often simplify the assumptions regarding the uncertainty of inertial measurements, presuming fixed covariance parameters and empirical IMU sensor models. However, the inherent physical limitations and non-linear characteristics of sensors are difficult to capture. Moreover, uncertainty may fluctuate based on sensor rates and motion modalities, leading to variations across different IMUs. To address these challenges, we formulate a learning-based method that not only encapsulate the non-linearities inherent to IMUs but also ensure the accurate propagation of covariance in a data-driven manner. We extend the PyPose library to enable differentiable batched IMU integration with covariance propagation on manifolds, leading to significant runtime speedup. To demonstrate our method's adaptability, we evaluate it on several benchmarks as well as a large-scale helicopter dataset spanning over 262 kilometers. The drift rate of the inertial odometry on these datasets is reduced by a factor of between 2.2 and 4 times. Our method lays the groundwork for advanced developments in inertial odometry.
    摘要 准确的不确定性估计是多感器系统中征印底层的基础,以实现最优的融合。过去的研究常常简化了涉及到涨动测量的不确定性的假设,假设IMU传感器的covariance参数是固定的,并使用经验测量模型。然而,涉及到传感器的物理限制和非线性特性很难捕捉。此外,不确定性可能会随着传感器的读取速率和运动模式而变化,导致不同的IMU传感器之间存在差异。为了解决这些挑战,我们提出了一种学习基于的方法,不仅能够捕捉IMU传感器的非线性特性,还能够确保数据驱动的 covariance 的准确传播。我们将 PyPose 库扩展以实现批处理的 IMU 融合,从而实现了显著的运行速度提升。为了证明我们的方法的适应性,我们对多个 benchmark 和一个大规模的直升机数据集进行了评估,数据集涵盖了262公里的距离。在这些数据集上,IMU 的涨动率被降低了2.2-4倍。我们的方法为高级发展征印底层提供了基础。
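For context, the classical discrete covariance propagation step that a learned IMU model replaces or corrects looks like the following; the state, noise values, and time step are illustrative, not the paper's model.

```python
import numpy as np

def propagate_covariance(P, F, Q):
    """One step of discrete covariance propagation P <- F P F^T + Q.
    In classical IMU preintegration, Q comes from fixed sensor noise densities;
    a learned model can instead predict Q (or corrections to it) per sample."""
    return F @ P @ F.T + Q

# toy example: state [position, velocity], dt = 0.01 s
dt = 0.01
F = np.array([[1.0, dt],
              [0.0, 1.0]])
Q = np.diag([1e-6, 1e-4])          # assumed process noise, not calibrated values
P = np.eye(2) * 1e-3
for _ in range(100):               # one second of propagation
    P = propagate_covariance(P, F, Q)
print(P)
```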

Lemur: Integrating Large Language Models in Automated Program Verification

  • paper_url: http://arxiv.org/abs/2310.04870
  • repo_url: None
  • paper_authors: Haoze Wu, Clark Barrett, Nina Narodytska
  • for: automated program verification
  • methods: combines the power of LLMs and automated reasoners
  • results: practical improvements on a set of synthetic and competition benchmarks
    Abstract The demonstrated code-understanding capability of LLMs raises the question of whether they can be used for automated program verification, a task that often demands high-level abstract reasoning about program properties, which is challenging for verification tools. We propose a general methodology to combine the power of LLMs and automated reasoners for automated program verification. We formally describe this methodology as a set of derivation rules and prove its soundness. We instantiate the calculus as a sound automated verification procedure, which led to practical improvements on a set of synthetic and competition benchmarks.
    摘要 LLM 所展示的代码理解能力引出了一个问题:它们能否用于自动化程序验证。这项任务通常需要对程序属性进行高层次的抽象推理,对验证工具而言颇具挑战。我们提出了一种将 LLM 与自动推理器结合的一般方法,将其形式化为一组推导规则并证明其可靠性。我们将该演算实例化为一个可靠的自动验证过程,并在一组合成与竞赛 benchmark 上取得了实际改进。

ILuvUI: Instruction-tuned LangUage-Vision modeling of UIs from Machine Conversations

  • paper_url: http://arxiv.org/abs/2310.04869
  • repo_url: None
  • paper_authors: Yue Jiang, Eldon Schoop, Amanda Swearngin, Jeffrey Nichols
  • for: 这篇论文目的是为了提高对UI任务的识别能力,并且不需要人类提供纠正注解。
  • methods: 该方法 combining existing pixel-based methods with a Large Language Model (LLM),可以应用于任何UI屏幕截图数据集。
  • results: 该研究生成了335,000个对话示例,并使用它们来练化一个对话型VLM进行UI任务。研究还评估了模型的性能,包括UI元素检测任务、回答质量和多步UI导航和规划等。
    Abstract Multimodal Vision-Language Models (VLMs) enable powerful applications from their fused understanding of images and language, but many perform poorly on UI tasks due to the lack of UI training data. In this paper, we adapt a recipe for generating paired text-image training data for VLMs to the UI domain by combining existing pixel-based methods with a Large Language Model (LLM). Unlike prior art, our method requires no human-provided annotations, and it can be applied to any dataset of UI screenshots. We generate a dataset of 335K conversational examples paired with UIs that cover Q&A, UI descriptions, and planning, and use it to fine-tune a conversational VLM for UI tasks. To assess the performance of our model, we benchmark it on UI element detection tasks, evaluate response quality, and showcase its applicability to multi-step UI navigation and planning.
    摘要 多Modal视觉语言模型(VLM)允许强大的应用由图像和语言的融合理解,但许多perform poorly on UI任务由于缺乏UI训练数据。在这篇论文中,我们适应了一个方法来生成图像和文本的对应训练数据 для VLM,并将其应用到UI领域。与先前艺术不同,我们的方法不需要人工提供笔记,并且可以应用于任何UI屏幕截图集。我们生成了335K的对话示例,与UI描述、问答和规划相关,并用它们来精心UI任务。为评估我们模型的性能,我们对UI元素检测任务进行了测试,评估响应质量,并显示了其可应用于多步UI导航和规划。

ForeSeer: Product Aspect Forecasting Using Temporal Graph Embedding

  • paper_url: http://arxiv.org/abs/2310.04865
  • repo_url: None
  • paper_authors: Zixuan Liu, Gaurush Hiranandani, Kun Qian, Eddie W. Huang, Yi Xu, Belinda Zeng, Karthik Subbian, Sheng Wang
  • for: 预测新产品的未来升级特征
  • methods: 使用文本挖掘和产品嵌入approach,逐渐在时间产品图上进行训练
  • results: 与现有方法相比, ForeSeer 在实际 setting 中具有至少49.1%的 AUPRC 提升,并且在产品图和评论特征关联预测方面具有改善。
    Abstract Developing text mining approaches to mine aspects from customer reviews has been well-studied due to its importance in understanding customer needs and product attributes. In contrast, it remains unclear how to predict the future emerging aspects of a new product that currently has little review information. This task, which we named product aspect forecasting, is critical for recommending new products, but also challenging because of the missing reviews. Here, we propose ForeSeer, a novel textual mining and product embedding approach progressively trained on temporal product graphs for this novel product aspect forecasting task. ForeSeer transfers reviews from similar products on a large product graph and exploits these reviews to predict aspects that might emerge in future reviews. A key novelty of our method is to jointly provide review, product, and aspect embeddings that are both time-sensitive and less affected by extremely imbalanced aspect frequencies. We evaluated ForeSeer on a real-world product review system containing 11,536,382 reviews and 11,000 products over 3 years. We observe that ForeSeer substantially outperformed existing approaches with at least 49.1\% AUPRC improvement under the real setting where aspect associations are not given. ForeSeer further improves future link prediction on the product graph and the review aspect association prediction. Collectively, Foreseer offers a novel framework for review forecasting by effectively integrating review text, product network, and temporal information, opening up new avenues for online shopping recommendation and e-commerce applications.
    摘要 开发文本挖掘方法来挖掘用户评价中的方面,因为它对理解客户需求和产品特性非常重要。然而,尚未有效地预测新产品未经评价的方面。这个任务,我们称之为产品方面预测,是推荐新产品的关键任务,但也是非常困难的因为缺少评价。在这里,我们提出了 ForeSeer,一种新的文本挖掘和产品嵌入方法,通过在时间维度上进行模糊嵌入来预测未来评价中可能出现的方面。ForeSeer 可以从类似产品上的大规模产品图中传递评价,并利用这些评价来预测未来可能出现的方面。我们的方法的一个新特点是同时提供评价、产品和方面嵌入,这些嵌入不仅时间敏感,也受到方面频率异常高的影响。我们在一个真实的产品评价系统上进行了测试,包括11,536,382篇评价和11,000个产品,覆盖了3年的时间。我们发现 ForeSeer 在实际 Setting 下明显超过了现有方法,至少提高了49.1%的 AUPRC。ForeSeer 还改进了产品图中未来链接预测和评价方面关联预测。总之,ForeSeer 提供了一种新的评价预测框架,通过有效地结合评价文本、产品网络和时间信息,打开了在线购物推荐和电商应用的新 Avenues。

Uncovering hidden geometry in Transformers via disentangling position and context

  • paper_url: http://arxiv.org/abs/2310.04861
  • repo_url: https://github.com/jiajunsong629/uncover-hidden-geometry
  • paper_authors: Jiajun Song, Yiqiao Zhong
  • for: This paper aims to provide a simple yet informative decomposition of hidden states (or embeddings) of trained transformers into interpretable components, in order to gain structural insights about input formats in in-context learning and arithmetic tasks.
  • methods: The authors use a tensor representation of embedding vectors $\boldsymbol{h} \in \mathbb{R}^{C \times T \times d}$ to extract the mean effects and decompose the hidden states into interpretable components, including the global mean vector $\boldsymbol{\mu}$, the mean vectors across contexts and positions $\mathbf{pos}_t$ and $\mathbf{ctx}_c$, and the residual vector $\mathbf{resid}_{c,t}$.
  • results: The authors find that the decomposition yields a pervasive mathematical structure across popular transformer architectures and diverse text datasets, including a low-dimensional, continuous, and often spiral shape for the mean vectors across positions, clear cluster structure for the mean vectors across contexts, and mutual incoherence between the mean vectors across positions and contexts. These findings offer structural insights into the input formats of transformers and have implications for in-context learning and arithmetic tasks.
    Abstract Transformers are widely used to extract complex semantic meanings from input tokens, yet they usually operate as black-box models. In this paper, we present a simple yet informative decomposition of hidden states (or embeddings) of trained transformers into interpretable components. For any layer, embedding vectors of input sequence samples are represented by a tensor $\boldsymbol{h} \in \mathbb{R}^{C \times T \times d}$. Given embedding vector $\boldsymbol{h}_{c,t} \in \mathbb{R}^d$ at sequence position $t \le T$ in a sequence (or context) $c \le C$, extracting the mean effects yields the decomposition \[ \boldsymbol{h}_{c,t} = \boldsymbol{\mu} + \mathbf{pos}_t + \mathbf{ctx}_c + \mathbf{resid}_{c,t} \] where $\boldsymbol{\mu}$ is the global mean vector, $\mathbf{pos}_t$ and $\mathbf{ctx}_c$ are the mean vectors across contexts and across positions respectively, and $\mathbf{resid}_{c,t}$ is the residual vector. For popular transformer architectures and diverse text datasets, empirically we find pervasive mathematical structure: (1) $(\mathbf{pos}_t)_{t}$ forms a low-dimensional, continuous, and often spiral shape across layers, (2) $(\mathbf{ctx}_c)_c$ shows clear cluster structure that falls into context topics, and (3) $(\mathbf{pos}_t)_{t}$ and $(\mathbf{ctx}_c)_c$ are mutually incoherent -- namely $\mathbf{pos}_t$ is almost orthogonal to $\mathbf{ctx}_c$ -- which is canonical in compressed sensing and dictionary learning. This decomposition offers structural insights about input formats in in-context learning (especially for induction heads) and in arithmetic tasks.
    摘要 Transformer 广泛用于从输入 token 中提取复杂的语义信息,但它们通常作为黑盒模型运行。在本文中,我们提出了一种简单而富有信息量的方法,将训练好的 Transformer 的隐状态(或嵌入)分解为可解释的组成部分。对于任意一层,输入序列样本的嵌入向量可表示为张量 $\boldsymbol{h} \in \mathbb{R}^{C \times T \times d}$。对于序列(或上下文)$c \le C$ 中位置 $t \le T$ 处的嵌入向量 $\boldsymbol{h}_{c,t} \in \mathbb{R}^d$,提取均值效应可得到如下分解:$$\boldsymbol{h}_{c,t} = \boldsymbol{\mu} + \mathbf{pos}_t + \mathbf{ctx}_c + \mathbf{resid}_{c,t}$$其中 $\boldsymbol{\mu}$ 是全局均值向量,$\mathbf{pos}_t$ 和 $\mathbf{ctx}_c$ 分别是跨上下文和跨位置的均值向量,$\mathbf{resid}_{c,t}$ 是残差向量。对于流行的 Transformer 架构和多种文本数据集,我们在实验中发现了普遍存在的数学结构:(1) $(\mathbf{pos}_t)_t$ 在各层中形成低维、连续、且常呈螺旋状的形状;(2) $(\mathbf{ctx}_c)_c$ 表现出清晰的聚类结构,对应于上下文主题;(3) $(\mathbf{pos}_t)_t$ 与 $(\mathbf{ctx}_c)_c$ 互不相干,即 $\mathbf{pos}_t$ 与 $\mathbf{ctx}_c$ 几乎正交,这在压缩感知与字典学习中是典型性质。这一分解为 in-context learning(尤其是 induction heads)以及算术任务中的输入格式提供了结构性的洞见。
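The decomposition above is a plain two-way mean decomposition, which can be computed directly from a hidden-state tensor of shape (C, T, d); the tensor below is random and only illustrates the arithmetic.

```python
import numpy as np

def decompose_embeddings(h):
    """Mean decomposition of hidden states h with shape (C, T, d):
    h[c, t] = mu + pos[t] + ctx[c] + resid[c, t]."""
    mu = h.mean(axis=(0, 1))                         # global mean, (d,)
    pos = h.mean(axis=0) - mu                        # per-position effect, (T, d)
    ctx = h.mean(axis=1) - mu                        # per-context effect, (C, d)
    resid = h - mu - pos[None, :, :] - ctx[:, None, :]
    return mu, pos, ctx, resid

C, T, d = 4, 6, 8
h = np.random.randn(C, T, d)
mu, pos, ctx, resid = decompose_embeddings(h)
# the four components add back to h exactly
assert np.allclose(h, mu + pos[None] + ctx[:, None] + resid)
```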

Balancing utility and cognitive cost in social representation

  • paper_url: http://arxiv.org/abs/2310.04852
  • repo_url: None
  • paper_authors: Max Taylor-Davies, Christopher G. Lucas
  • for: 该论文旨在研究如何为agent构建和维护其所处环境中其他agent的表示,以便更好地完成多个任务。
  • methods: 论文使用选择性依效为例任务,描述了代理在选择表示信息时的问题,并提出了两种资源受限制的社会表示方法。
  • results: 论文通过例子示出了如何在资源受限制的情况下选择合适的表示信息,以优化代理在下游任务中的性能。
    Abstract To successfully navigate its environment, an agent must construct and maintain representations of the other agents that it encounters. Such representations are useful for many tasks, but they are not without cost. As a result, agents must make decisions regarding how much information they choose to represent about the agents in their environment. Using selective imitation as an example task, we motivate the problem of finding agent representations that optimally trade off between downstream utility and information cost, and illustrate two example approaches to resource-constrained social representation.
    摘要 为了成功地 navigate 其环境,一个 Agent 需要构建和维护与其他 Agent 的表示。这些表示对于许多任务都是有用的,但它们不是无成本的。因此,Agent 需要决定对环境中其他 Agent 的哪些信息进行表示。以选择性模仿为示例任务,我们阐述了在下游效用与信息成本之间进行最优权衡的 Agent 表示问题,并展示了两种资源受限的社会表示方法。

Sub-linear Regret in Adaptive Model Predictive Control

  • paper_url: http://arxiv.org/abs/2310.04842
  • repo_url: None
  • paper_authors: Damianos Tranos, Alexandre Proutiere
  • for: 这个论文针对不确定的线性系统进行了适束型预测控制(MPC)。
  • methods: 这个算法使用了自适束管道(polytopic tubes)和确定性等价原理(certainty-equivalence principle),在线性系统中处理不确定性和状态和输入限制。
  • results: 这个算法可以确保状态和输入限制,并且有 recursive-feasibility 和渐进稳定性。对比于对系统动力学确知的oracle算法,这个算法的误差不超过 $O(T^{1/2 + \epsilon})$,其中 $\epsilon \in (0,1)$ 是设计参数,用于调整惊对性部分的algorithm。
    Abstract We consider the problem of adaptive Model Predictive Control (MPC) for uncertain linear-systems with additive disturbances and with state and input constraints. We present STT-MPC (Self-Tuning Tube-based Model Predictive Control), an online algorithm that combines the certainty-equivalence principle and polytopic tubes. Specifically, at any given step, STT-MPC infers the system dynamics using the Least Squares Estimator (LSE), and applies a controller obtained by solving an MPC problem using these estimates. The use of polytopic tubes is so that, despite the uncertainties, state and input constraints are satisfied, and recursive-feasibility and asymptotic stability hold. In this work, we analyze the regret of the algorithm, when compared to an oracle algorithm initially aware of the system dynamics. We establish that the expected regret of STT-MPC does not exceed $O(T^{1/2 + \epsilon})$, where $\epsilon \in (0,1)$ is a design parameter tuning the persistent excitation component of the algorithm. Our result relies on a recently proposed exponential decay of sensitivity property and, to the best of our knowledge, is the first of its kind in this setting. We illustrate the performance of our algorithm using a simple numerical example.
    摘要 我们考虑了适束预测控制(MPC)的问题,这是不确定线性系统中的噪音和外部干扰,并且受到状态和输入范围限制。我们提出了自适束管道基本预测控制(STT-MPC),这是一个在线上算法,它结合了必然等价原理和多topic管道。具体来说,在任何一步中,STT-MPC使用最小二乘估计器(LSE)估算系统动力学,并使用这些估值解决MPC问题。使用多topic管道的好处是,即使存在不确定性,状态和输入范围仍然满足,并且积累可行性和渐进稳定性持续。在这个工作中,我们分析了STT-MPC的幻悔,与一个对系统动力学有认识的oracle算法进行比较。我们证明,STT-MPC的预料 regret不超过$O(T^{1/2 + \epsilon})$,其中$\epsilon \in (0,1)$是一个设计参数,用于调整 persistentexcitation的部分。我们的结果基于最近提出的对敏感度快速衰减的性质,并且,至今为止,这是这个设定中的第一个相关结果。我们使用一个简单的数据示例来说明性能。
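The certainty-equivalence step of STT-MPC relies on a least-squares estimate of the dynamics from observed transitions; below is a minimal sketch of that step alone (without the polytopic tube construction) on a toy system with illustrative matrices.

```python
import numpy as np

def least_squares_dynamics(X, U, X_next):
    """Estimate A, B in x_{t+1} = A x_t + B u_t + w_t by least squares.
    X, U, X_next are stacked row-wise: shapes (T, n), (T, m), (T, n)."""
    Z = np.hstack([X, U])                          # regressors, (T, n + m)
    Theta, *_ = np.linalg.lstsq(Z, X_next, rcond=None)
    n = X.shape[1]
    return Theta[:n].T, Theta[n:].T                # A_hat, B_hat

# toy system with additive disturbances and a persistently exciting input
rng = np.random.default_rng(1)
A = np.array([[0.9, 0.2], [0.0, 0.8]]); B = np.array([[0.0], [1.0]])
X = [np.zeros(2)]; U = []
for _ in range(200):
    u = rng.normal(size=1)
    U.append(u)
    X.append(A @ X[-1] + B @ u + 0.01 * rng.normal(size=2))
X = np.array(X); U = np.array(U)
A_hat, B_hat = least_squares_dynamics(X[:-1], U, X[1:])
print(np.round(A_hat, 2), np.round(B_hat, 2))
```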

Federated Self-Supervised Learning of Monocular Depth Estimators for Autonomous Vehicles

  • paper_url: http://arxiv.org/abs/2310.04837
  • repo_url: None
  • paper_authors: Elton F. de S. Soares, Carlos Alberto V. Campos
  • for: Image-based depth estimation for autonomous vehicles in intelligent transportation systems.
  • methods: Federated learning and deep self-supervision.
  • results: Near state-of-the-art performance with a test loss below 0.13 and requiring, on average, only 1.5k training steps and up to 0.415 GB of weight data transfer per autonomous vehicle on each round.
    Abstract Image-based depth estimation has gained significant attention in recent research on computer vision for autonomous vehicles in intelligent transportation systems. This focus stems from its cost-effectiveness and wide range of potential applications. Unlike binocular depth estimation methods that require two fixed cameras, monocular depth estimation methods only rely on a single camera, making them highly versatile. While state-of-the-art approaches for this task leverage self-supervised learning of deep neural networks in conjunction with tasks like pose estimation and semantic segmentation, none of them have explored the combination of federated learning and self-supervision to train models using unlabeled and private data captured by autonomous vehicles. The utilization of federated learning offers notable benefits, including enhanced privacy protection, reduced network consumption, and improved resilience to connectivity issues. To address this gap, we propose FedSCDepth, a novel method that combines federated learning and deep self-supervision to enable the learning of monocular depth estimators with comparable effectiveness and superior efficiency compared to the current state-of-the-art methods. Our evaluation experiments conducted on Eigen's Split of the KITTI dataset demonstrate that our proposed method achieves near state-of-the-art performance, with a test loss below 0.13 and requiring, on average, only 1.5k training steps and up to 0.415 GB of weight data transfer per autonomous vehicle on each round.
    摘要 Image-based 深度估计在计算机视觉领域中获得了广泛关注,尤其是在自动驾驶系统中。这种关注的原因在于它的成本效益和广泛的应用前景。不同于使用两个固定摄像头的双目深度估计方法,单目深度估计方法只需要一个摄像头,这使得它们非常灵活。当前 state-of-the-art approaches for this task leverage self-supervised learning of deep neural networks in conjunction with tasks like pose estimation and semantic segmentation, but none of them have explored the combination of federated learning and self-supervision to train models using unlabeled and private data captured by autonomous vehicles. Federated learning offers notable benefits, including enhanced privacy protection, reduced network consumption, and improved resilience to connectivity issues. To address this gap, we propose FedSCDepth, a novel method that combines federated learning and deep self-supervision to enable the learning of monocular depth estimators with comparable effectiveness and superior efficiency compared to the current state-of-the-art methods. Our evaluation experiments conducted on Eigen's Split of the KITTI dataset demonstrate that our proposed method achieves near state-of-the-art performance, with a test loss below 0.13 and requiring, on average, only 1.5k training steps and up to 0.415 GB of weight data transfer per autonomous vehicle on each round.
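A minimal FedAvg-style aggregation sketch, included only to illustrate the kind of per-round weight exchange the 0.415 GB figure refers to; the paper's actual aggregation rule may differ.

```python
import torch

def fedavg(client_state_dicts, client_num_samples):
    """Weighted average of client model weights (plain FedAvg).
    Clients with more training samples contribute proportionally more."""
    total = float(sum(client_num_samples))
    avg = {}
    for k in client_state_dicts[0].keys():
        avg[k] = sum(sd[k] * (n / total)
                     for sd, n in zip(client_state_dicts, client_num_samples))
    return avg

# toy usage with two tiny "clients"
net = torch.nn.Linear(4, 1)
sd1 = {k: v.clone() for k, v in net.state_dict().items()}
sd2 = {k: v + 0.1 for k, v in net.state_dict().items()}
net.load_state_dict(fedavg([sd1, sd2], [100, 300]))
```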

Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM

  • paper_url: http://arxiv.org/abs/2310.04836
  • repo_url: None
  • paper_authors: Luoming Zhang, Wen Fei, Weijia Wu, Yefei He, Zhenyu Lou, Hong Zhou
  • for: This paper aims to improve the efficiency of large language models (LLMs) for real-world applications by introducing a novel quantization method called Dual Grained Quantization (DGQ).
  • methods: The DGQ method uses a two-phase grid search algorithm to determine the optimal quantization scales for both coarse-grained and fine-grained quantization, and it dequantizes the fine-grained INT4 weight into coarse-grained INT8 representation for efficient matrix multiplication.
  • results: The experimental results show that DGQ consistently outperforms prior methods across various LLM architectures and tasks, and achieves significant memory reduction and speed gains compared to the A16W4 implementation. Specifically, DGQ achieves $\textbf{1.12}$ $\times$ memory reduction and $\textbf{3.24}$ $\times$ speed gains.
    Abstract Large Language Models (LLMs) pose significant hardware challenges related to memory requirements and computational ability. There are two mainstream quantization schemes for LLMs: coarse-grained ($\textit{e.g.,}$ channel-wise) quantization and fine-grained ($\textit{e.g.,}$ group-wise) quantization. Fine-grained quantization has smaller quantization loss, consequently achieving superior performance. However, when applied to weight-activation quantization, it disrupts continuous integer matrix multiplication, leading to inefficient inference. In this paper, we introduce Dual Grained Quantization (DGQ), a novel A8W4 quantization for LLM that maintains superior performance while ensuring fast inference speed. DGQ dequantizes the fine-grained INT4 weight into a coarse-grained INT8 representation and performs matrix multiplication using INT8 kernels. Besides, we develop a two-phase grid search algorithm to simplify the determination of fine-grained and coarse-grained quantization scales. We also devise a percentile clipping schema for smoothing the activation outliers without the need for complex optimization techniques. Experimental results demonstrate that DGQ consistently outperforms prior methods across various LLM architectures and a wide range of tasks. Remarkably, with our implemented efficient CUTLASS kernel, we achieve $\textbf{1.12}$ $\times$ memory reduction and $\textbf{3.24}$ $\times$ speed gains compared with the A16W4 implementation. These advancements enable efficient deployment of A8W4 LLMs for real-world applications.
    摘要 大型语言模型(LLM)对于内存需求和计算能力带来严重的挑战。现有两种主流的 LLM 量化方案:粗糙化(channel-wise)量化和细糙化(group-wise)量化。细糙化量化具有较小的量化损失,因此可以达到更高的性能。然而,当应用到权重-激活量化时,它会破坏连续的整数矩阵乘法,导致推理效率低下。在这篇论文中,我们介绍了dual grained量化(DGQ),一种新的A8W4量化方法,可以在保持高性能的同时确保快速的推理。DGQ将细糙化INT4权重转换为粗糙化INT8表示,并使用INT8核心进行矩阵乘法。此外,我们开发了一个双阶段网格搜索算法来简化粗糙化和细糙化量化尺度的确定。我们还提出了一个百分位剪裁(percentile clipping)方案来平滑激活异常值,无需复杂的优化技术。实验结果显示,DGQ与先前的方法相比,在不同的LLM架构和各种任务上具有优秀的性能。特别是,通过我们实现的高效CUTLASS核心,实现了$\textbf{1.12}$ $\times$ 的内存减少和$\textbf{3.24}$ $\times$ 的速度提升,与A16W4实现相比。这些进展使A8W4 LLM能够高效部署于实际应用。
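A small sketch of the percentile-clipping idea for activation outliers combined with plain symmetric INT8 quantization; it does not reproduce the full A8W4 dual-grained kernel or the two-phase grid search, and the percentile value is an assumption.

```python
import numpy as np

def percentile_clip(x, pct=99.9):
    """Clip activation outliers at a percentile of |x| before quantization."""
    bound = np.percentile(np.abs(x), pct)
    return np.clip(x, -bound, bound), bound

def quantize_int8(x, scale):
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# toy activations with a few large outliers
rng = np.random.default_rng(0)
act = rng.normal(size=4096).astype(np.float32)
act[:4] *= 50.0
clipped, bound = percentile_clip(act, pct=99.9)
scale = bound / 127.0
q = quantize_int8(clipped, scale)
err = np.abs(dequantize(q, scale) - act).mean()
print(f"scale={scale:.4f}, mean abs error={err:.4f}")
```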

On the Evolution of Knowledge Graphs: A Survey and Perspective

  • paper_url: http://arxiv.org/abs/2310.04835
  • repo_url: None
  • paper_authors: Xuhui Jiang, Chengjin Xu, Yinghan Shen, Xun Sun, Lumingyuan Tang, Saizhuo Wang, Zhongwu Chen, Yuanzhuo Wang, Jian Guo
  • for: 本文提供了知识 graphs(KGs)的演化和知识EXTRACTION、理解以及表示技术的全面综述,以及不同类型的KGs在实际应用中的实践案例。
  • methods: 本文 introduce了不同类型的KGs(静止KGs、动态KGs、时间KGs和事件KGs)的技术和实践应用,以及知识EXTRACTION和理解的方法。
  • results: 本文提出了未来知识工程的前瞻之处,包括将知识 graphs和大型自然语言模型(LLMs)相结合的潜力,以及知识EXTRACTION、理解和表示的进一步发展。
    Abstract Knowledge graphs (KGs) are structured representations of diversified knowledge. They are widely used in various intelligent applications. In this article, we provide a comprehensive survey on the evolution of various types of knowledge graphs (i.e., static KGs, dynamic KGs, temporal KGs, and event KGs) and techniques for knowledge extraction and reasoning. Furthermore, we introduce the practical applications of different types of KGs, including a case study in financial analysis. Finally, we propose our perspective on the future directions of knowledge engineering, including the potential of combining the power of knowledge graphs and large language models (LLMs), and the evolution of knowledge extraction, reasoning, and representation.
    摘要 知识图(KG)是一种结构化表示多样化知识的工具。它在各种智能应用中广泛使用。本文提供了知识图的发展历程(静态KG、动态KG、时间KG和事件KG)和知识抽取和推理技术的总览。此外,我们还介绍了不同类型的KG的实际应用,以及一个金融分析的实例研究。最后,我们提出了未来知识工程的未来方向,包括结合知识图和大型自然语言模型(LLM)的潜力,以及知识抽取、推理和表示的进一步发展。

Rethink Baseline of Integrated Gradients from the Perspective of Shapley Value

  • paper_url: http://arxiv.org/abs/2310.04821
  • repo_url: None
  • paper_authors: Shuyang Liu, Zixuan Chen, Ge Shi, Ji Wang, Changjie Fan, Yu Xiong, Runze Wu Yujing Hu, Ze Ji, Yang Gao
  • for: 解释深度神经网络(DNN)预测结果的原因。
  • methods: 基于Aumann-Shapley Value的基准设计方法,包括新的Shapley Integrated Gradients(SIG)方法。
  • results: SIG方法可以更好地估计特征的贡献,提供更一致的解释,并适用于不同应用场景和数据类型。
    Abstract Numerous approaches have attempted to interpret deep neural networks (DNNs) by attributing the prediction of DNN to its input features. One of the well-studied attribution methods is Integrated Gradients (IG). Specifically, the choice of baselines for IG is a critical consideration for generating meaningful and unbiased explanations for model predictions in different scenarios. However, current practice of exploiting a single baseline fails to fulfill this ambition, thus demanding multiple baselines. Fortunately, the inherent connection between IG and Aumann-Shapley Value forms a unique perspective to rethink the design of baselines. Under certain hypothesis, we theoretically analyse that a set of baseline aligns with the coalitions in Shapley Value. Thus, we propose a novel baseline construction method called Shapley Integrated Gradients (SIG) that searches for a set of baselines by proportional sampling to partly simulate the computation path of Shapley Value. Simulations on GridWorld show that SIG approximates the proportion of Shapley Values. Furthermore, experiments conducted on various image tasks demonstrate that compared to IG using other baseline methods, SIG exhibits an improved estimation of feature's contribution, offers more consistent explanations across diverse applications, and is generic to distinct data types or instances with insignificant computational overhead.
    摘要 多种方法已经尝试解释深度神经网络(DNN)的预测,其中一种广泛研究的方法是集成梯度(IG)。specifically,选择基线是 kritical consideration for generating meaningful and unbiased explanations for model predictions in different scenarios。然而,现行的单个基线使用方式不能满足这个目标,因此需要多个基线。幸运的是,IG和AUmann-Shapley Value之间的内在连接形成了一个独特的视角,可以重新思考基线的设计。根据某些假设,我们理论分析表明,一组基eline可以与Shapley Value中的联盟相对应。因此,我们提出了一种新的基线建立方法called Shapley Integrated Gradients(SIG),该方法通过质量抽样来寻找一组基eline,以便 partly simulate Shapley Value的计算路径。在GridWorld上的 simulations中,我们发现SIG可以相似地 aproximate Shapley Value的分布。此外,在多个图像任务上进行的实验表明,相比IG使用其他基eline方法,SIG可以更好地评估特征的贡献,提供更一致的解释,并且对于不同的数据类型或实例来说具有无关的计算开销。
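A hedged sketch of Integrated Gradients averaged over a set of baselines, which is the multi-baseline computation SIG builds on; the coalition-inspired proportional sampling of baselines is not reproduced here, and the model, baselines, and step count are illustrative.

```python
import torch

def integrated_gradients(model, x, baseline, target, steps=64):
    """Standard IG for a single input x against one baseline."""
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * x.dim()))
    path = baseline + alphas * (x - baseline)          # (steps, *x.shape)
    path.requires_grad_(True)
    out = model(path)[:, target].sum()
    grads = torch.autograd.grad(out, path)[0]
    return (x - baseline) * grads.mean(dim=0)          # Riemann approximation

def multi_baseline_ig(model, x, baselines, target, steps=64):
    """Average IG attributions over a set of baselines (a simplification of
    the coalition-style baseline sets used by SIG)."""
    attrs = [integrated_gradients(model, x, b, target, steps) for b in baselines]
    return torch.stack(attrs).mean(dim=0)

# toy usage with a linear "model"
model = torch.nn.Linear(5, 3)
x = torch.randn(5)
baselines = [torch.zeros(5), torch.randn(5) * 0.1]
print(multi_baseline_ig(model, x, baselines, target=1))
```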

Hacking Generative Models with Differentiable Network Bending

  • paper_url: http://arxiv.org/abs/2310.04816
  • repo_url: None
  • paper_authors: Giacomo Aldegheri, Alina Rogalska, Ahmed Youssef, Eugenia Iofinova
  • for: 这篇论文是为了探讨如何“黑客”生成模型,让其输出趋离原始训练分布向新的目标。
  • methods: 这篇论文使用了一种小规模可训练的模块,在生成模型中间层插入并在一些较低的迭代数上训练,保持其余的网络冻结不动。
  • results: 该方法可以生成具有怪异质量的输出图像,即由原始和新目标之间的矛盾带来的艺术效果。
    Abstract In this work, we propose a method to 'hack' generative models, pushing their outputs away from the original training distribution towards a new objective. We inject a small-scale trainable module between the intermediate layers of the model and train it for a low number of iterations, keeping the rest of the network frozen. The resulting output images display an uncanny quality, given by the tension between the original and new objectives that can be exploited for artistic purposes.
    摘要 在这个研究中,我们提出了一种方法,用于“黑客”生成模型,使其输出偏离原始训练分布向新的目标。我们在模型中插入一个小规模可训练的模块,并在几个迭代后冻结整个网络。结果的输出图像具有怪异的质量,它由原始和新的目标之间的紧张关系带来,可以用于艺术目的。
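A PyTorch-style sketch of the described setup: a small residual module is inserted between two frozen halves of a pretrained generator and trained alone for a few iterations toward a new objective. The module design, names, and training loop are illustrative, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class Bend(nn.Module):
    """Small trainable module injected between intermediate generator layers."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.block(x)        # residual, so it starts near identity

def hack_generator(front, back, bend, new_objective, z_sampler, iters=200, lr=1e-3):
    """Freeze the generator halves and train only the inserted module toward a
    new objective; `new_objective` maps generated images to a scalar loss."""
    for p in list(front.parameters()) + list(back.parameters()):
        p.requires_grad_(False)
    opt = torch.optim.Adam(bend.parameters(), lr=lr)
    for _ in range(iters):
        z = z_sampler()
        img = back(bend(front(z)))      # frozen -> bent -> frozen
        loss = new_objective(img)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return bend
```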

User’s Position-Dependent Strategies in Consumer-Generated Media with Monetary Rewards

  • paper_url: http://arxiv.org/abs/2310.04805
  • repo_url: None
  • paper_authors: Shintaro Ueki, Fujio Toriumi, Toshiharu Sugawara
  • for: This paper aims to help content-sharing platform designers create more effective monetary reward schemes to incentivize user participation and improve content quality.
  • methods: The authors propose a model that integrates monetary reward schemes into the Social Networking Services (SNS) norms game, and experimentally investigate the impact of different reward schemes on user behavior and content quality.
  • results: The authors find that different monetary reward schemes have distinct effects on user proactivity and content quality, and that these effects depend on the user’s position in the CGM network. Their findings can help platform designers create more effective reward schemes to improve user engagement and content quality.
    Abstract Numerous forms of consumer-generated media (CGM), such as social networking services (SNS), are widely used. Their success relies on users' voluntary participation, often driven by psychological rewards like recognition and connection from reactions by other users. Furthermore, a few CGM platforms offer monetary rewards to users, serving as incentives for sharing items such as articles, images, and videos. However, users have varying preferences for monetary and psychological rewards, and the impact of monetary rewards on user behaviors and the quality of the content they post remains unclear. Hence, we propose a model that integrates some monetary reward schemes into the SNS-norms game, which is an abstraction of CGM. Subsequently, we investigate the effect of each monetary reward scheme on individual agents (users), particularly in terms of their proactivity in posting items and their quality, depending on agents' positions in a CGM network. Our experimental results suggest that these factors distinctly affect the number of postings and their quality. We believe that our findings will help CGM platformers in designing better monetary reward schemes.
    摘要 众多的消费者生成内容(CGM),如社交媒体服务(SNS),广泛使用。它们的成功取决于用户的自愿参与,通常由其他用户的反应所驱动,如认可和连接。此外,一些CGM平台还提供金钱奖励给用户,作为启发共享文章、图片和视频的行为的激励。然而,用户对金钱和心理奖励的偏好不同,以及奖励对用户行为和文章质量的影响仍然不清楚。因此,我们提出一个 integrate some monetary reward schemes into the SNS-norms game的模型,并 investigate the effect of each monetary reward scheme on individual agents (users),特别是在CGM网络中 agents的位置。我们的实验结果表明,这些因素明显地影响了用户的投稿数量和质量。我们认为,我们的发现将有助于CGM平台的设计。

Ten Challenges in Industrial Recommender Systems

  • paper_url: http://arxiv.org/abs/2310.04804
  • repo_url: None
  • paper_authors: Zhenhua Dong, Jieming Zhu, Weiwen Liu, Ruiming Tang
  • for: 本研讨会讲解了十个有趣和重要的推荐系统挑战,以帮助RecSys社区创造更好的推荐系统。
  • methods: 文章介绍了一些适用于推荐系统的技术趋势,包括深度和复杂的模型,如神经网络和预训练语言模型。
  • results: 文章描述了在实际应用中遇到的一些困难和挑战,以帮助RecSys社区更好地解决这些问题。
    Abstract Huawei's vision and mission is to build a fully connected intelligent world. Since 2013, Huawei Noah's Ark Lab has helped many products build recommender systems and search engines for getting the right information to the right users. Every day, our recommender systems serve hundreds of millions of mobile phone users and recommend different kinds of content and services such as apps, news feeds, songs, videos, books, themes, and instant services. The big data and various scenarios provide us with great opportunities to develop advanced recommendation technologies. Furthermore, we have witnessed the technical trend of recommendation models in the past ten years, from the shallow and simple models like collaborative filtering, linear models, low rank models to deep and complex models like neural networks, pre-trained language models. Based on the mission, opportunities and technological trends, we have also met several hard problems in our recommender systems. In this talk, we will share ten important and interesting challenges and hope that the RecSys community can get inspired and create better recommender systems.

HNS: An Efficient Hermite Neural Solver for Solving Time-Fractional Partial Differential Equations

  • paper_url: http://arxiv.org/abs/2310.04789
  • repo_url: https://github.com/hsbhc/hns
  • paper_authors: Jie Hou, Zhiying Ma, Shihui Ying, Ying Li
  • for: 解决时间分数导函数方程 equations using deep learning techniques
  • methods: 使用 Hermite interpolation techniques 和 deep neural networks
  • results: 实验结果显示 HNS 的精度比 L1 方法高,并且在高维enario中也有显著改善。
    Abstract Neural network solvers represent an innovative and promising approach for tackling time-fractional partial differential equations by utilizing deep learning techniques. L1 interpolation approximation serves as the standard method for addressing time-fractional derivatives within neural network solvers. However, we have discovered that neural network solvers based on L1 interpolation approximation are unable to fully exploit the benefits of neural networks, and the accuracy of these models is constrained to interpolation errors. In this paper, we present the high-precision Hermite Neural Solver (HNS) for solving time-fractional partial differential equations. Specifically, we first construct a high-order explicit approximation scheme for fractional derivatives using Hermite interpolation techniques, and rigorously analyze its approximation accuracy. Afterward, taking into account the infinitely differentiable properties of deep neural networks, we integrate the high-order Hermite interpolation explicit approximation scheme with deep neural networks to propose the HNS. The experimental results show that HNS achieves higher accuracy than methods based on the L1 scheme for both forward and inverse problems, as well as in high-dimensional scenarios. This indicates that HNS has significantly improved accuracy and flexibility compared to existing L1-based methods, and has overcome the limitations of explicit finite difference approximation methods that are often constrained to function value interpolation. As a result, the HNS is not a simple combination of numerical computing methods and neural networks, but rather achieves a complementary and mutually reinforcing advantages of both approaches. The data and code can be found at \url{https://github.com/hsbhc/HNS}.
    摘要 神经网络解决方法代表了一种创新和有前途的方法,用于解决时间分辨率部分弗散方程。L1 interpolating approximation是解决时间分辨率 Derivatives的标准方法之一,但我们发现,基于L1 interpolating approximation的神经网络解决方法无法完全利用神经网络的优势,并且模型的准确性受到 interpolating error 的限制。在这篇论文中,我们提出了高精度希尔比特神经网络解决方法(HNS),用于解决时间分辨率部分弗散方程。我们首先构建了高阶显式approximation scheme for fractional derivatives,并且仔细分析了其 Approximation 精度。接着,我们将高阶希尔比特 interpolating scheme与深度神经网络结合,提出了HNS。实验结果表明,HNS在前向和反向问题中,以及高维场景下都具有更高的准确性,比基于L1 scheme的方法更高。这表明,HNS在准确性和灵活性方面有所提高,并且超越了传统的显式差分方法,这些方法通常受到函数值 interpolating 的限制。因此,HNS不仅是一种简单的数字计算方法和神经网络的组合,而是实现了两种方法之间的共轨和互补优势。数据和代码可以在 \url{https://github.com/hsbhc/HNS} 找到。
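For reference, the L1 scheme that this paper (and PMNN below) compares against approximates the Caputo derivative of order alpha in (0, 1) with the standard coefficients b_j = (j+1)^(1-alpha) - j^(1-alpha); a minimal sketch with a toy accuracy check:

```python
import numpy as np
from math import gamma

def caputo_l1(u, dt, alpha):
    """L1 approximation of the Caputo derivative of order alpha in (0, 1)
    at the last grid point, given samples u[0..n] on a uniform grid."""
    n = len(u) - 1
    j = np.arange(n)
    b = (j + 1) ** (1 - alpha) - j ** (1 - alpha)
    diffs = np.array([u[n - k] - u[n - k - 1] for k in range(n)])
    return dt ** (-alpha) / gamma(2 - alpha) * np.sum(b * diffs)

# check against the exact Caputo derivative of u(t) = t^2:
# D^alpha t^2 = 2 t^(2 - alpha) / Gamma(3 - alpha)
alpha, dt, N = 0.5, 1e-3, 1000
t = np.arange(N + 1) * dt
approx = caputo_l1(t ** 2, dt, alpha)
exact = 2 * t[-1] ** (2 - alpha) / gamma(3 - alpha)
print(approx, exact)
```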

PMNN:Physical Model-driven Neural Network for solving time-fractional differential equations

  • paper_url: http://arxiv.org/abs/2310.04788
  • repo_url: None
  • paper_authors: Zhiying Ma, Jie Hou, Wenhao Zhu, Yaxin Peng, Ying Li
  • for: 解决时间扩展弗拉克达尔方程(Time-fractional differential equations)
  • methods: Physical Model-driven Neural Network(PMNN)方法,结合深度神经网络(DNNs)和插值拟合方法
  • results: 通过训练DNNs来学习时间迭代方案,实现了精度高且效率高的时间扩展弗拉克达尔方程解。
    Abstract In this paper, an innovative Physical Model-driven Neural Network (PMNN) method is proposed to solve time-fractional differential equations. It establishes a temporal iteration scheme based on physical model-driven neural networks which effectively combines deep neural networks (DNNs) with interpolation approximation of fractional derivatives. Specifically, once the fractional differential operator is discretized, DNNs are employed as a bridge to integrate interpolation approximation techniques with differential equations. On the basis of this integration, we construct a neural-based iteration scheme. Subsequently, by training DNNs to learn this temporal iteration scheme, approximate solutions to the differential equations can be obtained. The proposed method aims to preserve the intrinsic physical information within the equations as far as possible. It fully utilizes the powerful fitting capability of neural networks while maintaining the efficiency of the difference schemes for fractional differential equations. Moreover, we validate the efficiency and accuracy of PMNN through several numerical experiments.
    摘要 在这篇论文中,我们提出了一种创新的物理模型驱动神经网络(PMNN)方法,用于求解时间分数阶微分方程。它建立了一种基于物理模型驱动神经网络的时间迭代方案,有效地将深度神经网络(DNN)与分数阶导数的插值逼近相结合。具体来说,一旦分数阶微分算子被离散化,DNN 便作为桥梁,将插值逼近技术与微分方程结合起来。基于这一结合,我们构建了一个基于神经网络的迭代方案。随后,通过训练 DNN 学习该时间迭代方案,即可获得微分方程的近似解。所提方法尽可能保留方程中固有的物理信息,既充分利用神经网络强大的拟合能力,又保持分数阶微分方程差分格式的效率。此外,我们通过多个数值实验验证了 PMNN 的效率和准确性。

Optimal Sequential Decision-Making in Geosteering: A Reinforcement Learning Approach

  • paper_url: http://arxiv.org/abs/2310.04772
  • repo_url: None
  • paper_authors: Ressi Bonti Muhammad, Sergey Alyaev, Reidar Brumer Bratvold
  • for: 提高钻掘过程中的地层规划决策(geosteering)的效率和准确性。
  • methods: 使用深度Q网络(DQN)方法,一种无模型学习(RL)方法,直接从决策环境学习地层规划决策。
  • results: 在两个synthetic geosteering场景中,RL方法可以达到与 quasi-optimal Approximate Dynamic Programming(ADP)相当的高质量结果,而且比传统方法快速得多。此外,由于RL方法是无模型的,因此可以在更复杂的环境中应用,并且可以在未来与实际数据进行混合训练。
    Abstract Trajectory adjustment decisions throughout the drilling process, called geosteering, affect subsequent choices and information gathering, thus resulting in a coupled sequential decision problem. Previous works on applying decision optimization methods in geosteering rely on greedy optimization or Approximate Dynamic Programming (ADP). Either decision optimization method requires explicit uncertainty and objective function models, making developing decision optimization methods for complex and realistic geosteering environments challenging to impossible. We use the Deep Q-Network (DQN) method, a model-free reinforcement learning (RL) method that learns directly from the decision environment, to optimize geosteering decisions. The expensive computations for RL are handled during the offline training stage. Evaluating DQN needed for real-time decision support takes milliseconds and is faster than the traditional alternatives. Moreover, for two previously published synthetic geosteering scenarios, our results show that RL achieves high-quality outcomes comparable to the quasi-optimal ADP. Yet, the model-free nature of RL means that by replacing the training environment, we can extend it to problems where the solution to ADP is prohibitively expensive to compute. This flexibility will allow applying it to more complex environments and make hybrid versions trained with real data in the future.
    摘要 钻井过程中的轨迹调整决策(称为 geosteering)会影响后续的选择和信息收集,因此构成一个耦合的序贯决策问题。以往将决策优化方法应用于 geosteering 的工作依赖于贪婪优化或 Approximate Dynamic Programming (ADP)。这两类决策优化方法都需要显式的不确定性模型和目标函数模型,因此在复杂且贴近实际的 geosteering 环境中开发决策优化方法非常困难甚至不可能。我们使用 Deep Q-Network (DQN) 方法,一种直接从决策环境中学习的 model-free reinforcement learning (RL) 方法,来优化 geosteering 决策。RL 的昂贵计算在离线训练阶段完成;用于实时决策支持的 DQN 评估只需毫秒级时间,快于传统方案。此外,在两个已发表的 synthetic geosteering 场景中,我们的结果表明 RL 能取得与 quasi-optimal ADP 相当的高质量结果。同时,RL 的 model-free 特性意味着只需更换训练环境,就可以将其扩展到 ADP 求解代价过高的问题。这种灵活性使其能够应用于更复杂的环境,并在未来结合真实数据训练混合版本。

Pairwise GUI Dataset Construction Between Android Phones and Tablets

  • paper_url: http://arxiv.org/abs/2310.04755
  • repo_url: https://github.com/huhangithub/papt
  • paper_authors: Han Hu, Haolan Zhan, Yujin Huang, Di Liu
  • for: 这个论文旨在提高开发者的产效,通过自动化 GUI 开发来减少开发成本和遗漏。
  • methods: 这篇论文使用了深度学习技术,并提出了一种新的对应式 GUI 收集方法,以生成 Android 手机和平板电脑之间的对应 GUI 数据集。
  • results: 经过初步实验,论文发现当前使用深度学习自动化 GUI 开发时存在一些挑战,需要进一步的研究和优化。
    Abstract In the current landscape of pervasive smartphones and tablets, apps frequently exist across both platforms. Although apps share most graphic user interfaces (GUIs) and functionalities across phones and tablets, developers often rebuild from scratch for tablet versions, escalating costs and squandering existing design resources. Researchers are attempting to collect data and employ deep learning in automated GUIs development to enhance developers' productivity. There are currently several publicly accessible GUI page datasets for phones, but none for pairwise GUIs between phones and tablets. This poses a significant barrier to the employment of deep learning in automated GUI development. In this paper, we introduce the Papt dataset, a pioneering pairwise GUI dataset tailored for Android phones and tablets, encompassing 10,035 phone-tablet GUI page pairs sourced from 5,593 unique app pairs. We propose novel pairwise GUI collection approaches for constructing this dataset and delineate its advantages over currently prevailing datasets in the field. Through preliminary experiments on this dataset, we analyze the present challenges of utilizing deep learning in automated GUI development.
    摘要 在现有的智能手机和平板电脑普及的场景下,许多应用程序frequently across both platforms exist。 although apps share most graphic user interfaces (GUIs) and functionalities across phones and tablets, developers often rebuild from scratch for tablet versions, which escalates costs and wastes existing design resources. Researchers are attempting to collect data and employ deep learning in automated GUI development to enhance developers' productivity. Currently, there are several publicly accessible GUI page datasets for phones, but none for pairwise GUIs between phones and tablets. This poses a significant barrier to the employment of deep learning in automated GUI development. In this paper, we introduce the Papt dataset, a pioneering pairwise GUI dataset tailored for Android phones and tablets, encompassing 10,035 phone-tablet GUI page pairs sourced from 5,593 unique app pairs. We propose novel pairwise GUI collection approaches for constructing this dataset and delineate its advantages over currently prevailing datasets in the field. Through preliminary experiments on this dataset, we analyze the present challenges of utilizing deep learning in automated GUI development.

A Unified Generalization Analysis of Re-Weighting and Logit-Adjustment for Imbalanced Learning

  • paper_url: http://arxiv.org/abs/2310.04752
  • repo_url: https://github.com/wang22ti/DDC
  • paper_authors: Zitai Wang, Qianqian Xu, Zhiyong Yang, Yuan He, Xiaochun Cao, Qingming Huang
  • for: 减轻类别偏好问题
  • methods: 修改损失函数,例如对损失重新加权,或通过与类别相关的项调整 logits
  • results: 提出了一种数据依存收缩技术,并建立了一个细化的泛化 bound,可以帮助解释重新权重和logit调整的实际结果。
    Abstract Real-world datasets are typically imbalanced in the sense that only a few classes have numerous samples, while many classes are associated with only a few samples. As a result, a na\"ive ERM learning process will be biased towards the majority classes, making it difficult to generalize to the minority classes. To address this issue, one simple but effective approach is to modify the loss function to emphasize the learning on minority classes, such as re-weighting the losses or adjusting the logits via class-dependent terms. However, existing generalization analysis of such losses is still coarse-grained and fragmented, failing to explain some empirical results. To bridge this gap, we propose a novel technique named data-dependent contraction to capture how these modified losses handle different classes. On top of this technique, a fine-grained generalization bound is established for imbalanced learning, which helps reveal the mystery of re-weighting and logit-adjustment in a unified manner. Furthermore, a principled learning algorithm is developed based on the theoretical insights. Finally, the empirical results on benchmark datasets not only validate the theoretical results but also demonstrate the effectiveness of the proposed method.
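For concreteness, the snippet below shows one common "class-dependent term" loss that the abstract refers to: logit adjustment with log class priors. It is an illustration of the kind of modified loss the paper analyzes, not the paper's data-dependent contraction technique itself; the class counts and temperature are assumptions.

```python
# Illustrative logit-adjusted cross-entropy for long-tailed classification.
import torch
import torch.nn.functional as F

def logit_adjusted_ce(logits, targets, class_counts, tau=1.0):
    """Shift each logit by tau * log(prior) so rare classes are emphasized."""
    priors = class_counts / class_counts.sum()
    adjusted = logits + tau * torch.log(priors).to(logits.device)
    return F.cross_entropy(adjusted, targets)

# toy usage with a made-up long-tailed class distribution
counts = torch.tensor([1000.0, 100.0, 10.0])
logits = torch.randn(8, 3)
labels = torch.randint(0, 3, (8,))
loss = logit_adjusted_ce(logits, labels, counts)
```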

DiffNAS: Bootstrapping Diffusion Models by Prompting for Better Architectures

  • paper_url: http://arxiv.org/abs/2310.04750
  • repo_url: None
  • paper_authors: Wenhao Li, Xiu Su, Shan You, Fei Wang, Chen Qian, Chang Xu
  • for: This paper focuses on improving the efficiency and performance of diffusion models for image synthesis.
  • methods: The authors propose a base model search approach called “DiffNAS,” which leverages GPT-4 as a supernet and employs a search memory to enhance the results. They also use RFID as a proxy to quickly rank the experimental outcomes produced by GPT-4.
  • results: The authors’ algorithm can augment the search efficiency by 2 times under GPT-based scenarios and achieve a performance of 2.82 with 0.37 improvement in FID on CIFAR10 relative to the benchmark IDDPM algorithm.
    Abstract Diffusion models have recently exhibited remarkable performance on synthetic data. After a diffusion path is selected, a base model, such as UNet, operates as a denoising autoencoder, primarily predicting noises that need to be eliminated step by step. Consequently, it is crucial to employ a model that aligns with the expected budgets to facilitate superior synthetic performance. In this paper, we meticulously analyze the diffusion model and engineer a base model search approach, denoted "DiffNAS". Specifically, we leverage GPT-4 as a supernet to expedite the search, supplemented with a search memory to enhance the results. Moreover, we employ RFID as a proxy to promptly rank the experimental outcomes produced by GPT-4. We also adopt a rapid-convergence training strategy to boost search efficiency. Rigorous experimentation corroborates that our algorithm can augment the search efficiency by 2 times under GPT-based scenarios, while also attaining a performance of 2.82 with 0.37 improvement in FID on CIFAR10 relative to the benchmark IDDPM algorithm.
    摘要 扩散模型最近在数据合成上表现出色。选定扩散路径后,基础模型(如 UNet)充当去噪自编码器,逐步预测需要去除的噪声。因此,选择一个符合预期算力预算的基础模型,对获得更好的合成效果至关重要。在这篇论文中,我们仔细分析了扩散模型,并设计了一种基础模型搜索方法,称为 "DiffNAS"。具体来说,我们利用 GPT-4 作为超网以加速搜索,并辅以搜索记忆来提升结果;此外,我们采用 RFID 作为代理指标,以快速排序 GPT-4 产生的实验结果,并采用快速收敛的训练策略提高搜索效率。严格的实验表明,我们的算法可以在基于 GPT 的场景下将搜索效率提升两倍,同时在 CIFAR10 上相对基准 IDDPM 算法取得 2.82 的 FID,提升 0.37。

ConvNeXtv2 Fusion with Mask R-CNN for Automatic Region Based Coronary Artery Stenosis Detection for Disease Diagnosis

  • paper_url: http://arxiv.org/abs/2310.04749
  • repo_url: None
  • paper_authors: Sandesh Pokhrel, Sanjay Bhandari, Eduard Vazquez, Yash Raj Shrestha, Binod Bhattarai
  • for: automating the manual detection of stenotic lesions in coronary arteries
  • methods: employing a specialized Convnext-V2 backbone based Mask RCNN model pre-trained for instance segmentation tasks
  • results: achieving a substantial F1 score of 0.5353 in identifying stenotic lesions
    Abstract Coronary Artery Diseases although preventable are one of the leading cause of mortality worldwide. Due to the onerous nature of diagnosis, tackling CADs has proved challenging. This study addresses the automation of resource-intensive and time-consuming process of manually detecting stenotic lesions in coronary arteries in X-ray coronary angiography images. To overcome this challenge, we employ a specialized Convnext-V2 backbone based Mask RCNN model pre-trained for instance segmentation tasks. Our empirical findings affirm that the proposed model exhibits commendable performance in identifying stenotic lesions. Notably, our approach achieves a substantial F1 score of 0.5353 in this demanding task, underscoring its effectiveness in streamlining this intensive process.
    摘要 冠状动脉疾病虽可预防,却仍是全球主要死亡原因之一。由于诊断过程繁琐,应对冠状动脉疾病颇具挑战。本研究旨在将 X 射线冠状动脉造影图像中狭窄病变的人工识别过程自动化。为此,我们采用了以 ConvNeXt-V2 为骨干、针对实例分割任务预训练的 Mask R-CNN 模型。实验结果表明,所提模型在识别狭窄病变方面表现良好,F1 分数达到 0.5353,证明了该方法在简化这一繁重流程上的有效性。
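As a sketch of the model family involved, the snippet below adapts torchvision's Mask R-CNN to a single "stenotic lesion" class. Note the paper pairs Mask R-CNN with a ConvNeXt-V2 backbone; the stock ResNet-50 FPN backbone is used here only to keep the example self-contained.

```python
# Adapt a torchvision Mask R-CNN head for one foreground class (stenosis).
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

num_classes = 2  # background + stenotic lesion

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")

# Replace the box classification head for the new label set.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# Replace the mask head as well.
in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask, 256, num_classes)
```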

Towards Dynamic and Small Objects Refinement for Unsupervised Domain Adaptative Nighttime Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2310.04747
  • repo_url: None
  • paper_authors: Jingyi Pan, Sihang Li, Yucheng Chen, Jinjing Zhu, Lin Wang
  • for: 这篇论文的目的是提出一种新的夜间语义分割无监督领域自适应方法,以应对夜间低光照环境带来的分割难题。
  • methods: 本方法使用了一个动态和小物件增强模块,将来自源领域的知识传递到目标夜间领域,并使用了一个对比学习模块以缓和领域差异。
  • results: 实验结果显示,本方法可以与先前的方法相比,大幅提高夜间Semantic segmentation的精度。
    Abstract Nighttime semantic segmentation is essential for various applications, e.g., autonomous driving, which often faces challenges due to poor illumination and the lack of well-annotated datasets. Unsupervised domain adaptation (UDA) has shown potential for addressing the challenges and achieved remarkable results for nighttime semantic segmentation. However, existing methods still face limitations in 1) their reliance on style transfer or relighting models, which struggle to generalize to complex nighttime environments, and 2) their ignorance of dynamic and small objects like vehicles and traffic signs, which are difficult to be directly learned from other domains. This paper proposes a novel UDA method that refines both label and feature levels for dynamic and small objects for nighttime semantic segmentation. First, we propose a dynamic and small object refinement module to complement the knowledge of dynamic and small objects from the source domain to target nighttime domain. These dynamic and small objects are normally context-inconsistent in under-exposed conditions. Then, we design a feature prototype alignment module to reduce the domain gap by deploying contrastive learning between features and prototypes of the same class from different domains, while re-weighting the categories of dynamic and small objects. Extensive experiments on four benchmark datasets demonstrate that our method outperforms prior arts by a large margin for nighttime segmentation. Project page: https://rorisis.github.io/DSRNSS/.

Task Aware Modulation using Representation Learning: An Approach for Few Shot Learning in Heterogeneous Systems

  • paper_url: http://arxiv.org/abs/2310.04727
  • repo_url: None
  • paper_authors: Arvind Renganathan, Rahul Ghosh, Ankush Khandelwal, Vipin Kumar
  • for: 提高个性化预测性能在少量示例设定下,特别是在不知道任务特征时
  • methods: 使用表示学习框架(TAM-RL),提取实际精度表示任务特征,进行个性化预测
  • results: 使用真实的水文与通量塔基准数据集,TAM-RL 可以显著超越 MAML 和 MMAML 等基准方法,同时由于复杂度更低,训练更快、更简单,无需 MAML/MMAML 中对收敛至关重要的内循环步数与内循环学习率等敏感超参数;此外还通过合成数据进行了实证评估,表明当可以为不同任务学习出相异表示时,TAM-RL 能显著提升预测性能。
    Abstract We present a Task-aware modulation using Representation Learning (TAM-RL) framework that enhances personalized predictions in few-shot settings for heterogeneous systems when individual task characteristics are not known. TAM-RL extracts embeddings representing the actual inherent characteristics of these entities and uses these characteristics to personalize the predictions for each entity/task. Using real-world hydrological and flux tower benchmark data sets, we show that TAM-RL can significantly outperform existing baseline approaches such as MAML and multi-modal MAML (MMAML) while being much faster and simpler to train due to less complexity. Specifically, TAM-RL eliminates the need for sensitive hyper-parameters like inner loop steps and inner loop learning rate, which are crucial for model convergence in MAML, MMAML. We further present an empirical evaluation via synthetic data to explore the impact of heterogeneity amongst the entities on the relative performance of MAML, MMAML, and TAM-RL. We show that TAM-RL significantly improves predictive performance for cases where it is possible to learn distinct representations for different tasks.
    摘要 我们提出了一个基于表示学习的任务感知调制(TAM-RL)框架,在个体任务特征未知的少样本场景下提升异质系统的个性化预测性能。TAM-RL 提取反映实体真实内在特征的嵌入表示,并利用这些特征对每个实体/任务进行个性化预测。在真实的水文与通量塔基准数据集上,TAM-RL 明显优于 MAML 和多模态 MAML(MMAML)等现有基线方法,并且由于复杂度更低,训练更快也更简单。具体来说,TAM-RL 无需内循环步数和内循环学习率这类敏感超参数,而这些参数对 MAML 和 MMAML 的收敛至关重要。我们还通过合成数据进行实证评估,探索实体之间的异质性对 MAML、MMAML 和 TAM-RL 相对性能的影响,并发现当能够为不同任务学习出相异表示时,TAM-RL 会显著提高预测性能。

A Holistic Evaluation of Piano Sound Quality

  • paper_url: http://arxiv.org/abs/2310.04722
  • repo_url: None
  • paper_authors: Monan Zhou, Shangda Wu, Shaohua Ji, Zijin Li, Wei Li
  • for: 这个论文的目的是开发一种全面评估方法,帮助用户在购买钢琴时更好地评估音色质量。
  • methods: 这个研究基于主观问卷构建音质评估体系,并使用卷积神经网络(CNN)进行分类。为了提高模型的可解释性,研究人员采用等效矩形带宽(ERB)分析。
  • results: 研究发现,受过音乐训练的人更能分辨不同钢琴的音色差异。经微调后最佳的 CNN 预训练骨干作为钢琴分类器达到了 98.3% 的高准确率。然而,数据集规模有限,音频被切片以扩充数量,导致数据不平衡且多样性不足,因此使用 focal loss 来减轻数据不平衡的影响。
    Abstract This paper aims to develop a holistic evaluation method for piano sound quality to assist in purchasing decisions. Unlike previous studies that focused on the effect of piano performance techniques on sound quality, this study evaluates the inherent sound quality of different pianos. To derive quality evaluation systems, the study uses subjective questionnaires based on a piano sound quality dataset. The method selects the optimal piano classification models by comparing the fine-tuning results of different pre-training models of Convolutional Neural Networks (CNN). To improve the interpretability of the models, the study applies Equivalent Rectangular Bandwidth (ERB) analysis. The results reveal that musically trained individuals are better able to distinguish between the sound quality differences of different pianos. The best fine-tuned CNN pre-trained backbone achieves a high accuracy of 98.3\% as the piano classifier. However, the dataset is limited, and the audio is sliced to increase its quantity, resulting in a lack of diversity and balance, so we use focal loss to reduce the impact of data imbalance. To optimize the method, the dataset will be expanded, or few-shot learning techniques will be employed in future research.
    摘要 这篇论文旨在开发一种全面的钢琴音质评价方法,帮助人们在购买钢琴时做出更明智的决定。与以往专注于演奏技巧对音质影响的研究不同,本研究评估的是不同钢琴本身的内在音质。为建立音质评价体系,研究基于钢琴音质数据集设计了主观问卷;随后通过比较不同预训练卷积神经网络(CNN)骨干的微调结果来选择最佳的钢琴分类模型,并采用等效矩形带宽(ERB)分析以提高模型的可解释性。结果显示,受过音乐训练的人更能分辨不同钢琴的音质差异;微调后最佳的 CNN 预训练骨干作为钢琴分类器达到 98.3% 的准确率。然而,数据集规模有限,音频被切片以扩充数量,导致数据不平衡且缺乏多样性,因此我们使用 focal loss 来减轻数据不平衡的影响。为进一步优化方法,后续研究将扩充数据集,或采用少样本学习技术。
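The focal loss mentioned above is a standard re-weighting of cross-entropy that down-weights easy examples; a minimal multi-class version is sketched below. The gamma value is a typical default, not necessarily the one used in the paper.

```python
# Standard multi-class focal loss used to soften class imbalance.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=None):
    """Cross-entropy scaled by (1 - p_t)^gamma; alpha optionally weights classes."""
    log_p = F.log_softmax(logits, dim=-1)
    ce = F.nll_loss(log_p, targets, weight=alpha, reduction="none")
    p_t = log_p.gather(1, targets.unsqueeze(1)).squeeze(1).exp()
    return ((1 - p_t) ** gamma * ce).mean()

# toy usage
loss = focal_loss(torch.randn(4, 10), torch.randint(0, 10, (4,)))
```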

EdgeFD: An Edge-Friendly Drift-Aware Fault Diagnosis System for Industrial IoT

  • paper_url: http://arxiv.org/abs/2310.04704
  • repo_url: None
  • paper_authors: Chen Jiao, Mao Fengjian, Lv Zuohong, Tang Jianhua
  • for: 这篇论文针对工业智能故障诊断(FD)领域的迁移学习(TL)方法进行研究,以解决数据漂移问题。
  • methods: 该论文提出了一种名为“漂移感知权重固化”(Drift-Aware Weight Consolidation, DAWC)的方法,用于在边缘设备上进行快速且有效的故障诊断。DAWC 通过检测漂移并逐步增强模型的泛化能力来应对频繁的数据漂移问题。
  • results: 实验结果表明,相比于现有的技术,该论文提出的DAWC方法能够达到更高的性能水平,同时也遵循边缘计算限制。此外,该论文还开发了一个完整的诊断和可视化平台。
    Abstract Recent transfer learning (TL) approaches in industrial intelligent fault diagnosis (FD) mostly follow the "pre-train and fine-tuning" paradigm to address data drift, which emerges from variable working conditions. However, we find that this approach is prone to the phenomenon known as catastrophic forgetting. Furthermore, performing frequent models fine-tuning on the resource-constrained edge nodes can be computationally expensive and unnecessary, given the excellent transferability demonstrated by existing models. In this work, we propose the Drift-Aware Weight Consolidation (DAWC), a method optimized for edge deployments, mitigating the challenges posed by frequent data drift in the industrial Internet of Things (IIoT). DAWC efficiently manages multiple data drift scenarios, minimizing the need for constant model fine-tuning on edge devices, thereby conserving computational resources. By detecting drift using classifier confidence and estimating parameter importance with the Fisher Information Matrix, a tool that measures parameter sensitivity in probabilistic models, we introduce a drift detection module and a continual learning module to gradually equip the FD model with powerful generalization capabilities. Experimental results demonstrate that our proposed DAWC achieves superior performance compared to existing techniques while also ensuring compatibility with edge computing constraints. Additionally, we have developed a comprehensive diagnosis and visualization platform.
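The abstract names two concrete ingredients: drift detection from classifier confidence and parameter importance estimated with the Fisher Information Matrix. The sketch below follows the standard diagonal-Fisher (EWC-style) recipe those ingredients suggest; DAWC's exact formulas and thresholds are not given in the abstract, so everything here is an assumed illustration.

```python
# Confidence-based drift check plus Fisher-weighted consolidation penalty.
import torch
import torch.nn.functional as F

def drift_detected(model, batch_x, threshold=0.6):
    """Flag drift when the mean max-softmax confidence drops below a threshold."""
    with torch.no_grad():
        conf = F.softmax(model(batch_x), dim=-1).max(dim=-1).values.mean()
    return conf.item() < threshold

def diagonal_fisher(model, loader):
    """Estimate parameter importance as the mean squared gradient of the NLL."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in loader:
        model.zero_grad()
        F.cross_entropy(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / max(len(loader), 1) for n, f in fisher.items()}

def consolidation_penalty(model, fisher, anchor_params, lam=100.0):
    """Quadratic penalty keeping important weights near their pre-drift values."""
    loss = 0.0
    for n, p in model.named_parameters():
        loss = loss + (fisher[n] * (p - anchor_params[n]) ** 2).sum()
    return lam * loss
```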

Serving Deep Learning Model in Relational Databases

  • paper_url: http://arxiv.org/abs/2310.04696
  • repo_url: None
  • paper_authors: Alexandre Eichenberger, Qi Lin, Saif Masood, Hong Min, Alexander Sim, Jie Wang, Yida Wang, Kesheng Wu, Binhang Yuan, Lixi Zhou, Jia Zou
  • for: 本研究旨在探讨如何在关系数据上执行深度学习(DL)模型,以满足不同的商业和科学领域的需求。
  • methods: 本文提出了三种重要的架构方法:DL-Centric architecture、UDF-Centric architecture和Relation-Centric architecture。这三种架构各有优势,但是它们之间存在许多挑战,需要进行融合和中间件技术的研究。
  • results: 本研究发现了许多整合方面的差距和挑战,并提出了一些创新的解决方案,以实现一个能够支撑各类数据密集型深度学习推理应用的数据处理与深度学习执行平台。
    Abstract Serving deep learning (DL) models on relational data has become a critical requirement across diverse commercial and scientific domains, sparking growing interest recently. In this visionary paper, we embark on a comprehensive exploration of representative architectures to address the requirement. We highlight three pivotal paradigms: The state-of-the-artDL-Centricarchitecture offloadsDL computations to dedicated DL frameworks. The potential UDF-Centric architecture encapsulates one or more tensor computations into User Defined Functions (UDFs) within the database system. The potentialRelation-Centricarchitecture aims to represent a large-scale tensor computation through relational operators. While each of these architectures demonstrates promise in specific use scenarios, we identify urgent requirements for seamless integration of these architectures and the middle ground between these architectures. We delve into the gaps that impede the integration and explore innovative strategies to close them. We present a pathway to establish a novel database system for enabling a broad class of data-intensive DL inference applications.
    摘要 优化深度学习(DL)模型在关系数据上的应用已成为不同领域的关键需求,最近吸引了很多关注。在这篇visionary论文中,我们进行了全面的探索,探讨了代表性的建筑方案。我们提出了三个重要的思想:1. 现状顶尖DL-Centric架构,将DL计算外送到专门的DL框架上。2. UDF-Centric架构,将一个或多个张量计算包装在用户定义函数(UDF)内部。3. Relation-Centric架构,通过关系运算来表示大规模张量计算。各种架构在特定使用场景中都有承诺,但是我们认为这些架构之间的协同和中间地带的融合是必要的。我们描述了这些架构之间的差距和融合的难点,并提出了创新的策略来填补这些差距。最后,我们提出了一种新的数据库系统,用于支持广泛的数据敏感DL推理应用。

Robustness-enhanced Uplift Modeling with Adversarial Feature Desensitization

  • paper_url: http://arxiv.org/abs/2310.04693
  • repo_url: None
  • paper_authors: Zexu Sun, Bowei He, Ming Ma, Jiakai Tang, Yuchen Wang, Chen Ma, Dugang Liu
  • for: 本文旨在解决在实际应用中存在的robustness挑战,提出了一种可能的解释,并采用了两个特定模块来进行稳定性提升。
  • methods: 本文提出了一种基于对抗训练和软插值操作的对抗特征脱敏模块,以及一种基于联合多标签建模的特征选择模块。
  • results: 经过广泛的实验 validate,RUAD可以更好地解决在线广告的feature敏感性问题,同时也能够保持和不同的uplift模型的兼容性。
    Abstract Uplift modeling has shown very promising results in online marketing. However, most existing works are prone to the robustness challenge in some practical applications. In this paper, we first present a possible explanation for the above phenomenon. We verify that there is a feature sensitivity problem in online marketing using different real-world datasets, where the perturbation of some key features will seriously affect the performance of the uplift model and even cause the opposite trend. To solve the above problem, we propose a novel robustness-enhanced uplift modeling framework with adversarial feature desensitization (RUAD). Specifically, our RUAD can more effectively alleviate the feature sensitivity of the uplift model through two customized modules, including a feature selection module with joint multi-label modeling to identify a key subset from the input features and an adversarial feature desensitization module using adversarial training and soft interpolation operations to enhance the robustness of the model against this selected subset of features. Finally, we conduct extensive experiments on a public dataset and a real product dataset to verify the effectiveness of our RUAD in online marketing. In addition, we also demonstrate the robustness of our RUAD to the feature sensitivity, as well as the compatibility with different uplift models.
    摘要 增益建模(uplift modeling)在在线营销中已展现出良好前景。然而,现有的大多数工作在一些实际应用中面临鲁棒性挑战。在这篇论文中,我们首先对上述现象给出一种可能的解释:我们利用不同的真实数据集验证了在线营销中存在特征敏感性问题,即对某些关键特征的扰动会严重影响增益模型的性能,甚至导致相反的趋势。为解决该问题,我们提出了一种带有对抗特征脱敏的鲁棒性增强增益建模框架(RUAD)。具体来说,RUAD 通过两个定制模块更有效地缓解增益模型的特征敏感性:一个采用联合多标签建模的特征选择模块,用于从输入特征中确定关键子集;以及一个利用对抗训练和软插值操作的对抗特征脱敏模块,用于增强模型对该特征子集的鲁棒性。最后,我们在公共数据集和真实产品数据集上进行了广泛实验,验证了 RUAD 在在线营销中的有效性,并展示了其对特征敏感性的鲁棒性以及与不同增益模型的兼容性。
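To make the desensitization idea concrete, the sketch below perturbs only a selected subset of feature columns in an FGSM-like way and then softly interpolates back toward the clean input. The feature-selection module and the uplift objective itself are omitted; this is only an assumed illustration of the adversarial-training-plus-soft-interpolation step, not RUAD's actual algorithm.

```python
# Illustrative adversarial desensitization step over a selected feature subset.
import torch

def desensitize_step(model, x, y, loss_fn, key_idx, eps=0.05, beta=0.5):
    """Return a robustness loss on softly-interpolated, partially perturbed inputs."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss_fn(model(x_adv), y).backward()
    with torch.no_grad():
        perturb = torch.zeros_like(x)
        perturb[:, key_idx] = eps * x_adv.grad[:, key_idx].sign()  # key features only
        x_mix = beta * x + (1 - beta) * (x + perturb)              # soft interpolation
    return loss_fn(model(x_mix), y)  # add this term to the main training objective
```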

Understanding and Improving Adversarial Attacks on Latent Diffusion Model

  • paper_url: http://arxiv.org/abs/2310.04687
  • repo_url: https://github.com/caradryanliang/improvedadvdm
  • paper_authors: Boyang Zheng, Chumeng Liang, Xiaoyu Wu, Yan Liu
  • for: 保护个人隐私和安全数据,防止未经授权的艺术作品复制和谣言生成。
  • methods: 基于理论框架对潜在扩散模型(LDM)发起对抗攻击,通过一个统一的目标同时引导前向与反向过程中的对抗优化。
  • results: 比对现有方法,提出了一种更加强大和有效的对抗攻击方法,可以在不同的状态对抗攻击下进行普适化。
    Abstract Latent Diffusion Model (LDM) has emerged as a leading tool in image generation, particularly with its capability in few-shot generation. This capability also presents risks, notably in unauthorized artwork replication and misinformation generation. In response, adversarial attacks have been designed to safeguard personal images from being used as reference data. However, existing adversarial attacks are predominantly empirical, lacking a solid theoretical foundation. In this paper, we introduce a comprehensive theoretical framework for understanding adversarial attacks on LDM. Based on the framework, we propose a novel adversarial attack that exploits a unified target to guide the adversarial attack both in the forward and the reverse process of LDM. We provide empirical evidences that our method overcomes the offset problem of the optimization of adversarial attacks in existing methods. Through rigorous experiments, our findings demonstrate that our method outperforms current attacks and is able to generalize over different state-of-the-art few-shot generation pipelines based on LDM. Our method can serve as a stronger and efficient tool for people exposed to the risk of data privacy and security to protect themselves in the new era of powerful generative models. The code is available on GitHub: https://github.com/CaradryanLiang/ImprovedAdvDM.git.
    摘要 潜在扩散模型(LDM)已成为图像生成领域的主导工具,尤其是其少样本生成能力。这种能力也带来风险,包括未经授权的艺术作品复制和虚假信息生成。为此,人们设计了对抗攻击来防止个人图像被用作参考数据,但现有的对抗攻击大多基于经验,缺乏坚实的理论基础。在这篇论文中,我们提出了一个理解 LDM 对抗攻击的完整理论框架;基于该框架,我们提出了一种新的对抗攻击方法,通过一个统一的目标同时引导 LDM 前向和反向过程中的对抗攻击。我们提供了实验证据,表明该方法克服了现有方法在对抗攻击优化中的偏移问题,并且能泛化到基于 LDM 的多种最新少样本生成流程。我们的方法可以作为一种更强大且高效的工具,帮助面临数据隐私与安全风险的人们在强大生成模型时代保护自己。代码可在 GitHub 获取:https://github.com/CaradryanLiang/ImprovedAdvDM.git。
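Such protective attacks are typically built on projected gradient ascent over an image; the sketch below shows that generic outer loop. The paper's unified LDM objective is abstracted as a placeholder callable `adv_loss`, so only the PGD scaffolding is shown, not the proposed attack itself.

```python
# Generic PGD loop that maximizes a differentiable loss within an L-inf ball.
import torch

def pgd_attack(image, adv_loss, eps=8 / 255, alpha=2 / 255, steps=20):
    """Return a perturbed image maximizing adv_loss, staying eps-close to image."""
    x_adv = image.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = adv_loss(x_adv)                         # placeholder LDM-based objective
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()                  # ascent step
            x_adv = image + (x_adv - image).clamp(-eps, eps)     # project to eps-ball
            x_adv = x_adv.clamp(0, 1)                            # keep valid pixels
    return x_adv.detach()
```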

Data-Centric Financial Large Language Models

  • paper_url: http://arxiv.org/abs/2310.17784
  • repo_url: None
  • paper_authors: Zhixuan Chu, Huaiyu Guo, Xinyuan Zhou, Yijia Wang, Fei Yu, Hong Chen, Wanqing Xu, Xin Lu, Qing Cui, Longfei Li, Jun Zhou, Sheng Li
  • for: This paper aims to improve the performance of large language models (LLMs) in financial tasks by using a data-centric approach and multitask prompt-based finetuning.
  • methods: The proposed method uses a financial LLM (FLLM) and abductive augmentation reasoning (AAR) to generate training data and preprocess the input data.
  • results: The data-centric FLLM with AAR achieves state-of-the-art performance on financial analysis and interpretation tasks, outperforming baseline financial LLMs designed for raw text. Additionally, a new benchmark for financial analysis and interpretation is open-sourced.
    Abstract Large language models (LLMs) show promise for natural language tasks but struggle when applied directly to complex domains like finance. LLMs have difficulty reasoning about and integrating all relevant information. We propose a data-centric approach to enable LLMs to better handle financial tasks. Our key insight is that rather than overloading the LLM with everything at once, it is more effective to preprocess and pre-understand the data. We create a financial LLM (FLLM) using multitask prompt-based finetuning to achieve data pre-processing and pre-understanding. However, labeled data is scarce for each task. To overcome manual annotation costs, we employ abductive augmentation reasoning (AAR) to automatically generate training data by modifying the pseudo labels from FLLM's own outputs. Experiments show our data-centric FLLM with AAR substantially outperforms baseline financial LLMs designed for raw text, achieving state-of-the-art on financial analysis and interpretation tasks. We also open source a new benchmark for financial analysis and interpretation. Our methodology provides a promising path to unlock LLMs' potential for complex real-world domains.

Automatic and Efficient Customization of Neural Networks for ML Applications

  • paper_url: http://arxiv.org/abs/2310.04685
  • repo_url: None
  • paper_authors: Yuhan Liu, Chengcheng Wan, Kuntai Du, Henry Hoffmann, Junchen Jiang, Shan Lu, Michael Maire
  • for: 这项研究旨在解决现有机器学习(ML)API 的问题:不同应用程序以不同方式使用 ML API 的输出,API 却对所有应用提供相同的预训练模型。
  • methods: 该研究使用了77个实际应用程序,总共使用了6个ML API提供商的API,以探索这些应用程序如何使用ML API输出来影响它们的决策过程。
  • results: 研究发现,使用ChameleonAPI优化框架可以减少不正确的应用程序决策数量,相比基准值,减少了43%。
    Abstract ML APIs have greatly relieved application developers of the burden to design and train their own neural network models -- classifying objects in an image can now be as simple as one line of Python code to call an API. However, these APIs offer the same pre-trained models regardless of how their output is used by different applications. This can be suboptimal as not all ML inference errors can cause application failures, and the distinction between inference errors that can or cannot cause failures varies greatly across applications. To tackle this problem, we first study 77 real-world applications, which collectively use six ML APIs from two providers, to reveal common patterns of how ML API output affects applications' decision processes. Inspired by the findings, we propose ChameleonAPI, an optimization framework for ML APIs, which takes effect without changing the application source code. ChameleonAPI provides application developers with a parser that automatically analyzes the application to produce an abstract of its decision process, which is then used to devise an application-specific loss function that only penalizes API output errors critical to the application. ChameleonAPI uses the loss function to efficiently train a neural network model customized for each application and deploys it to serve API invocations from the respective application via existing interface. Compared to a baseline that selects the best-of-all commercial ML API, we show that ChameleonAPI reduces incorrect application decisions by 43%.
    摘要 机器学习(ML)API 大大减轻了应用开发者自行设计和训练神经网络模型的负担——如今只需一行 Python 代码调用 API 即可对图像中的对象进行分类。然而,无论各个应用如何使用其输出,这些 API 提供的都是相同的预训练模型,这可能并非最优:并非所有 ML 推理错误都会导致应用失败,而会与不会导致失败的错误在不同应用之间差异很大。为解决这个问题,我们首先研究了 77 个真实应用程序(共使用来自两家提供商的六个 ML API),揭示了 ML API 输出影响应用决策过程的常见模式。受此启发,我们提出了 ChameleonAPI,一个无需修改应用源代码即可生效的 ML API 优化框架。ChameleonAPI 为应用开发者提供一个解析器,自动分析应用并生成其决策过程的抽象,再据此构造应用特定的损失函数,只惩罚对该应用决策关键的 API 输出错误。ChameleonAPI 利用该损失函数为每个应用高效训练定制的神经网络模型,并通过现有接口部署,为相应应用的 API 调用提供服务。与选择所有商业 ML API 中最佳者的基线相比,ChameleonAPI 将错误的应用决策减少了 43%。
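A toy version of such an application-aware loss is sketched below: classification mistakes that do not change the downstream decision are not penalized. The label-to-decision mapping is invented for illustration and is not ChameleonAPI's parser output or its actual loss.

```python
# Toy decision-aware loss: only penalize errors that flip the app's decision.
import torch
import torch.nn.functional as F

label_to_decision = torch.tensor([0, 0, 1, 1, 2])  # assumed: 5 classes -> 3 app branches

def decision_aware_loss(logits, targets):
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    pred_decision = label_to_decision[logits.argmax(dim=-1)]
    true_decision = label_to_decision[targets]
    critical = (pred_decision != true_decision).float()  # decision-flipping errors only
    return (critical * per_sample).mean()
```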

VoiceExtender: Short-utterance Text-independent Speaker Verification with Guided Diffusion Model

  • paper_url: http://arxiv.org/abs/2310.04681
  • repo_url: None
  • paper_authors: Yayun He, Zuheng Kang, Jianzong Wang, Junqing Peng, Jing Xiao
  • for: 提高短语音识别性能(Speaker Verification,SV),特别是处理短时间语音信号。
  • methods: 我们提出了一种名为 VoiceExtender 的新架构,使用两个受引导的扩散模型(内置与外部说话人嵌入(SE)引导的扩散模型),二者均利用基于扩散模型的样本生成器,在 SE 引导下基于短语音增强语音特征。
  • results: 我们的方法在VoxCeleb1数据集上进行了广泛的实验,与基准方法相比,我们的方法在短语音Conditions下(0.5, 1.0, 1.5, 2.0秒)实现了相对改善46.1%, 35.7%, 10.4%, 5.7%。
    Abstract Speaker verification (SV) performance deteriorates as utterances become shorter. To this end, we propose a new architecture called VoiceExtender which provides a promising solution for improving SV performance when handling short-duration speech signals. We use two guided diffusion models, the built-in and the external speaker embedding (SE) guided diffusion model, both of which utilize a diffusion model-based sample generator that leverages SE guidance to augment the speech features based on a short utterance. Extensive experimental results on the VoxCeleb1 dataset show that our method outperforms the baseline, with relative improvements in equal error rate (EER) of 46.1%, 35.7%, 10.4%, and 5.7% for the short utterance conditions of 0.5, 1.0, 1.5, and 2.0 seconds, respectively.

The Cost of Down-Scaling Language Models: Fact Recall Deteriorates before In-Context Learning

  • paper_url: http://arxiv.org/abs/2310.04680
  • repo_url: None
  • paper_authors: Tian Jin, Nolan Clement, Xin Dong, Vaishnavh Nagarajan, Michael Carbin, Jonathan Ragan-Kelley, Gintare Karolina Dziugaite
  • for: 这个研究探讨了大语言模型(LLM)中缩放参数数量对其核心能力的影响。
  • methods: 研究使用了两种自然缩放技术:weight pruning和训练小型或大型模型(dense scaling),并对两种核心能力:在预训练中提供的信息回忆和在推理中处理信息进行分析。
  • results: 研究发现,通过减少模型大小超过30%(via either scaling approach)会显著降低在预训练中提供的信息回忆的能力。然而,减少模型大小60-70%可以保留在context中的多种信息处理方式,包括从长文本中检索答案和从句子中学习参数化函数。这种行为表明缩放模型大小对于信息回忆和在context中学习有不同的影响。
    Abstract How does scaling the number of parameters in large language models (LLMs) affect their core capabilities? We study two natural scaling techniques -- weight pruning and simply training a smaller or larger model, which we refer to as dense scaling -- and their effects on two core capabilities of LLMs: (a) recalling facts presented during pre-training and (b) processing information presented in-context during inference. By curating a suite of tasks that help disentangle these two capabilities, we find a striking difference in how these two abilities evolve due to scaling. Reducing the model size by more than 30\% (via either scaling approach) significantly decreases the ability to recall facts seen in pre-training. Yet, a 60--70\% reduction largely preserves the various ways the model can process in-context information, ranging from retrieving answers from a long context to learning parameterized functions from in-context exemplars. The fact that both dense scaling and weight pruning exhibit this behavior suggests that scaling model size has an inherently disparate effect on fact recall and in-context learning.
    摘要 缩放大语言模型(LLM)的参数数量会如何影响其核心能力?我们研究了两种自然的缩放技术——权重剪枝,以及直接训练更小或更大的模型(我们称之为稠密缩放)——对 LLM 两项核心能力的影响:(a)回忆预训练中见过的事实,以及(b)处理推理时在上下文中给出的信息。我们构建了一组任务来帮助区分这两种能力。我们发现,无论采用哪种缩放方式,将模型规模缩减超过 30% 都会显著削弱回忆预训练事实的能力;然而,缩减 60–70% 时,模型处理上下文信息的各种方式——从在长上下文中检索答案到从上下文示例中学习参数化函数——在很大程度上得以保留。稠密缩放与权重剪枝都表现出这一规律,说明缩放模型规模对事实回忆与上下文学习具有本质上不同的影响。
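The two "scaling down" knobs compared above can be made concrete as follows: unstructured magnitude pruning of linear layers versus simply instantiating a narrower network. The abstract does not specify the exact pruning scheme, so L1 magnitude pruning is used here as a common stand-in.

```python
# Weight pruning vs. dense scaling, illustrated on simple Linear stacks.
import torch.nn as nn
import torch.nn.utils.prune as prune

def magnitude_prune(model, amount=0.3):
    """Zero out the smallest-magnitude weights in every Linear layer."""
    for module in model.modules():
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=amount)
            prune.remove(module, "weight")  # bake the zeros into the weight tensor
    return model

# dense scaling is just a smaller architecture, e.g. half the hidden width
small = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 512))
pruned = magnitude_prune(nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512)))
```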

LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT

  • paper_url: http://arxiv.org/abs/2310.04673
  • repo_url: None
  • paper_authors: Jiaming Wang, Zhihao Du, Qian Chen, Yunfei Chu, Zhifu Gao, Zerui Li, Kai Hu, Xiaohuan Zhou, Jin Xu, Ziyang Ma, Wen Wang, Siqi Zheng, Chang Zhou, Zhijie Yan, Shiliang Zhang
  • for: 这篇论文的目的是提出一种基于Transformer框架的通用语言模型,用于音频识别、理解和生成。
  • methods: 这篇论文结合连续与离散特征来编码输入音频,然后使用一个大型仅解码器(decoder-only)Transformer 语言模型进行有监督的多任务学习。
  • results: 实验表明,LauraGPT在多种音频处理标准准点上达到了或超过现有最佳性能。
    Abstract Generative Pre-trained Transformer (GPT) models have achieved remarkable performance on various natural language processing tasks. However, there has been limited research on applying similar frameworks to audio tasks. Previously proposed large language models for audio tasks either lack sufficient quantitative evaluations, or are limited to tasks for recognizing and understanding audio content, or significantly underperform existing state-of-the-art (SOTA) models. In this paper, we propose LauraGPT, a unified GPT model for audio recognition, understanding, and generation. LauraGPT is a versatile language model that can process both audio and text inputs and generate outputs in either modalities. It can perform a wide range of tasks related to content, semantics, paralinguistics, and audio-signal analysis. Some of its noteworthy tasks include automatic speech recognition, speech-to-text translation, text-to-speech synthesis, machine translation, speech enhancement, automated audio captioning, speech emotion recognition, and spoken language understanding. To achieve this goal, we use a combination of continuous and discrete features for audio. We encode input audio into continuous representations using an audio encoder and decode output audio from discrete codec codes. We then fine-tune a large decoder-only Transformer-based language model on multiple audio-to-text, text-to-audio, audio-to-audio, and text-to-text tasks using a supervised multitask learning approach. Extensive experiments show that LauraGPT achieves competitive or superior performance compared to existing SOTA models on various audio processing benchmarks.
    摘要 生成式预训练 Transformer(GPT)模型在各种自然语言处理任务上取得了卓越表现,但将类似框架应用于音频任务的研究仍然有限。此前为音频任务提出的大语言模型要么缺乏充分的定量评估,要么仅限于音频内容的识别与理解任务,要么明显落后于现有的最先进(SOTA)模型。在这篇论文中,我们提出 LauraGPT,一个用于音频识别、理解与生成的统一 GPT 模型。LauraGPT 是一个通用的语言模型,可以处理音频和文本输入,并以任一模态生成输出,能够执行与内容、语义、副语言及音频信号分析相关的多种任务,包括自动语音识别、语音到文本翻译、文本到语音合成、机器翻译、语音增强、自动音频描述、语音情感识别以及口语理解等。为实现这一目标,我们结合连续与离散特征来表示音频:使用音频编码器将输入音频编码为连续表示,并从离散编解码码字中解码输出音频;随后以有监督多任务学习的方式,在多个音频到文本、文本到音频、音频到音频以及文本到文本任务上微调一个大型仅解码器 Transformer 语言模型。大量实验表明,LauraGPT 在多种音频处理基准上取得了与现有 SOTA 模型相当或更优的表现。

Label-free Node Classification on Graphs with Large Language Models (LLMS)

  • paper_url: http://arxiv.org/abs/2310.04668
  • repo_url: https://github.com/currytang/llmgnn
  • paper_authors: Zhikai Chen, Haitao Mao, Hongzhi Wen, Haoyu Han, Wei Jin, Haiyang Zhang, Hui Liu, Jiliang Tang
  • for: 这个研究的目的是发展一个没有标签的节点分类框架,即LLM-GNN,以便在节点资料上进行分类。
  • methods: 这个研究结合了大型语言模型(LLM)与图神经网络(GNN)两种技术以取得更好的性能。具体来说,先用 LLM 标注一小部分节点,再在这些标注上训练 GNN,对其余节点进行预测。
  • results: 实验结果显示,LLM-GNN可以在广泛的数据集上达到74.9%的精度,而且在训练成本下than 1 dollar。
    Abstract In recent years, there have been remarkable advancements in node classification achieved by Graph Neural Networks (GNNs). However, they necessitate abundant high-quality labels to ensure promising performance. In contrast, Large Language Models (LLMs) exhibit impressive zero-shot proficiency on text-attributed graphs. Yet, they face challenges in efficiently processing structural data and suffer from high inference costs. In light of these observations, this work introduces a label-free node classification on graphs with LLMs pipeline, LLM-GNN. It amalgamates the strengths of both GNNs and LLMs while mitigating their limitations. Specifically, LLMs are leveraged to annotate a small portion of nodes and then GNNs are trained on LLMs' annotations to make predictions for the remaining large portion of nodes. The implementation of LLM-GNN faces a unique challenge: how can we actively select nodes for LLMs to annotate and consequently enhance the GNN training? How can we leverage LLMs to obtain annotations of high quality, representativeness, and diversity, thereby enhancing GNN performance with less cost? To tackle this challenge, we develop an annotation quality heuristic and leverage the confidence scores derived from LLMs to advanced node selection. Comprehensive experimental results validate the effectiveness of LLM-GNN. In particular, LLM-GNN can achieve an accuracy of 74.9% on a vast-scale dataset \products with a cost less than 1 dollar.
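The annotate-then-train pipeline can be sketched as below: pick a small, diverse node subset for the LLM to label, keep only confident pseudo-labels, and train the GNN on them. The acquisition function and threshold here are generic stand-ins, not the paper's annotation-quality heuristic, and `llm_annotate` would be a prompt-based LLM call that is omitted.

```python
# Node selection and confidence filtering for LLM-generated pseudo-labels.
import torch

def select_nodes(features, budget):
    """Toy acquisition: farthest-point sampling on node features for diversity."""
    chosen = [0]
    dist = torch.cdist(features, features[chosen]).min(dim=1).values
    for _ in range(budget - 1):
        nxt = int(dist.argmax())
        chosen.append(nxt)
        dist = torch.minimum(dist, torch.cdist(features, features[[nxt]]).squeeze(1))
    return chosen

def filter_by_confidence(labels, confidences, tau=0.8):
    """Keep only pseudo-labels whose LLM-reported confidence is at least tau."""
    keep = [i for i, c in enumerate(confidences) if c >= tau]
    return keep, [labels[i] for i in keep]
```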

HalluciDet: Hallucinating RGB Modality for Person Detection Through Privileged Information

  • paper_url: http://arxiv.org/abs/2310.04662
  • repo_url: None
  • paper_authors: Heitor Rapela Medeiros, Fidel A. Guerrero Pena, Masih Aminbeidokhti, Thomas Dubail, Eric Granger, Marco Pedersoli
  • for: 这个论文是用于对于具有大跨模式转换的Visul recognition任务中的人像检测,以提高检测性能。
  • methods: 这个论文使用了一种叫做HalluciDet的IR-RGB图像转换模型,这个模型不是将RGB图像转换为IR图像,而是将IR图像转换为一个能够增强物体的新图像表示。
  • results: 这个论文的实验结果显示,使用HalluciDet模型可以大幅提高人像检测精度,并且比起使用state-of-the-art图像转换方法以及对IR图像进行精度调整,能够更好地适应训练数据的差异。
    Abstract A powerful way to adapt a visual recognition model to a new domain is through image translation. However, common image translation approaches only focus on generating data from the same distribution of the target domain. In visual recognition tasks with complex images, such as pedestrian detection on aerial images with a large cross-modal shift in data distribution from Infrared (IR) to RGB images, a translation focused on generation might lead to poor performance as the loss focuses on irrelevant details for the task. In this paper, we propose HalluciDet, an IR-RGB image translation model for object detection that, instead of focusing on reconstructing the original image on the IR modality, is guided directly on reducing the detection loss of an RGB detector, and therefore avoids the need to access RGB data. This model produces a new image representation that enhances the object of interest in the scene and greatly improves detection performance. We empirically compare our approach against state-of-the-art image translation methods as well as with the commonly used fine-tuning on IR, and show that our method improves detection accuracy in most cases, by exploiting the privileged information encoded in a pre-trained RGB detector.
    摘要 一种强大的方法是通过图像翻译来适应新领域的视觉识别模型。然而,常见的图像翻译方法只关注生成数据与目标领域的同分布。在复杂的视觉任务中,如人员检测在航空图像上的红外(IR)到RGB图像之间的大跨模态差,一种专注于生成的翻译方法可能会导致性能下降,因为损失将关注无关于任务的细节。在这篇论文中,我们提出了HalluciDet,一种用于对象检测的IR-RGB图像翻译模型。这种模型不是专注于重建原始IR图像,而是通过直接减少RGB检测器的检测损失来指导,因此不需要访问RGB数据。这种模型生成的新图像表示法可以增强Scene中的对象,并大幅提高检测性能。我们对比了我们的方法与现有的图像翻译方法以及通常使用的RGB数据练习,并证明了我们的方法在大多数情况下可以提高检测精度,通过利用预训练的RGB检测器中嵌入的特权信息。

Do self-supervised speech and language models extract similar representations as human brain?

  • paper_url: http://arxiv.org/abs/2310.04645
  • repo_url: None
  • paper_authors: Peili Chen, Linyang He, Li Fu, Lu Fan, Edward F. Chang, Yuanning Li
  • for: 这个论文主要研究了 SSL 模型在语言理解中的表现,以及它们与大脑活动的相似性。
  • methods: 研究者使用了两种代表性 SSL 模型,即 Wav2Vec2.0 和 GPT-2,来评估大脑预测性能。
  • results: 研究发现,这两种模型在 auditory cortex 中都能准确预测语音响应,并且它们之间的大脑预测相互吻合。另外,共享的语音上下文信息在这两种模型中占据了大脑活动变化的主要贡献,超过静态 semantics 和 lower-level acoustic-phonetic 信息。这些结果表明 SSL 模型中的语音上下文表示 converge 到大脑 beneath speech perception 的网络,并且它们与大脑的语言处理机制相似。
    Abstract Speech and language models trained through self-supervised learning (SSL) demonstrate strong alignment with brain activity during speech and language perception. However, given their distinct training modalities, it remains unclear whether they correlate with the same neural aspects. We directly address this question by evaluating the brain prediction performance of two representative SSL models, Wav2Vec2.0 and GPT-2, designed for speech and language tasks. Our findings reveal that both models accurately predict speech responses in the auditory cortex, with a significant correlation between their brain predictions. Notably, shared speech contextual information between Wav2Vec2.0 and GPT-2 accounts for the majority of explained variance in brain activity, surpassing static semantic and lower-level acoustic-phonetic information. These results underscore the convergence of speech contextual representations in SSL models and their alignment with the neural network underlying speech perception, offering valuable insights into both SSL models and the neural basis of speech and language processing.
    摘要 通过自监督学习(SSL)训练的语音和语言模型,与人脑在语音与语言感知过程中的活动表现出很强的一致性。然而,由于二者训练方式不同,它们是否对应于相同的神经机制仍不清楚。我们直接回答这个问题,评估了两个分别面向语音与语言任务的代表性 SSL 模型——Wav2Vec2.0 和 GPT-2——的脑活动预测性能。结果显示,这两个模型都能准确预测听觉皮层的语音响应,且二者的脑预测之间存在显著相关。值得注意的是,Wav2Vec2.0 与 GPT-2 共享的语音上下文信息解释了脑活动变化的大部分方差,超过静态语义信息和低层的声学-音位信息。这些结果表明 SSL 模型中的语音上下文表示趋于一致,并与支撑语音感知的神经网络相吻合,为理解 SSL 模型以及语音与语言加工的神经基础提供了有价值的启示。
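Comparisons of this kind are usually run as encoding models: ridge regression from model activations to each recording channel, scored by held-out correlation. The abstract does not spell out the exact pipeline, so the sketch below is the standard recipe with random placeholder arrays standing in for real activations and neural data.

```python
# Generic encoding-model analysis: features -> neural responses via ridge regression.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

T, D, E = 2000, 768, 64                 # time points, feature dim, electrodes (assumed)
feats = np.random.randn(T, D)           # e.g. Wav2Vec2.0 or GPT-2 layer activations
brain = np.random.randn(T, E)           # neural responses (placeholder data)

Xtr, Xte, Ytr, Yte = train_test_split(feats, brain, test_size=0.2, random_state=0)
model = Ridge(alpha=100.0).fit(Xtr, Ytr)
pred = model.predict(Xte)
r = np.array([np.corrcoef(pred[:, e], Yte[:, e])[0, 1] for e in range(E)])
print("mean prediction correlation:", r.mean())
```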

Automatic Anonymization of Swiss Federal Supreme Court Rulings

  • paper_url: http://arxiv.org/abs/2310.04632
  • repo_url: None
  • paper_authors: Joel Niklaus, Robin Mamié, Matthias Stürmer, Daniel Brunner, Marcel Gygli
  • for: 本研究旨在改进法院判决公开发布所需的匿名化流程,在必要时保护所有相关人员。
  • methods: 现有系统结合了多种传统计算方法与人工专家。本研究利用大量标注了待匿名化实体的数据集来增强现有的匿名化软件,并比较了基于 BERT 的模型与在领域内数据上进一步预训练的模型。
  • results: 结果表明,使用领域内数据预训练模型可使 F1 分数相比现有模型进一步提升 5% 以上。本研究还表明,将正则表达式等现有匿名化方法与机器学习相结合,可以进一步减少人工劳动并改进自动建议。
    Abstract Releasing court decisions to the public relies on proper anonymization to protect all involved parties, where necessary. The Swiss Federal Supreme Court relies on an existing system that combines different traditional computational methods with human experts. In this work, we enhance the existing anonymization software using a large dataset annotated with entities to be anonymized. We compared BERT-based models with models pre-trained on in-domain data. Our results show that using in-domain data to pre-train the models further improves the F1-score by more than 5\% compared to existing models. Our work demonstrates that combining existing anonymization methods, such as regular expressions, with machine learning can further reduce manual labor and enhance automatic suggestions.
    摘要 向公众发布法院判决时,必须采用恰当的匿名化方法,在必要时保护所有相关方。瑞士联邦最高法院现有的系统结合了多种传统计算方法与人工专家。在这项工作中,我们利用一个标注了大量待匿名化实体的数据集来增强现有的匿名化软件,并比较了基于 BERT 的模型与在领域内数据上进一步预训练的模型。结果表明,使用领域内数据预训练模型可使 F1 分数比现有模型进一步提升 5% 以上。我们的工作表明,将正则表达式等现有匿名化方法与机器学习相结合,可以进一步减少人工劳动并改进自动建议。
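The entity-tagging step described above is a standard token-classification setup; a minimal sketch is shown below. The model name and label set are placeholders rather than the court's in-domain model, and the reported gains come from fine-tuning on annotated rulings, which is omitted here.

```python
# Minimal token-classification setup for flagging entities to anonymize.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["O", "B-PERSON", "I-PERSON", "B-LOCATION", "I-LOCATION"]  # assumed label set
tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(labels)
)

# Untrained head: predictions are random until fine-tuned on annotated rulings.
enc = tok("A. Muster wohnt in Bern.", return_tensors="pt")
with torch.no_grad():
    pred = model(**enc).logits.argmax(dim=-1)[0]
print([labels[i] for i in pred])
```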

SERA:Sample Efficient Reward Augmentation in offline-to-online Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2310.19805
  • repo_url: None
  • paper_authors: Ziqi Zhang, Xiao Xiong, Zifeng Zhuang, Jinxin Liu, Donglin Wang
  • for: The paper aims to improve the performance of online fine-tuning in reinforcement learning (RL) by addressing the issue of diminished exploration in direct fine-tuning of offline pre-trained policies.
  • methods: The proposed method, called Sample Efficient Reward Augmentation (SERA), uses a generalized reward augmentation framework to improve exploration during online fine-tuning. SERA includes two components: State Marginal Matching (SMM) and penalization of out-of-distribution (OOD) state actions.
  • results: The paper demonstrates that SERA consistently and effectively enhances the performance of various offline algorithms in offline-to-online problems, achieving better online fine-tuning results. Additionally, SERA is versatile and can be effortlessly plugged into various RL algorithms to improve online fine-tuning and ensure sustained asymptotic improvement.
    Abstract A prospective application of offline reinforcement learning (RL) involves initializing a pre-trained policy using existing static datasets for subsequent online fine-tuning. However, direct fine-tuning of the offline pre-trained policy often results in sub-optimal performance. A primary reason is that offline conservative methods diminish the agent's capability of exploration, thereby impacting online fine-tuning performance. To enhance exploration during online fine-tuning and thus enhance the overall online fine-tuning performance, we introduce a generalized reward augmentation framework called Sample Efficient Reward Augmentation (SERA). SERA aims to improve the performance of online fine-tuning by designing intrinsic rewards that encourage the agent to explore. Specifically, it implicitly implements State Marginal Matching (SMM) and penalizes out-of-distribution (OOD) state actions, thus encouraging agents to cover the target state density, and achieving better online fine-tuning results. Additionally, SERA can be effortlessly plugged into various RL algorithms to improve online fine-tuning and ensure sustained asymptotic improvement, showing the versatility as well as the effectiveness of SERA. Moreover, extensive experimental results will demonstrate that when conducting offline-to-online problems, SERA consistently and effectively enhances the performance of various offline algorithms.
    摘要 离线强化学习(RL)的一个潜在应用,是先用现有的静态数据集预训练策略,再进行后续的在线微调。然而,直接微调离线预训练的策略往往导致次优的表现,主要原因是离线保守式方法削弱了智能体的探索能力,从而影响在线微调的效果。为了在在线微调阶段增强探索并提升整体表现,我们提出了一个通用的奖励增强框架——样本高效奖励增强(Sample Efficient Reward Augmentation, SERA)。SERA 通过设计鼓励智能体探索的内在奖励来提升在线微调的表现:它隐式地实现状态边缘匹配(SMM),并惩罚分布外(OOD)的状态动作,从而鼓励智能体覆盖目标状态密度,取得更好的在线微调结果。此外,SERA 可以方便地插入各种 RL 算法中,以改进在线微调并保证持续的渐近提升,体现了其通用性与有效性。大量实验结果也表明,在离线到在线(offline-to-online)问题上,SERA 能稳定且有效地提升多种离线算法的表现。
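Schematically, reward augmentation of this kind combines the environment reward with an exploration bonus and an out-of-distribution penalty. The abstract does not give SERA's formulas, so both bonus terms below are assumed placeholders meant only to show where such terms enter the fine-tuning loop.

```python
# Schematic augmented reward for online fine-tuning of an offline-pretrained policy.
def augmented_reward(r_env, state, action, density_model, ood_score,
                     beta=0.1, lam=1.0):
    """density_model and ood_score are placeholder callables returning scalars."""
    bonus = beta * (1.0 - density_model(state))   # reward rarely visited states (SMM-like)
    penalty = lam * ood_score(state, action)      # discourage OOD state-actions
    return r_env + bonus - penalty
```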

cs.CL - 2023-10-07

Large Language Models Only Pass Primary School Exams in Indonesia: A Comprehensive Test on IndoMMLU

  • paper_url: http://arxiv.org/abs/2310.04928
  • repo_url: https://github.com/fajri91/indommlu
  • paper_authors: Fajri Koto, Nurul Aisyah, Haonan Li, Timothy Baldwin
  • for: 这个论文的目的是评估大型自然语言处理模型(LLM)在印度尼西亚文化和语言方面的能力。
  • methods: 这篇论文使用了多个任务的语言理解准确率来评估LLM的能力,并采用了专业教师编写的14,981个问题,涵盖了印度尼西亚的主要教育水平和语言。
  • results: 研究发现,GPT-3.5只能在印度尼西亚的基础教育水平上达到了标准,而其对当地印度尼西亚语言和文化的认知有限。其他较小的模型如BLOOMZ和Falcon也表现在更低的水平上。
    Abstract Although large language models (LLMs) are often pre-trained on large-scale multilingual texts, their reasoning abilities and real-world knowledge are mainly evaluated based on English datasets. Assessing LLM capabilities beyond English is increasingly vital but hindered due to the lack of suitable datasets. In this work, we introduce IndoMMLU, the first multi-task language understanding benchmark for Indonesian culture and languages, which consists of questions from primary school to university entrance exams in Indonesia. By employing professional teachers, we obtain 14,981 questions across 64 tasks and education levels, with 46% of the questions focusing on assessing proficiency in the Indonesian language and knowledge of nine local languages and cultures in Indonesia. Our empirical evaluations show that GPT-3.5 only manages to pass the Indonesian primary school level, with limited knowledge of local Indonesian languages and culture. Other smaller models such as BLOOMZ and Falcon perform at even lower levels.
    摘要 (对大型语言模型(LLM)的认知能力和实际世界知识主要是通过英文数据集进行评估,但是评估 LLM 以外的能力具有越来越重要的 significations,但是受到数据不足的阻碍。在这个工作中,我们引入了印尼文化和语言的多任务语言理解测试 benchmark,名为 IndoMMLU,它包含印尼教育系统中的问题,从小学到大学入学考试。我们雇用了职业教师,获得了14,981个问题,涵盖64个任务和教育水平,其中46%的问题是评估印尼语言和九种地方语言和文化的知识。我们的实际评估显示,GPT-3.5只能通过印尼小学的水平,具有有限的本地印尼语言和文化知识。其他较小的模型,如BLOOMZ和Falcon,在更低的水平上表现。)

GradXKG: A Universal Explain-per-use Temporal Knowledge Graph Explainer

  • paper_url: http://arxiv.org/abs/2310.04889
  • repo_url: None
  • paper_authors: Chenhan Yuan, Hoda Eldardiry
  • for: 本研究的目的是提高 temporally knowledge graph (TKG) reasoning 模型的解释性,以便更好地理解 TKG 模型如何做出某个预测。
  • methods: 本研究使用了 two-stage gradient-based 方法,包括一个 Grad-CAM-inspired RGCN explainer 和一个 integrated gradients explainer,以解释 RGCN 基于 TKG 模型的预测。
  • results: 实验结果表明,GradXKG 可以快速提供对 TKG 模型预测的解释,并且可以帮助解释 TKG 模型如何使用时间维度来理解事实的演变。此外,GradXKG 可以对多种 RGCN 基于 TKG 模型进行解释,并且可以提供具体的时间点上的最重要节点的信息。
    Abstract Temporal knowledge graphs (TKGs) have shown promise for reasoning tasks by incorporating a temporal dimension to represent how facts evolve over time. However, existing TKG reasoning (TKGR) models lack explainability due to their black-box nature. Recent work has attempted to address this through customized model architectures that generate reasoning paths, but these recent approaches have limited generalizability and provide sparse explanatory output. To enable interpretability for most TKGR models, we propose GradXKG, a novel two-stage gradient-based approach for explaining Relational Graph Convolution Network (RGCN)-based TKGR models. First, a Grad-CAM-inspired RGCN explainer tracks gradients to quantify each node's contribution across timesteps in an efficient "explain-per-use" fashion. Second, an integrated gradients explainer consolidates importance scores for RGCN outputs, extending compatibility across diverse TKGR architectures based on RGCN. Together, the two explainers highlight the most critical nodes at each timestep for a given prediction. Our extensive experiments demonstrated that, by leveraging gradient information, GradXKG provides insightful explanations grounded in the model's logic in a timely manner for most RGCN-based TKGR models. This helps address the lack of interpretability in existing TKGR models and provides a universal explanation approach applicable across various models.
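The Grad-CAM-inspired part of the pipeline boils down to weighting node representations by the gradient of the prediction score. The sketch below is a generic autograd version of that idea, not GradXKG's full two-stage RGCN explainer; `score_fn` stands in for the model head that scores the predicted fact.

```python
# Grad-CAM-flavoured node importance from gradients of a prediction score.
import torch

def node_importance(node_emb, score_fn):
    """node_emb: (N, d) node embeddings at one timestep; score_fn maps them to a scalar."""
    emb = node_emb.clone().detach().requires_grad_(True)
    score = score_fn(emb)                      # e.g. the logit of the predicted fact
    grad = torch.autograd.grad(score, emb)[0]  # (N, d) sensitivity of the prediction
    weights = grad.mean(dim=0, keepdim=True)   # channel weights, as in Grad-CAM
    return torch.relu((weights * emb).sum(dim=1))  # one importance value per node
```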

Prompt-to-OS (P2OS): Revolutionizing Operating Systems and Human-Computer Interaction with Integrated AI Generative Models

  • paper_url: http://arxiv.org/abs/2310.04875
  • repo_url: None
  • paper_authors: Gabriele Tolomei, Cesare Campagnano, Fabrizio Silvestri, Giovanni Trappolini
  • for: 这篇论文旨在推动人机交互的重大变革,替代传统操作系统的概念。
  • methods: 这种新思维使用大量生成模型,如语言和扩散模型,作为计算机和用户之间的中间层。用户可以通过自然语言交流来与计算机进行交互,而不需要显式命令或复杂的导航。
  • results: 这种新方法不仅简化了用户交互,还开创了个性化体验的新可能性。生成模型可以根据用户的偏好进行适应,通过学习用户输入来不断改善其理解和回答生成能力。此外,它还提供了更多的可访问性,让用户可以通过语音或文本进行交互,适应不同的沟通方式。
    Abstract In this paper, we present a groundbreaking paradigm for human-computer interaction that revolutionizes the traditional notion of an operating system. Within this innovative framework, user requests issued to the machine are handled by an interconnected ecosystem of generative AI models that seamlessly integrate with or even replace traditional software applications. At the core of this paradigm shift are large generative models, such as language and diffusion models, which serve as the central interface between users and computers. This pioneering approach leverages the abilities of advanced language models, empowering users to engage in natural language conversations with their computing devices. Users can articulate their intentions, tasks, and inquiries directly to the system, eliminating the need for explicit commands or complex navigation. The language model comprehends and interprets the user's prompts, generating and displaying contextual and meaningful responses that facilitate seamless and intuitive interactions. This paradigm shift not only streamlines user interactions but also opens up new possibilities for personalized experiences. Generative models can adapt to individual preferences, learning from user input and continuously improving their understanding and response generation. Furthermore, it enables enhanced accessibility, as users can interact with the system using speech or text, accommodating diverse communication preferences. However, this visionary concept raises significant challenges, including privacy, security, trustability, and the ethical use of generative models. Robust safeguards must be in place to protect user data and prevent potential misuse or manipulation of the language model. While the full realization of this paradigm is still far from being achieved, this paper serves as a starting point for envisioning this transformative potential.
    摘要 在这篇论文中,我们提出了一种革命性的人机交互模式,推翻了传统的操作系统概念。在这个创新的框架下,用户对机器的请求由一个相互连接的生成AI模型系统处理,这些模型可以与或甚至代替传统的软件应用程序集成。核心在这个转变中是大型生成模型,如语言和扩散模型,它们成为用户和计算机之间的中间件。这一革命性的方法利用了先进的语言模型的能力,让用户可以通过自然语言对话来与计算机进行交互,不需要显式命令或复杂的导航。语言模型理解和解释用户的提示,生成和显示上下文相关的和有意义的回答,从而实现了流畅和直观的交互。这种模式不仅减少了用户交互的复杂性,还开启了新的个性化体验的可能性。生成模型可以根据个人偏好进行适应,通过用户的输入学习和不断改进其理解和回答生成能力。此外,它还实现了更高的可用性,用户可以通过语音或文本进行交互,适应不同的交流方式。然而,这种visionary概念也存在重大挑战,包括隐私、安全、信任性和生成模型的伦理使用。需要在设置robust的安全措施,保护用户数据,避免可能的滥用或 manipulate语言模型。虽然这种模式的实现仍然远未到达,但这篇论文作为一个开创的起点,激发了这种可能性的探索。

End-to-End Lip Reading in Romanian with Cross-Lingual Domain Adaptation and Lateral Inhibition

  • paper_url: http://arxiv.org/abs/2310.04858
  • repo_url: None
  • paper_authors: Emilian-Claudiu Mănescu, Răzvan-Alexandru Smădu, Andrei-Marius Avram, Dumitru-Clementin Cercel, Florin Pop
  • for: 本研究旨在提高lip reading或视觉speech recognition的效果,特别是在罕用语言 datasets 上。
  • methods: 本研究使用了多种架构和优化技术,包括丰富的正则化方法和cross-lingual domain adaptation。
  • results: 我们提出的方法取得了最先进的结果,并通过利用英语和德语数据集中的无标注视频帮助模型学习语言不变的特征。
    Abstract Lip reading or visual speech recognition has gained significant attention in recent years, particularly because of hardware development and innovations in computer vision. While considerable progress has been obtained, most models have only been tested on a few large-scale datasets. This work addresses this shortcoming by analyzing several architectures and optimizations on the underrepresented, short-scale Romanian language dataset called Wild LRRo. Most notably, we compare different backend modules, demonstrating the effectiveness of adding ample regularization methods. We obtain state-of-the-art results using our proposed method, namely cross-lingual domain adaptation and unlabeled videos from English and German datasets to help the model learn language-invariant features. Lastly, we assess the performance of adding a layer inspired by the neural inhibition mechanism.
    摘要 唇读(视觉语音识别)近年来获得了广泛关注,这主要得益于硬件发展和计算机视觉领域的创新。尽管已取得显著进展,大多数模型仅在少数大规模数据集上进行过测试。这项工作针对这一不足,在代表性不足、规模较小的罗马尼亚语数据集 Wild LRRo 上分析了多种架构与优化技术。其中最值得注意的是,我们比较了不同的后端模块,证明了充分使用正则化方法的有效性。我们提出的方法采用跨语言领域自适应,并利用英语和德语数据集中的无标注视频帮助模型学习语言不变的特征,取得了最新的最优结果。最后,我们评估了加入一个受神经抑制机制启发的层后的性能。

Parameterizing Context: Unleashing the Power of Parameter-Efficient Fine-Tuning and In-Context Tuning for Continual Table Semantic Parsing

  • paper_url: http://arxiv.org/abs/2310.04801
  • repo_url: None
  • paper_authors: Yongrui Chen, Shenyu Zhang, Guilin Qi, Xinnan Guo
  • for: 训练一个 continual table semantic parser,能够在不同任务下翻译自然语言到 SQL 表达式,但只提供有限的训练示例。
  • methods: 提出了一种 novel 的方法,结合 parameter-efficient fine-tuning (PEFT) 和 in-context tuning (ICT),用于训练 continual table semantic parser。该方法可以完全避免 catastrophic forgetting,并且可以在少量示例下进行准确的翻译。
  • results: 实验表明,该方法在两个 benchmark 上的多项指标上均优于常见的 few-shot 和 continual learning 基线。
    Abstract Continual table semantic parsing aims to train a parser on a sequence of tasks, where each task requires the parser to translate natural language into SQL based on task-specific tables but only offers limited training examples. Conventional methods tend to suffer from overfitting with limited supervision, as well as catastrophic forgetting due to parameter updates. Despite recent advancements that partially alleviate these issues through semi-supervised data augmentation and retention of a few past examples, the performance is still limited by the volume of unsupervised data and stored examples. To overcome these challenges, this paper introduces a novel method integrating \textit{parameter-efficient fine-tuning} (PEFT) and \textit{in-context tuning} (ICT) for training a continual table semantic parser. Initially, we present a task-adaptive PEFT framework capable of fully circumventing catastrophic forgetting, which is achieved by freezing the pre-trained model backbone and fine-tuning small-scale prompts. Building on this, we propose a teacher-student framework-based solution. The teacher addresses the few-shot problem using ICT, which procures contextual information by demonstrating a few training examples. In turn, the student leverages the proposed PEFT framework to learn from the teacher's output distribution, and subsequently compresses and saves the contextual information to the prompts, eliminating the need to store any training examples. Experimental evaluations on two benchmarks affirm the superiority of our method over prevalent few-shot and continual learning baselines across various metrics.
    摘要 连续表格语义解析的目标是在一系列任务上训练一个解析器,每个任务要求解析器基于任务特定的表格将自然语言翻译为 SQL,但只提供有限的训练示例。传统方法在有限监督下通常会出现过拟合,并因参数更新而产生灾难性遗忘。为了解决这些挑战,本文提出了一种将参数高效微调(PEFT)与上下文内调整(ICT)相结合的新方法,用于训练连续表格语义解析器。首先,我们提出了一种任务自适应的 PEFT 框架,通过冻结预训练模型骨干、仅微调小规模提示(prompts),完全避免了灾难性遗忘。在此基础上,我们提出了一种师生框架:教师通过 ICT 以少量示例获取上下文信息来应对 few-shot 问题;学生则利用所提的 PEFT 框架学习教师的输出分布,并将上下文信息压缩保存到提示中,从而无需存储任何训练示例。在两个 benchmark 上的实验评估表明,我们的方法在多项指标上优于常见的 few-shot 和连续学习基线。
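To make the PEFT idea above concrete, here is a minimal sketch of soft-prompt tuning with a frozen backbone, the general mechanism the abstract describes (freeze the pre-trained model, fine-tune small-scale prompts). The wrapper, sizes, and stand-in backbone are illustrative, not the paper's implementation.

```python
# Soft-prompt tuning: only a small prompt matrix is trainable, so earlier tasks
# are not overwritten by parameter updates to the backbone.
import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    def __init__(self, backbone: nn.Module, embed_dim: int, prompt_len: int = 20):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():      # freeze everything pre-trained
            p.requires_grad = False
        self.prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, embed_dim) already-embedded input
        batch = token_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return self.backbone(torch.cat([prompt, token_embeds], dim=1))

# Toy usage with a stand-in backbone.
backbone = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
model = SoftPromptWrapper(backbone, embed_dim=64)
out = model(torch.randn(2, 10, 64))
print([n for n, p in model.named_parameters() if p.requires_grad])  # only ['prompt']
```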

Chat Vector: A Simple Approach to Equip LLMs With New Language Chat Capabilities

  • paper_url: http://arxiv.org/abs/2310.04799
  • repo_url: None
  • paper_authors: Shih-Cheng Huang, Pin-Zu Li, Yu-Chi Hsu, Kuang-Ming Chen, Yu Tung Lin, Shih-Kai Hsiao, Richard Tzong-Han Tsai, Hung-yi Lee
  • for: 这 paper 的目的是探讨开发非英语语言的大型自然语言模型 (LLMs),特别是考虑人类偏好的Alignment。
  • methods: 我们提出了一种计算效率高的方法,利用 chat vector,以融合现有知识和行为在 LLMs 中,重新定义了传统训练方法的 paradigm,从 continual pre-train -> SFT -> RLHF 改为 continual pre-train + chat vector。
  • results: 我们的实验主要以 Traditional Chinese 为目标语言、以 LLaMA2 为基础模型,通过从 LLaMA2-chat 的权重中减去 LLaMA2 的预训练权重获得 chat vector。我们从三个方面进行评估:恶意语言、指令遵循能力以及多轮对话,发现 chat vector 在对话能力上表现出色。我们还将实验扩展到在韩语和简体中文上持续预训练的模型,展示了我们方法的通用性。总的来说,我们提供了一种高效的方法,可以在不同语言上将 LLMs 与人类偏好对齐。
    Abstract With the advancements in conversational AI, such as ChatGPT, this paper focuses on exploring developing Large Language Models (LLMs) for non-English languages, especially emphasizing alignment with human preferences. We introduce a computationally efficient method, leveraging chat vector, to synergize pre-existing knowledge and behaviors in LLMs, restructuring the conventional training paradigm from continual pre-train -> SFT -> RLHF to continual pre-train + chat vector. Our empirical studies, primarily focused on Traditional Chinese, employ LLaMA2 as the base model and acquire the chat vector by subtracting the pre-trained weights, LLaMA2, from the weights of LLaMA2-chat. Evaluating from three distinct facets, which are toxicity, ability of instruction following, and multi-turn dialogue demonstrates the chat vector's superior efficacy in chatting. To confirm the adaptability of our approach, we extend our experiments to include models pre-trained in both Korean and Simplified Chinese, illustrating the versatility of our methodology. Overall, we present a significant solution in aligning LLMs with human preferences efficiently across various languages, accomplished by the chat vector.
    摘要 随着对话式 AI(如 ChatGPT)的发展,这篇论文专注于为非英语语言开发大型语言模型(LLMs),特别强调与人类偏好的对齐。我们提出了一种计算高效的方法,利用 chat vector 融合 LLMs 中已有的知识和行为,将传统的训练流程从 continual pre-train -> SFT -> RLHF 重构为 continual pre-train + chat vector。我们的实验主要针对繁体中文,以 LLaMA2 为基础模型,通过从 LLaMA2-chat 的权重中减去 LLaMA2 的预训练权重获得 chat vector。从恶意语言、指令遵循能力和多轮对话三个方面的评估表明,chat vector 在对话能力上具有显著效果。为验证方法的适应性,我们还将实验扩展到在韩语和简体中文上预训练的模型,展示了该方法的通用性。总体而言,我们提出了一种通过 chat vector 在多种语言上高效对齐 LLMs 与人类偏好的方案。
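The chat-vector construction described in the abstract is plain weight arithmetic, so a small sketch may help. The checkpoints below are random stand-ins; in practice the tensors would come from LLaMA2, LLaMA2-chat, and a continually pre-trained target-language model with matching shapes.

```python
# chat_vector = weights(LLaMA2-chat) - weights(LLaMA2); later added to a model that
# was continually pre-trained on the target language.
import torch

def extract_chat_vector(base_state: dict, chat_state: dict) -> dict:
    """Parameter-wise difference between the chat-tuned and base checkpoints."""
    return {k: chat_state[k] - base_state[k] for k in base_state}

def apply_chat_vector(cp_state: dict, chat_vector: dict, scale: float = 1.0) -> dict:
    """Add the chat vector to a continually pre-trained target-language model."""
    return {k: cp_state[k] + scale * chat_vector[k] for k in cp_state if k in chat_vector}

# Toy usage with random tensors standing in for real checkpoints.
base = {"w": torch.randn(4, 4)}
chat = {"w": base["w"] + 0.1 * torch.randn(4, 4)}
cp   = {"w": base["w"] + 0.05 * torch.randn(4, 4)}
merged = apply_chat_vector(cp, extract_chat_vector(base, chat))
print(merged["w"].shape)
```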

FinGPT: Instruction Tuning Benchmark for Open-Source Large Language Models in Financial Datasets

  • paper_url: http://arxiv.org/abs/2310.04793
  • repo_url: https://github.com/ai4finance-foundation/fingpt
  • paper_authors: Neng Wang, Hongyang Yang, Christina Dan Wang
  • for: This paper focuses on the potential of GPT-based models in the financial sector and presents a distinctive approach for integrating these models with financial datasets.
  • methods: The paper introduces the Instruction Tuning paradigm for open-source large language models, which ensures a seamless and transparent integration of these models into financial contexts. The paper also presents a benchmarking scheme for end-to-end training and testing, including basic competencies such as Named Entity Recognition (NER) and sentiment analysis, as well as a comprehensive model that executes multi-task operations.
  • results: The paper explores the zero-shot capabilities of the proposed approach by testing the model on unseen tasks and incorporating novel datasets, demonstrating its adaptability in uncharted territories. The paper also highlights the effectiveness of the Instruction Tuning paradigm for immediate integration and the robust foundation it lays for future investigations in open-source financial large language models (FinLLMs).
    Abstract In the swiftly expanding domain of Natural Language Processing (NLP), the potential of GPT-based models for the financial sector is increasingly evident. However, the integration of these models with financial datasets presents challenges, notably in determining their adeptness and relevance. This paper introduces a distinctive approach anchored in the Instruction Tuning paradigm for open-source large language models, specifically adapted for financial contexts. Through this methodology, we capitalize on the interoperability of open-source models, ensuring a seamless and transparent integration. We begin by explaining the Instruction Tuning paradigm, highlighting its effectiveness for immediate integration. The paper presents a benchmarking scheme designed for end-to-end training and testing, employing a cost-effective progression. Firstly, we assess basic competencies and fundamental tasks, such as Named Entity Recognition (NER) and sentiment analysis to enhance specialization. Next, we delve into a comprehensive model, executing multi-task operations by amalgamating all instructional tunings to examine versatility. Finally, we explore the zero-shot capabilities by earmarking unseen tasks and incorporating novel datasets to understand adaptability in uncharted terrains. Such a paradigm fortifies the principles of openness and reproducibility, laying a robust foundation for future investigations in open-source financial large language models (FinLLMs).
    摘要 在快速扩展的自然语言处理(NLP)领域中,GPT基于模型在金融领域的潜在价值日益明显。然而,将这些模型与金融数据集 integrate 起来存在挑战,主要表现在确定其适应性和相关性的问题。本文介绍了一种特殊的approach,基于开源大语言模型的Instruction Tuning paradigm,专门适用于金融上下文。通过这种方法论,我们可以利用开源模型的兼容性,以无缝和透明的方式进行集成。我们首先介绍了Instruction Tuning paradigm,强调其在即时集成方面的效果。本文提出了一个cost-effective的benchmarking scheme,包括基本能力和基本任务的评估,如命名实体识别(NER)和情感分析,以提高特化。然后,我们探讨了一个全面的模型,通过将所有的instructional tunings融合来执行多任务操作,以评估其通用性。最后,我们探索了零样本(zero-shot)能力,通过标记未见过的任务和引入新的数据集来了解模型在未知领域的适应性。这种方法巩固了开放和可重现性的原则,为未来开源金融大语言模型(FinLLMs)的研究提供了坚实的基础。

Improving the Reliability of Large Language Models by Leveraging Uncertainty-Aware In-Context Learning

  • paper_url: http://arxiv.org/abs/2310.04782
  • repo_url: None
  • paper_authors: Yuchen Yang, Houqiang Li, Yanfeng Wang, Yu Wang
  • for: 提高大规模语言模型(LLM)的可靠性和可信度
  • methods: 使用不确定性感知框架,让模型能够自动识别和排除不确定答案,并考虑模型知识的限制
  • results: 经过评估,模型能够更好地回答问题,并且能够自动识别和排除不确定答案,提高了模型的可靠性和可信度
    Abstract In recent years, large-scale language models (LLMs) have gained attention for their impressive text generation capabilities. However, these models often face the challenge of "hallucination," which undermines their reliability. In this study, we introduce an uncertainty-aware in-context learning framework to empower the model to enhance or reject its output in response to uncertainty. Human-defined methods for estimating uncertainty typically assume that "uncertainty is lower when the model's response is correct compared to when it is incorrect." However, setting a precise threshold to distinguish correctness is challenging. Therefore, we introduce uncertainty information as an intermediary variable that implicitly influences the model's behavior. Our innovative uncertainty-aware in-context learning framework involves fine-tuning the LLM using a calibration dataset. Our aim is to improve the model's responses by filtering out answers with high uncertainty while considering the model's knowledge limitations. We evaluate the model's knowledge by examining multiple responses to the same question for the presence of a correct answer. When the model lacks relevant knowledge, the response should indicate that the question cannot be answered. Conversely, when the model has relevant knowledge, the response should provide the correct answer. Extensive experiments confirm the effectiveness of our framework, leading to two key findings. First, the logit output values of the LLM partly reflect inherent uncertainty. Second, our model autonomously recognizes uncertainty, resulting in improved responses.
    摘要 近年来,大规模语言模型(LLM)已引起关注,因其出色的文本生成能力。然而,这些模型经常面临“幻觉”挑战,这会损害其可靠性。在本研究中,我们提出了一种基于上下文学习的不确定性意识框架,以便让模型能够根据不确定性进行增强或拒绝输出。人类定义的不确定性估计方法通常认为,“不确定性较低时模型的响应正确性较高”。然而,确定正确性的精确阈值是困难的。因此,我们引入不确定性信息作为间接影响模型行为的变量。我们的创新的不确定性意识框架包括精度调整LLM使用准备集。我们的目标是通过筛选高不确定性答案来提高模型的响应,同时考虑模型的知识限制。我们评估模型的知识力通过对同一个问题的多个答案进行检查是否包含正确答案。当模型缺乏相关知识时,它应该返回问题无法答案。相反,当模型具备相关知识时,它应该提供正确答案。我们进行了广泛的实验,并确认了我们的框架的有效性,导致两项关键发现。首先,LLM的对数输出值有一定的内在不确定性。其次,我们的模型自动认出不确定性,从而提高了响应。
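As a rough illustration of the filtering idea (abstain when uncertainty is high), the sketch below estimates uncertainty from disagreement among several sampled answers. The sampler is a placeholder for an LLM call, and the paper's calibration fine-tuning and logit-based signals are not shown.

```python
# Answer with abstention: sample several answers, measure disagreement, reject if uncertain.
from collections import Counter
import math, random

def answer_with_abstention(sample_answer, question: str, n: int = 8, max_entropy: float = 0.7):
    answers = [sample_answer(question) for _ in range(n)]
    counts = Counter(answers)
    probs = [c / n for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)   # disagreement as an uncertainty proxy
    if entropy > max_entropy:
        return "I don't know"                        # reject uncertain outputs
    return counts.most_common(1)[0][0]

# Toy usage with a fake sampler that is unsure about the second question.
random.seed(0)
fake = lambda q: "Paris" if "France" in q else random.choice(["A", "B", "C"])
print(answer_with_abstention(fake, "Capital of France?"))
print(answer_with_abstention(fake, "An ambiguous question"))
```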

A New Dataset for End-to-End Sign Language Translation: The Greek Elementary School Dataset

  • paper_url: http://arxiv.org/abs/2310.04753
  • repo_url: None
  • paper_authors: Andreas Voskou, Konstantinos P. Panousis, Harris Partaourides, Kyriakos Tolias, Sotirios Chatzis
  • for: 本研究旨在提高听力困难者(HoH)与正常听力者之间的交流,提高 HoH 的社会生活质量和参与度。
  • methods: 本研究使用了现代 Transformer 型方法,并使用了一个新建立的 29653 个希腊手语视频翻译对,该对基于希腊Elementary School 的官方课程。
  • results: 研究结果表明,使用本研究 introduce 的 dataset 可以提高 SLT 研究的可用性和实际价值。
    Abstract Automatic Sign Language Translation (SLT) is a research avenue of great societal impact. End-to-End SLT facilitates the interaction of Hard-of-Hearing (HoH) with hearing people, thus improving their social life and opportunities for participation in social life. However, research within this frame of reference is still in its infancy, and current resources are particularly limited. Existing SLT methods are either of low translation ability or are trained and evaluated on datasets of restricted vocabulary and questionable real-world value. A characteristic example is Phoenix2014T benchmark dataset, which only covers weather forecasts in German Sign Language. To address this shortage of resources, we introduce a newly constructed collection of 29653 Greek Sign Language video-translation pairs which is based on the official syllabus of Greek Elementary School. Our dataset covers a wide range of subjects. We use this novel dataset to train recent state-of-the-art Transformer-based methods widely used in SLT research. Our results demonstrate the potential of our introduced dataset to advance SLT research by offering a favourable balance between usability and real-world value.

Resprompt: Residual Connection Prompting Advances Multi-Step Reasoning in Large Language Models

  • paper_url: http://arxiv.org/abs/2310.04743
  • repo_url: None
  • paper_authors: Song Jiang, Zahra Shakeri, Aaron Chan, Maziar Sanjabi, Hamed Firooz, Yinglong Xia, Bugra Akyildiz, Yizhou Sun, Jinchao Li, Qifan Wang, Asli Celikyilmaz
  • for: 提高大语言模型(LLM)的多步逻辑能力
  • methods: 提出了一种新的提示策略——残差连接提示(RESPROMPT),通过在提示中嵌入线性CoT流程中缺失的连接来重建推理图,从而更好地捕捉复杂的推理结构
  • results: 在六个benchmark上对LLaMA系列模型进行评估,RESPROMPT相对标准CoT基线将推理准确率平均提高12.5%(LLaMA-65B)和6.8%(LLaMA2-70B);在需要至少五个推理步骤的问题上,RESPROMPT相对最佳CoT基线平均提高21.1%(LLaMA-65B)和14.3%(LLaMA2-70B)。
    Abstract Chain-of-thought (CoT) prompting, which offers step-by-step problem-solving rationales, has impressively unlocked the reasoning potential of large language models (LLMs). Yet, the standard CoT is less effective in problems demanding multiple reasoning steps. This limitation arises from the complex reasoning process in multi-step problems: later stages often depend on the results of several steps earlier, not just the results of the immediately preceding step. Such complexities suggest the reasoning process is naturally represented as a graph. The almost linear and straightforward structure of CoT prompting, however, struggles to capture this complex reasoning graph. To address this challenge, we propose Residual Connection Prompting (RESPROMPT), a new prompting strategy that advances multi-step reasoning in LLMs. Our key idea is to reconstruct the reasoning graph within prompts. We achieve this by integrating necessary connections-links present in the reasoning graph but missing in the linear CoT flow-into the prompts. Termed "residual connections", these links are pivotal in morphing the linear CoT structure into a graph representation, effectively capturing the complex reasoning graphs inherent in multi-step problems. We evaluate RESPROMPT on six benchmarks across three diverse domains: math, sequential, and commonsense reasoning. For the open-sourced LLaMA family of models, RESPROMPT yields a significant average reasoning accuracy improvement of 12.5% on LLaMA-65B and 6.8% on LLaMA2-70B. Breakdown analysis further highlights RESPROMPT particularly excels in complex multi-step reasoning: for questions demanding at least five reasoning steps, RESPROMPT outperforms the best CoT based benchmarks by a remarkable average improvement of 21.1% on LLaMA-65B and 14.3% on LLaMA2-70B. Through extensive ablation studies and analyses, we pinpoint how to most effectively build residual connections.
    摘要 思维链(CoT)提示提供了一步一步的问题求解过程,有效地激发了大型语言模型(LLM)的推理潜力。然而,标准CoT在需要多步推理的问题上表现不佳,这是因为多步问题的推理过程较为复杂:后续阶段常常依赖于更早若干步骤的结果,而不仅仅是紧邻的上一步。这种复杂性表明推理过程自然地表示为图。然而,标准CoT近乎线性、直接的结构很难捕捉这种复杂的推理图。为解决这个挑战,我们提出了残差连接提示(RESPROMPT),一种新的提示策略,可以提升LLM的多步推理能力。我们的关键想法是在提示中重建推理图:通过在提示中加入推理图中存在、但在线性CoT流程中缺失的必要连接(称为"残差连接"),将线性CoT结构转化为图表示,从而有效捕捉多步问题中固有的复杂推理图。我们在涵盖数学、顺序和常识推理三个领域的六个benchmark上进行了评估。对于开源的LLaMA系列模型,RESPROMPT带来了显著的平均推理准确率提升(LLaMA-65B上12.5%,LLaMA2-70B上6.8%)。细分分析进一步表明,RESPROMPT在复杂多步推理中尤为突出:对于需要至少五个推理步骤的问题,RESPROMPT比最佳CoT基线平均提升21.1%(LLaMA-65B)和14.3%(LLaMA2-70B)。通过广泛的消融研究和分析,我们确定了如何最有效地构建残差连接。
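A toy illustration of the residual-connection idea follows: each demonstrated reasoning step restates the earlier, possibly non-adjacent steps it depends on. The example problem and template are invented for illustration and are not the paper's prompts.

```python
# Build a demonstration where later steps explicitly reference earlier ones.
steps = [
    ("S1", "The garden is 12 m long.", []),
    ("S2", "The garden is 5 m wide.", []),
    ("S3", "Area = 12 * 5 = 60 square meters.", ["S1", "S2"]),
    ("S4", "Each bag of seed covers 15 square meters.", []),
    ("S5", "Bags needed = 60 / 15 = 4, using the area from S3 and the coverage from S4.",
     ["S3", "S4"]),
]

def build_resprompt(steps):
    lines = []
    for sid, text, deps in steps:
        # The "residual connection": make the dependence on earlier steps explicit.
        link = f" (depends on {', '.join(deps)})" if deps else ""
        lines.append(f"{sid}: {text}{link}")
    return "\n".join(lines)

print(build_resprompt(steps))
```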

Zero-shot Cross-lingual Transfer without Parallel Corpus

  • paper_url: http://arxiv.org/abs/2310.04726
  • repo_url: None
  • paper_authors: Yuyang Zhang, Xiaofeng Han, Baojun Wang
  • for: solves the problem of low-resource language NLP tasks with pre-trained language models.
  • methods: proposes a novel approach of zero-shot cross-lingual transfer using a pre-trained model, with two modules: Bilingual Task Fitting and self-training.
  • results: achieves new state-of-the-art (SOTA) performance on different tasks without relying on parallel corpora or translation models.
    Abstract Recently, although pre-trained language models have achieved great success on multilingual NLP (Natural Language Processing) tasks, the lack of training data on many tasks in low-resource languages still limits their performance. One effective way of solving that problem is to transfer knowledge from rich-resource languages to low-resource languages. However, many previous works on cross-lingual transfer rely heavily on the parallel corpus or translation models, which are often difficult to obtain. We propose a novel approach to conduct zero-shot cross-lingual transfer with a pre-trained model. It consists of a Bilingual Task Fitting module that applies task-related bilingual information alignment; a self-training module generates pseudo soft and hard labels for unlabeled data and utilizes them to conduct self-training. We got the new SOTA on different tasks without any dependencies on the parallel corpus or translation models.
    摘要 最近,预训练语言模型在多语言自然语言处理(NLP)任务上取得了很大成功,但许多任务在低资源语言中缺乏训练数据,这限制了其性能。一种有效的解决方法是将高资源语言的知识迁移到低资源语言。然而,以往许多跨语言迁移工作严重依赖并行语料或翻译模型,而这些资源往往难以获得。我们提出了一种新的零样本跨语言迁移方法:它包括一个双语任务适配模块,通过任务相关的双语信息进行对齐;以及一个自训练模块,为无标注数据生成伪软标签和硬标签并用于自训练。我们在不同任务上取得了新的SOTA结果,且不依赖任何并行语料或翻译模型。
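The self-training module can be illustrated with a short pseudo-labeling sketch: label unlabeled target-language batches with soft and hard pseudo labels and keep only confident examples for the next round. The threshold and stand-in classifier are assumptions; the Bilingual Task Fitting step is not shown.

```python
# Confidence-filtered pseudo labeling for self-training.
import torch
import torch.nn.functional as F

@torch.no_grad()
def pseudo_label(model, target_batches, threshold: float = 0.4):
    selected = []
    model.eval()
    for x in target_batches:
        soft = F.softmax(model(x), dim=-1)        # soft pseudo labels
        conf, hard = soft.max(dim=-1)             # hard pseudo labels + confidence
        keep = conf >= threshold                  # keep only confident predictions
        if keep.any():
            selected.append((x[keep], hard[keep], soft[keep]))
    return selected

# Toy usage with a stand-in classifier over 3 classes.
model = torch.nn.Linear(16, 3)
batches = [torch.randn(8, 16) for _ in range(4)]
pseudo = pseudo_label(model, batches)
print(sum(x.size(0) for x, _, _ in pseudo), "examples selected for the next round")
```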

Integrating Contrastive Learning into a Multitask Transformer Model for Effective Domain Adaptation

  • paper_url: http://arxiv.org/abs/2310.04703
  • repo_url: None
  • paper_authors: Chung-Soo Ahn, Jagath C. Rajapakse, Rajib Rana
  • for: 本研究旨在提高speech emotion recognition(SER)领域的泛化性能。
  • methods: 该研究提出了一种新的领域适应技术,即基于多任务框架、对比学习和信息最大化损失的 transformers 预训练模型细化。
  • results: 实验结果表明,该模型在cross-corpus情况下的SER性能达到了当前最佳水平。
    Abstract While speech emotion recognition (SER) research has made significant progress, achieving generalization across various corpora continues to pose a problem. We propose a novel domain adaptation technique that embodies a multitask framework with SER as the primary task, and contrastive learning and information maximisation loss as auxiliary tasks, underpinned by fine-tuning of transformers pre-trained on large language models. Empirical results obtained through experiments on well-established datasets like IEMOCAP and MSP-IMPROV, illustrate that our proposed model achieves state-of-the-art performance in SER within cross-corpus scenarios.
    摘要 语音情感识别(SER)研究已经取得重要进展,但跨语料库的泛化仍然是一个问题。我们提出了一种新的领域自适应技术:它采用多任务框架,以SER为主要任务,以对比学习和信息最大化损失为辅助任务,并对在大规模语言模型上预训练的transformer进行微调。在IEMOCAP和MSP-IMPROV等成熟数据集上的实验结果显示,我们提出的模型在跨语料库场景中实现了最先进的SER性能。
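A minimal sketch of a combined objective of this shape (classification + supervised contrastive + information maximisation) is given below. The loss weights and exact formulations are illustrative; the paper defines its own versions of each term.

```python
# Multitask loss: cross-entropy + supervised contrastive + information maximisation.
import torch
import torch.nn.functional as F

def supervised_contrastive(z, labels, temp=0.1):
    n = z.size(0)
    z = F.normalize(z, dim=-1)
    sim = (z @ z.t() / temp)
    eye = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(eye, -1e9)                               # exclude self-pairs
    same = (labels[:, None] == labels[None, :]).float().masked_fill(eye, 0.0)
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    denom = same.sum(1).clamp(min=1.0)
    return -(same * log_prob).sum(1).div(denom).mean()

def info_max(logits):
    p = F.softmax(logits, dim=-1)
    cond_ent = -(p * torch.log(p + 1e-8)).sum(1).mean()            # confident per-sample predictions
    marginal = p.mean(0)
    marg_ent = -(marginal * torch.log(marginal + 1e-8)).sum()      # diverse predictions overall
    return cond_ent - marg_ent

def multitask_loss(logits, z, labels, w_con=0.5, w_im=0.1):
    return (F.cross_entropy(logits, labels)
            + w_con * supervised_contrastive(z, labels)
            + w_im * info_max(logits))

logits, z = torch.randn(16, 4), torch.randn(16, 32)
labels = torch.randint(0, 4, (16,))
print(multitask_loss(logits, z, labels).item())
```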

EMO: Earth Mover Distance Optimization for Auto-Regressive Language Modeling

  • paper_url: http://arxiv.org/abs/2310.04691
  • repo_url: https://github.com/drsy/emo
  • paper_authors: Siyu Ren, Zhiyong Wu, Kenny Q. Zhu
  • for: 提高语言模型的表现和可靠性
  • methods: 使用地球运动距离优化(EMO)方法,利用地球运动距离的特性来解决最大似然估计(Maximum Likelihood Estimation, MLE)中的几种缺陷
  • results: 在各个领域中,使用EMO法训练语言模型,可以达到MLE法的同等或更高的表现水平,并且只需要微调25,000句语言数据就可以获得显著的提高。
    Abstract Neural language models are probabilistic models of human text. They are predominantly trained using maximum likelihood estimation (MLE), which is equivalent to minimizing the forward cross-entropy between the empirical data distribution and the model distribution. However, various degeneration phenomena are still widely observed when decoding from the distributions learned by such models. We establish that the forward cross-entropy is suboptimal as a distance metric for aligning human and model distribution due to its (1) recall-prioritization (2) negative diversity ignorance and (3) train-test mismatch. In this paper, we propose Earth Mover Distance Optimization (EMO) for auto-regressive language modeling. EMO capitalizes on the inherent properties of earth mover distance to address the aforementioned challenges. Due to the high complexity of direct computation, we further introduce a feasible upper bound for EMO to ease end-to-end training. Upon extensive evaluation of language models trained using EMO and MLE. We find that EMO demonstrates a consistently better language modeling performance than MLE across domains. Moreover, EMO demonstrates noteworthy enhancements in downstream performance with minimal fine-tuning on merely 25,000 sentences. This highlights the tremendous potential of EMO as a lightweight calibration method for enhancing large-scale pre-trained language models.
    摘要 神经语言模型是人类文本的概率模型。它们主要通过最大似然估计(MLE)进行训练,这等价于最小化实际数据分布和模型分布之间的前向交叉熵。然而,在从此类模型学到的分布中解码时,仍然广泛观察到各种退化现象。我们证明前向交叉熵并非对齐人类分布与模型分布的理想距离度量,因为它存在(1)召回优先、(2)忽视负向多样性和(3)训练与测试不一致等问题。在这篇论文中,我们提出了地球移动距离优化(EMO)方法,用于自回归语言建模。EMO利用地球移动距离的固有性质来解决上述挑战。由于直接计算的复杂度较高,我们进一步提出了EMO的一个可行上界,以便进行端到端训练。对使用EMO和MLE训练的语言模型进行广泛评估后,我们发现EMO在各个领域中的语言建模表现始终优于MLE。此外,仅在25,000个句子上微调后,EMO在下游任务中就表现出显著的提升。这显示了EMO作为提升大规模预训练语言模型的轻量级校准方法的巨大潜力。
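For intuition, when the target distribution is a point mass on the gold token, the earth mover cost reduces to the expected embedding distance from predicted tokens to the gold token. The sketch below implements only this simplified case; EMO's actual objective and its feasible upper bound are defined in the paper.

```python
# Earth-mover-style training signal against a one-hot target (simplified illustration).
import torch
import torch.nn.functional as F

def emd_to_gold(logits, gold_ids, token_embeddings):
    """logits: (batch, vocab); gold_ids: (batch,); token_embeddings: (vocab, dim)."""
    probs = F.softmax(logits, dim=-1)
    emb = F.normalize(token_embeddings, dim=-1)
    gold_emb = emb[gold_ids]                        # (batch, dim)
    # Cost of moving mass from every vocabulary item to the gold token:
    cost = 1.0 - emb @ gold_emb.t()                 # (vocab, batch), cosine distance
    return (probs * cost.t()).sum(dim=-1).mean()    # expected transport cost

vocab, dim = 100, 16
logits = torch.randn(8, vocab)
gold = torch.randint(0, vocab, (8,))
embeddings = torch.randn(vocab, dim)
print(emd_to_gold(logits, gold, embeddings).item())
```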

DORIS-MAE: Scientific Document Retrieval using Multi-level Aspect-based Queries

  • paper_url: http://arxiv.org/abs/2310.04678
  • repo_url: https://github.com/real-doris-mae/doris-mae-dataset
  • paper_authors: Jianyou Wang, Kaicheng Wang, Xiaoyue Wang, Prudhviraj Naidu, Leon Bergen, Ramamohan Paturi
  • for: 本研究的目的是提出一个新的任务,即科学文摘检索使用多级方面based queries (DORIS-MAE),以解决科学研究中的复杂查询问题。
  • methods: 本研究使用了100个人工编写的复杂查询案例,并为每个查询案例收集了100篇相关文献并标注了用于排序的相关性分数。此外,我们还提出了一种可扩展的框架,即Anno-GPT,用于验证大型语言模型(LLM)在专家级数据集标注任务中的性能。
  • results: 我们对17种 latest retrieval method 进行了评估,发现其性能与传统数据集相比明显下降。这 highlights 了需要更好地处理科学研究中的复杂、多方面查询问题。
    Abstract In scientific research, the ability to effectively retrieve relevant documents based on complex, multifaceted queries is critical. Existing evaluation datasets for this task are limited, primarily due to the high cost and effort required to annotate resources that effectively represent complex queries. To address this, we propose a novel task, Scientific DOcument Retrieval using Multi-level Aspect-based quEries (DORIS-MAE), which is designed to handle the complex nature of user queries in scientific research. We developed a benchmark dataset within the field of computer science, consisting of 100 human-authored complex query cases. For each complex query, we assembled a collection of 100 relevant documents and produced annotated relevance scores for ranking them. Recognizing the significant labor of expert annotation, we also introduce Anno-GPT, a scalable framework for validating the performance of Large Language Models (LLMs) on expert-level dataset annotation tasks. LLM annotation of the DORIS-MAE dataset resulted in a 500x reduction in cost, without compromising quality. Furthermore, due to the multi-tiered structure of these complex queries, the DORIS-MAE dataset can be extended to over 4,000 sub-query test cases without requiring additional annotation. We evaluated 17 recent retrieval methods on DORIS-MAE, observing notable performance drops compared to traditional datasets. This highlights the need for better approaches to handle complex, multifaceted queries in scientific research. Our dataset and codebase are available at https://github.com/Real-Doris-Mae/Doris-Mae-Dataset.
    摘要 在科学研究中,能够基于复杂、多方面的查询有效地检索相关文献至关重要。现有的评估数据集对这个任务支持有限,主要是因为标注能够有效表示复杂查询的资源成本和工作量高昂。为解决这个问题,我们提出了一个新的任务:Scientific DOcument Retrieval using Multi-level Aspect-based quEries (DORIS-MAE),用于处理科学研究中用户查询的复杂性。我们在计算机科学领域内创建了一个包含100个人工编写的复杂查询案例的 benchmark 数据集。每个复杂查询案例都配有100篇相关文献,并对其标注了用于排序的相关性分数。考虑到专家标注的巨大工作量,我们还提出了 Anno-GPT,一个可扩展的框架,用于验证大型语言模型(LLM)在专家级数据集标注任务中的性能。用 LLM 标注 DORIS-MAE 数据集可实现500倍的成本降低,而不损失质量。此外,由于这些复杂查询的多层结构,DORIS-MAE 数据集可以扩展到超过4,000个子查询测试案例,而无需额外标注。我们对17个最新的检索方法进行了评估,发现它们在 DORIS-MAE 数据集上的性能明显下降,这显示了科学研究中的复杂、多方面查询需要更好的解决方案。我们的数据集和代码库可在 https://github.com/Real-Doris-Mae/Doris-Mae-Dataset 获取。

cs.LG - 2023-10-07

Transferable Deep Clustering Model

  • paper_url: http://arxiv.org/abs/2310.04946
  • repo_url: None
  • paper_authors: Zheng Zhang, Liang Zhao
  • for: 本研究旨在提出一种可传递的深度划分模型,以便根据源领域中获得的知识自动调整目标领域中的划分结果。
  • methods: 我们提出了一种新的注意力模块,该模块可以自动调整划分中心点,根据样本与划分中心点之间的关系。此外,我们还证明了我们的模型比一些经典的划分算法,如k-means或GMM更有力。
  • results: 我们在实验中对真实 dataset 进行了测试,结果表明我们的提出的传输学习框架可以显著提高目标领域中的性能,同时降低计算成本。
    Abstract Deep learning has shown remarkable success in the field of clustering recently. However, how to transfer a trained clustering model on a source domain to a target domain by leveraging the acquired knowledge to guide the clustering process remains challenging. Existing deep clustering methods often lack generalizability to new domains because they typically learn a group of fixed cluster centroids, which may not be optimal for the new domain distributions. In this paper, we propose a novel transferable deep clustering model that can automatically adapt the cluster centroids according to the distribution of data samples. Rather than learning a fixed set of centroids, our approach introduces a novel attention-based module that can adapt the centroids by measuring their relationship with samples. In addition, we theoretically show that our model is strictly more powerful than some classical clustering algorithms such as k-means or Gaussian Mixture Model (GMM). Experimental results on both synthetic and real-world datasets demonstrate the effectiveness and efficiency of our proposed transfer learning framework, which significantly improves the performance on target domain and reduces the computational cost.
    摘要 深度学习在 clustering 领域最近显示出惊人的成功。然而,如何通过获得的知识来导引 clustering 过程中的转移还是一个挑战。现有的深度 clustering 方法通常缺乏对新领域的泛化能力,因为它们通常学习一组固定的集群中心点,这些中心点可能不适合新领域的数据分布。在这篇文章中,我们提出了一种新的可传递深度 clustering 模型,可以自动调整集群中心点根据数据样本的分布。而不是学习固定的集群中心点,我们的方法引入了一种新的注意力基于模块,可以通过测量中心点和样本之间的关系来调整中心点。此外,我们也证明了我们的模型比一些经典的 clustering 算法,如 k-means 或 Gaussian Mixture Model (GMM) 更加强大。实验结果表明,我们的提出的转移学习框架在目标领域中显著提高性能,同时降低计算成本。
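A minimal sketch of the centroid-adaptation idea (centroids attend over the incoming batch and shift toward related samples instead of staying fixed after source-domain training) follows. Dimensions, the blending weight, and the attention layout are illustrative assumptions, not the paper's module.

```python
# Attention-based centroid adaptation for transferable clustering.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CentroidAdapter(nn.Module):
    def __init__(self, n_clusters: int, dim: int):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(n_clusters, dim))
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention weights between each centroid (query) and each sample (key).
        attn = F.softmax(self.query(self.centroids) @ self.key(x).t() / x.size(1) ** 0.5, dim=-1)
        adapted = attn @ x                              # data-dependent centroid estimates
        return 0.5 * self.centroids + 0.5 * adapted     # blend with the learned centroids

    def assign(self, x: torch.Tensor) -> torch.Tensor:
        centers = self.forward(x)
        return torch.cdist(x, centers).argmin(dim=1)    # cluster assignment with adapted centroids

module = CentroidAdapter(n_clusters=4, dim=8)
batch = torch.randn(32, 8)
print(module.assign(batch)[:10])
```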

Beyond Text: A Deep Dive into Large Language Models’ Ability on Understanding Graph Data

  • paper_url: http://arxiv.org/abs/2310.04944
  • repo_url: None
  • paper_authors: Yuntong Hu, Zheng Zhang, Liang Zhao
  • for: 本研究旨在评估大语言模型(LLM)在不同的图数据预测任务中的表现,以及是否可以利用图结构提高性能。
  • methods: 本研究使用了多种示例和任务/数据集选择方式,对 LLM 的表现进行了分析和比较,以评估它们是否可以正确地理解和利用图结构。
  • results: 研究发现 LLM 在图数据预测任务中的表现有限,特别是在图结构更复杂的任务中。然而, LLM 仍然可以在某些任务中提供高性能,特别是在使用特定的示例和任务/数据集时。这些发现可以帮助我们更好地理解 LLM 在图分析中的能力和局限性。
    Abstract Large language models (LLMs) have achieved impressive performance on many natural language processing tasks. However, their capabilities on graph-structured data remain relatively unexplored. In this paper, we conduct a series of experiments benchmarking leading LLMs on diverse graph prediction tasks spanning node, edge, and graph levels. We aim to assess whether LLMs can effectively process graph data and leverage topological structures to enhance performance, compared to specialized graph neural networks. Through varied prompt formatting and task/dataset selection, we analyze how well LLMs can interpret and utilize graph structures. By comparing LLMs' performance with specialized graph models, we offer insights into the strengths and limitations of employing LLMs for graph analytics. Our findings provide insights into LLMs' capabilities and suggest avenues for further exploration in applying them to graph analytics.
    摘要 大型语言模型(LLMs)已经在许多自然语言处理任务中表现出色。然而,它们处理图结构数据的能力仍然相对缺乏探索。在这篇论文中,我们进行了一系列实验,在涵盖节点、边和图层级的多种图预测任务上对领先的 LLMs 进行基准测试。我们想要评估 LLMs 能否有效地处理图数据,并利用拓扑结构来提高性能,与专门的图神经网络相比如何。通过不同的提示格式和任务/数据集选择,我们分析了 LLMs 理解和利用图结构的能力。通过与专门的图模型比较 LLMs 的表现,我们揭示了使用 LLMs 进行图分析的优势和局限性。我们的发现有助于理解 LLMs 的能力,并为将其应用于图分析提供了进一步探索的方向。

Large Language Models for Spatial Trajectory Patterns Mining

  • paper_url: http://arxiv.org/abs/2310.04942
  • repo_url: None
  • paper_authors: Zheng Zhang, Hossein Amiri, Zhenke Liu, Andreas Züfle, Liang Zhao
  • for: 这 paper 用于评估大型自然语言模型 (LLMs) 是否可以检测人类空间轨迹异常行为。
  • methods: 这 paper 使用了 GPT-4 和 Claude-2 等 LLMs,并对它们进行了比较,以评估它们在检测异常行为方面的表现。
  • results: 研究发现,LLMs 可以达到一定的异常检测性能,而不需要特定的cue。此外,在提供 contextual clues 的情况下,LLMs 的预测效果可以进一步提高。此外,LLMs 还可以提供可读的解释,从而提高了透明度。
    Abstract Identifying anomalous human spatial trajectory patterns can indicate dynamic changes in mobility behavior with applications in domains like infectious disease monitoring and elderly care. Recent advancements in large language models (LLMs) have demonstrated their ability to reason in a manner akin to humans. This presents significant potential for analyzing temporal patterns in human mobility. In this paper, we conduct empirical studies to assess the capabilities of leading LLMs like GPT-4 and Claude-2 in detecting anomalous behaviors from mobility data, by comparing to specialized methods. Our key findings demonstrate that LLMs can attain reasonable anomaly detection performance even without any specific cues. In addition, providing contextual clues about potential irregularities could further enhances their prediction efficacy. Moreover, LLMs can provide reasonable explanations for their judgments, thereby improving transparency. Our work provides insights on the strengths and limitations of LLMs for human spatial trajectory analysis.
    摘要 检测人类空间轨迹的异常模式可以反映移动行为的动态变化,在传染病监测和老年人护理等领域具有应用价值。大型语言模型(LLMs)的最新进展表明它们可以以类似人类的方式进行推理,这为分析人类移动中的时间模式带来了巨大潜力。在这篇论文中,我们进行了实证研究,评估 GPT-4 和 Claude-2 等主流 LLMs 从移动数据中检测异常行为的能力,并与专门方法进行比较。我们的主要发现表明,LLMs 即使没有任何特定提示也能达到合理的异常检测性能。此外,提供关于潜在异常的上下文线索可以进一步提升其预测效果。而且,LLMs 能为其判断给出合理的解释,从而提高透明度。我们的工作为 LLMs 在人类空间轨迹分析中的优势和局限性提供了见解。

Statistical Guarantees for Variational Autoencoders using PAC-Bayesian Theory

  • paper_url: http://arxiv.org/abs/2310.04935
  • repo_url: https://github.com/diarra2339/pac-bayes-vae
  • paper_authors: Sokhna Diarra Mbacke, Florence Clerc, Pascal Germain
  • for: 这篇论文是为了提供关于Variational Autoencoders(VAEs)的理论保证。
  • methods: 这篇论文使用了PAC-Bayesian理论来提供关于VAEs的统计保证。
  • results: 这篇论文提供了对VAEs的重建损失的泛化保证,以及输入和生成模型之间的距离的Upper bound。
    Abstract Since their inception, Variational Autoencoders (VAEs) have become central in machine learning. Despite their widespread use, numerous questions regarding their theoretical properties remain open. Using PAC-Bayesian theory, this work develops statistical guarantees for VAEs. First, we derive the first PAC-Bayesian bound for posterior distributions conditioned on individual samples from the data-generating distribution. Then, we utilize this result to develop generalization guarantees for the VAE's reconstruction loss, as well as upper bounds on the distance between the input and the regenerated distributions. More importantly, we provide upper bounds on the Wasserstein distance between the input distribution and the distribution defined by the VAE's generative model.
    摘要 自诞生以来,变分自编码器(VAEs)已成为机器学习的核心模型。尽管被广泛使用,但有关其理论性质的许多问题仍未解决。这项工作利用PAC-Bayesian理论为VAEs提供统计保证。首先,我们推导出首个针对以数据生成分布中单个样本为条件的后验分布的PAC-Bayesian界。然后,我们利用这一结果,为VAE的重建损失给出泛化保证,并给出输入分布与再生成分布之间距离的上界。更重要的是,我们给出了输入分布与VAE生成模型所定义分布之间Wasserstein距离的上界。
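For orientation, here is a schematic of the classical McAllester-style PAC-Bayesian bound that results of this kind build on; the paper's statements (conditioning on individual samples, reconstruction-loss guarantees, Wasserstein bounds) are more specific than this generic form.

```latex
% Classical PAC-Bayes form (schematic), not the paper's VAE-specific bounds.
\[
\mathbb{E}_{h \sim Q}\big[R(h)\big]
\;\le\;
\mathbb{E}_{h \sim Q}\big[\hat{R}_n(h)\big]
+ \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}},
\qquad \text{with probability at least } 1-\delta,
\]
% where $P$ is a prior fixed before seeing the $n$ samples, $Q$ is any posterior,
% $R$ is the true risk and $\hat{R}_n$ the empirical risk.
```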

Crystal-GFN: sampling crystals with desirable properties and constraints

  • paper_url: http://arxiv.org/abs/2310.04925
  • repo_url: https://github.com/alexhernandezgarcia/gflownet
  • paper_authors: Mila AI4Science, Alex Hernandez-Garcia, Alexandre Duval, Alexandra Volokhova, Yoshua Bengio, Divya Sharma, Pierre Luc Carrier, Michał Koziarski, Victor Schmidt
  • for: 加速材料发现可以帮助减轻气候危机。发现新的固体晶体,如电催化剂、离子导体或光伏材料,可以提高可再生能源生产和储存的效率。
  • methods: 本文引入了Crystal-GFlowNet,一种晶体结构的生成模型。该模型顺序采样晶体的化学组成、空间群和晶格参数;这种领域启发的设计可以灵活地引入物理和几何约束,并可将任何可用的性质预测模型用作目标函数。
  • results: 以在MatBench上训练的新代理模型预测的形成能作为目标,Crystal-GFlowNet能够采样出形成能较低且多样的晶体结构。
    Abstract Accelerating material discovery holds the potential to greatly help mitigate the climate crisis. Discovering new solid-state crystals such as electrocatalysts, ionic conductors or photovoltaics can have a crucial impact, for instance, in improving the efficiency of renewable energy production and storage. In this paper, we introduce Crystal-GFlowNet, a generative model of crystal structures that sequentially samples a crystal's composition, space group and lattice parameters. This domain-inspired approach enables the flexible incorporation of physical and geometrical constraints, as well as the use of any available predictive model of a desired property as an objective function. We evaluate the capabilities of Crystal-GFlowNet by using as objective the formation energy of a crystal structure, as predicted by a new proxy model trained on MatBench. The results demonstrate that Crystal-GFlowNet is able to sample diverse crystals with low formation energy.
    摘要 加速材料发现有望帮助减轻气候危机。发现新的固体晶体,如电催化剂、离子导体或光伏材料,可以产生重要影响,例如提高可再生能源生产和储存的效率。在这篇论文中,我们介绍了Crystal-GFlowNet,一种晶体结构的生成模型,可以顺序采样晶体的化学组成、空间群和晶格参数。这种领域启发的方法允许灵活地引入物理和几何约束,并可将任何可用的性质预测模型用作目标函数。我们以一个在MatBench上训练的新代理模型所预测的形成能为目标,评估Crystal-GFlowNet的能力。结果表明,Crystal-GFlowNet能够采样出形成能较低的多样晶体。

The Conditional Prediction Function: A Novel Technique to Control False Discovery Rate for Complex Models

  • paper_url: http://arxiv.org/abs/2310.04919
  • repo_url: None
  • paper_authors: Yushu Shi, Michael Martens
  • for: 这个研究的目的是确定哪些变量与结果相关,并在大量可能的预测器中进行选择,以控制false discovery rate(FDR)。
  • methods: 这个研究使用knockoff检查来进行变量选择,并使用conditional prediction function(CPF)来评估预测模型与结果的关系。
  • results: 该研究显示,CPF statistics 能在连续、分类和生存结局上提供比常用knockoff统计量更高的检验功效,并能在考虑特征间相关性的同时捕捉预测变量与结局之间的非线性关系,从而帮助选出真正具有预后意义的变量。
    Abstract In modern scientific research, the objective is often to identify which variables are associated with an outcome among a large class of potential predictors. This goal can be achieved by selecting variables in a manner that controls the the false discovery rate (FDR), the proportion of irrelevant predictors among the selections. Knockoff filtering is a cutting-edge approach to variable selection that provides FDR control. Existing knockoff statistics frequently employ linear models to assess relationships between features and the response, but the linearity assumption is often violated in real world applications. This may result in poor power to detect truly prognostic variables. We introduce a knockoff statistic based on the conditional prediction function (CPF), which can pair with state-of-art machine learning predictive models, such as deep neural networks. The CPF statistics can capture the nonlinear relationships between predictors and outcomes while also accounting for correlation between features. We illustrate the capability of the CPF statistics to provide superior power over common knockoff statistics with continuous, categorical, and survival outcomes using repeated simulations. Knockoff filtering with the CPF statistics is demonstrated using (1) a residential building dataset to select predictors for the actual sales prices and (2) the TCGA dataset to select genes that are correlated with disease staging in lung cancer patients.
    摘要 在现代科学研究中,目标通常是从大量潜在预测变量中确定哪些变量与结果相关。这一目标可以通过在选择变量时控制错误发现率(FDR,即所选变量中无关预测变量的比例)来实现。Knockoff filtering 是一种先进的、能提供FDR控制的变量选择方法。现有的knockoff统计量常使用线性模型来评估特征与响应之间的关系,但在实际应用中线性假设经常被违背,这可能削弱检测真正具有预后意义变量的能力。我们提出了基于条件预测函数(CPF)的knockoff统计量,它可以与最先进的机器学习预测模型(如深度神经网络)配合使用。CPF统计量可以捕捉预测变量与结果之间的非线性关系,同时考虑特征之间的相关性。我们通过重复模拟实验表明,CPF统计量在连续、分类和生存结局上都比常用的knockoff统计量具有更高的检验功效。我们还在(1)一个住宅建筑数据集上选择实际销售价格的预测变量,以及(2)TCGA数据集上选择与肺癌分期相关的基因,演示了基于CPF统计量的knockoff filtering。
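Whatever statistic is used (including a CPF-based one), the final selection step is the standard knockoff+ thresholding sketched below. Constructing valid knockoff copies and the CPF statistic itself are the substantive parts and are not shown; the statistics W here are synthetic.

```python
# Knockoff+ selection at target FDR q from per-feature statistics
# W_j = importance(X_j) - importance(X_tilde_j).
import numpy as np

def knockoff_select(W: np.ndarray, q: float = 0.1) -> np.ndarray:
    """Return indices of features whose statistic exceeds the knockoff+ threshold."""
    ts = np.sort(np.abs(W[W != 0]))
    for t in ts:
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))   # estimated FDP at threshold t
        if fdp_hat <= q:
            return np.where(W >= t)[0]
    return np.array([], dtype=int)

# Toy usage: truly relevant features tend to get large positive W,
# irrelevant ones are roughly symmetric around zero.
rng = np.random.default_rng(0)
W = np.concatenate([rng.normal(3, 1, 10), rng.normal(0, 1, 90)])
print(knockoff_select(W, q=0.1))
```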

Tight Certified Robustness via Min-Max Representations of ReLU Neural Networks

  • paper_url: http://arxiv.org/abs/2310.04916
  • repo_url: None
  • paper_authors: Brendon G. Anderson, Samuel Pfrommer, Somayeh Sojoudi
  • for: 神经网络在控制系统中的可靠部署需要严格的鲁棒性保证,本研究旨在提供这类保证。
  • methods: 本文针对ReLU神经网络的min-max表示,将非凸认证问题凸化重构:把问题提升为概率测度空间上的无穷维优化,并利用分布鲁棒优化的最新结果求解最优离散分布。
  • results: 本文获得了紧致的鲁棒性证书,并可精确求解针对模型的最优(最坏情况)攻击。在鲁棒控制和MNIST图像分类上的实验显示了该方法的有效性。
    Abstract The reliable deployment of neural networks in control systems requires rigorous robustness guarantees. In this paper, we obtain tight robustness certificates over convex attack sets for min-max representations of ReLU neural networks by developing a convex reformulation of the nonconvex certification problem. This is done by "lifting" the problem to an infinite-dimensional optimization over probability measures, leveraging recent results in distributionally robust optimization to solve for an optimal discrete distribution, and proving that solutions of the original nonconvex problem are generated by the discrete distribution under mild boundedness, nonredundancy, and Slater conditions. As a consequence, optimal (worst-case) attacks against the model may be solved for exactly. This contrasts prior state-of-the-art that either requires expensive branch-and-bound schemes or loose relaxation techniques. Experiments on robust control and MNIST image classification examples highlight the benefits of our approach.
    摘要 要在控制系统中可靠地部署神经网络,需要严格的鲁棒性保证。在这篇论文中,我们通过对非凸认证问题进行凸化重构,为ReLU神经网络的min-max表示获得了针对凸攻击集的紧致鲁棒性证书。具体做法是将问题"提升"为概率测度空间上的无穷维优化,利用分布鲁棒优化的最新结果求解最优离散分布,并证明在温和的有界性、非冗余性和Slater条件下,原始非凸问题的解可由该离散分布生成。因此,针对模型的最优(最坏情况)攻击可以被精确求解。这与此前的最先进方法形成对比:后者要么需要昂贵的分支定界方法,要么依赖较宽松的松弛技术。在鲁棒控制和MNIST图像分类上的实验突显了我们方法的优势。

A Dual Latent State Learning Approach: Exploiting Regional Network Similarities for QoS Prediction

  • paper_url: http://arxiv.org/abs/2310.05988
  • repo_url: None
  • paper_authors: Ziliang Wang, Xiaohong Zhang, Meng Yan
  • for: 本文是为了提高服务质量(QoS)预测的精度,而设计的一种深度学习框架。
  • methods: 本文的方法包括构建两个区域网络隐状态(城市网络隐状态和AS网络隐状态),并使用改进的Huber损失函数来缓解数据稀疏和标签不平衡问题。
  • results: 实验表明,本文的方法在真实的QoS数据集上表现出色,超过了现有的最先进方法。
    Abstract Individual objects, whether users or services, within a specific region often exhibit similar network states due to their shared origin from the same city or autonomous system (AS). Despite this regional network similarity, many existing techniques overlook its potential, resulting in subpar performance arising from challenges such as data sparsity and label imbalance. In this paper, we introduce the regional-based dual latent state learning network(R2SL), a novel deep learning framework designed to overcome the pitfalls of traditional individual object-based prediction techniques in Quality of Service (QoS) prediction. Unlike its predecessors, R2SL captures the nuances of regional network behavior by deriving two distinct regional network latent states: the city-network latent state and the AS-network latent state. These states are constructed utilizing aggregated data from common regions rather than individual object data. Furthermore, R2SL adopts an enhanced Huber loss function that adjusts its linear loss component, providing a remedy for prevalent label imbalance issues. To cap off the prediction process, a multi-scale perception network is leveraged to interpret the integrated feature map, a fusion of regional network latent features and other pertinent information, ultimately accomplishing the QoS prediction. Through rigorous testing on real-world QoS datasets, R2SL demonstrates superior performance compared to prevailing state-of-the-art methods. Our R2SL approach ushers in an innovative avenue for precise QoS predictions by fully harnessing the regional network similarities inherent in objects.
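A minimal sketch of a Huber-style regression loss with an adjustable linear branch, in the spirit of the "enhanced Huber loss" the abstract mentions for mitigating label imbalance, is shown below. The exact adjustment R2SL applies is defined in the paper; the slope parameter here is an illustrative stand-in.

```python
# Huber loss with a tunable linear branch, kept continuous at |error| = delta.
import torch

def adjustable_huber(pred, target, delta: float = 1.0, linear_scale: float = 1.0):
    err = torch.abs(pred - target)
    quadratic = 0.5 * err ** 2
    # Linear branch: matches the quadratic branch at err == delta; linear_scale tunes its slope.
    linear = 0.5 * delta ** 2 + linear_scale * delta * (err - delta)
    return torch.where(err <= delta, quadratic, linear).mean()

# Toy usage on skewed QoS-like targets (most values small, a few large outliers).
torch.manual_seed(0)
target = torch.cat([torch.rand(95), 10 * torch.rand(5)])
pred = torch.zeros_like(target)
print(adjustable_huber(pred, target, delta=1.0, linear_scale=2.0).item())
```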

Regret Analysis of Repeated Delegated Choice

  • paper_url: http://arxiv.org/abs/2310.04884
  • repo_url: None
  • paper_authors: MohammadTaghi Hajiaghayi, Mohammad Mahdavi, Keivan Rezaei, Suho Shin
  • for: 本研究考察一个重复委托选择问题,这是首个研究 Kleinberg 和 Kleinberg(EC'18)模型的在线学习变体的工作。在该模型中,委托人与一位拥有外生解集的代理人反复互动,以寻找高效的解决方案。每个解决方案为委托人和代理人带来不同的效用,代理人可能出于自利提议最大化自身效用的解决方案。为了缓解这种行为,委托人公布一个合格集合,以筛除部分解决方案。但委托人事先没有任何关于解分布的信息,因此会在各轮中动态公布不同的合格集合,以高效地学习该分布。委托人的目标是相对于事后最优的合格集合最小化累计遗憾(regret)。
  • methods: 我们从两个维度探讨问题设置:代理人是短视(myopic)决策还是跨轮进行策略性行为,以及解决方案的效用是确定性的还是随机的。
  • results: 我们的分析刻画了委托人能够实现次线性遗憾的若干机制(regime),从而阐明重复委托过程在不同设置下的成败条件。
    Abstract We present a study on a repeated delegated choice problem, which is the first to consider an online learning variant of Kleinberg and Kleinberg, EC'18. In this model, a principal interacts repeatedly with an agent who possesses an exogenous set of solutions to search for efficient ones. Each solution can yield varying utility for both the principal and the agent, and the agent may propose a solution to maximize its own utility in a selfish manner. To mitigate this behavior, the principal announces an eligible set which screens out a certain set of solutions. The principal, however, does not have any information on the distribution of solutions in advance. Therefore, the principal dynamically announces various eligible sets to efficiently learn the distribution. The principal's objective is to minimize cumulative regret compared to the optimal eligible set in hindsight. We explore two dimensions of the problem setup, whether the agent behaves myopically or strategizes across the rounds, and whether the solutions yield deterministic or stochastic utility. Our analysis mainly characterizes some regimes under which the principal can recover the sublinear regret, thereby shedding light on the rise and fall of the repeated delegation procedure in various regimes.
    摘要 我们研究了一个重复委托选择问题,这是首个考虑 Kleinberg 和 Kleinberg(EC'18)模型的在线学习变体的工作。在该模型中,委托人与一位拥有外生解集的代理人反复互动,以寻找高效的解决方案。每个解决方案可以为委托人和代理人带来不同的效用,而代理人可能出于自利提议最大化自身效用的解决方案。为了缓解这种行为,委托人公布一个合格集合,从而筛除一部分解决方案。然而,委托人事先没有任何关于解分布的信息,因此会在各轮中动态公布不同的合格集合,以高效地学习该分布。委托人的目标是相对于事后最优的合格集合最小化累计遗憾。我们从两个维度探讨问题设置:代理人是短视决策还是跨轮进行策略性行为,以及解决方案的效用是确定性的还是随机的。我们的分析主要刻画了委托人能够实现次线性遗憾的若干机制,从而阐明重复委托过程在不同设置下的成败条件。

Randomized Sparse Neural Galerkin Schemes for Solving Evolution Equations with Deep Networks

  • paper_url: http://arxiv.org/abs/2310.04867
  • repo_url: https://github.com/julesberman/rsng
  • paper_authors: Jules Berman, Benjamin Peherstorfer
  • for: 这项研究的目的是为了解决时间依赖的偏微分方程的解场的近似问题,以保持 causality 和其他物理性质。
  • methods: 该研究使用了神经网络随时序顺序训练,以 aproximate 时间依赖的偏微分方程的解场。在训练过程中,采用了随机杂化 sparse 网络参数的更新方法,以避免当地过拟合和error 快速积累。
  • results: 实验表明,提案的方法在各种时间演化方程中比较精度和效率,可以在固定计算预算下提高精度至少两个数量级,并在固定精度下提高速度至少两个数量级。
    Abstract Training neural networks sequentially in time to approximate solution fields of time-dependent partial differential equations can be beneficial for preserving causality and other physics properties; however, the sequential-in-time training is numerically challenging because training errors quickly accumulate and amplify over time. This work introduces Neural Galerkin schemes that update randomized sparse subsets of network parameters at each time step. The randomization avoids overfitting locally in time and so helps prevent the error from accumulating quickly over the sequential-in-time training, which is motivated by dropout that addresses a similar issue of overfitting due to neuron co-adaptation. The sparsity of the update reduces the computational costs of training without losing expressiveness because many of the network parameters are redundant locally at each time step. In numerical experiments with a wide range of evolution equations, the proposed scheme with randomized sparse updates is up to two orders of magnitude more accurate at a fixed computational budget and up to two orders of magnitude faster at a fixed accuracy than schemes with dense updates.
    摘要 按时间顺序训练神经网络来逼近时间依赖偏微分方程的解场,有助于保持因果性等物理性质;然而,这种按时间顺序的训练在数值上颇具挑战,因为训练误差会随时间快速累积和放大。这项工作提出了神经Galerkin方案,在每个时间步仅更新随机选取的稀疏网络参数子集。随机化可以避免在时间上的局部过拟合,从而防止误差在按时间顺序的训练中快速累积,这与Dropout应对神经元协同适应导致的过拟合的思路类似。稀疏更新降低了训练的计算成本,同时不损失表达能力,因为在每个时间步许多网络参数在局部是冗余的。在多种演化方程的数值实验中,所提出的随机稀疏更新方案与稠密更新方案相比,在固定计算预算下精度最高可提升约两个数量级,在固定精度下速度最高可提升约两个数量级。
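The randomized sparse update can be illustrated with a short sketch: at every time step a fresh random subset of parameters is selected and only those entries are updated against a residual. The residual below is a generic least-squares stand-in, not the Neural Galerkin residual or time integrator from the paper.

```python
# Update only a random sparse subset of parameters at each time step.
import torch

def sparse_galerkin_step(params, residual_fn, keep_frac=0.1, lr=1e-2):
    flat = torch.cat([p.reshape(-1) for p in params])
    mask = torch.rand_like(flat) < keep_frac           # fresh random sparse subset each step
    loss = residual_fn(params)
    grads = torch.autograd.grad(loss, params)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    flat = flat - lr * mask * flat_grad                # update only the selected entries
    # Write the updated values back into per-tensor shapes.
    out, i = [], 0
    for p in params:
        n = p.numel()
        out.append(flat[i:i + n].reshape(p.shape))
        i += n
    return out

# Toy usage: two parameter tensors and a quadratic "residual".
params = [torch.randn(10, requires_grad=True), torch.randn(5, requires_grad=True)]
residual = lambda ps: sum((p ** 2).sum() for p in ps)
for _ in range(3):
    params = [p.detach().requires_grad_(True) for p in sparse_galerkin_step(params, residual)]
print([p.norm().item() for p in params])
```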

Universal Graph Random Features

  • paper_url: http://arxiv.org/abs/2310.04859
  • repo_url: https://github.com/djdprogramming/adfa2
  • paper_authors: Isaac Reid, Krzysztof Choromanski, Eli Berger, Adrian Weller
  • for: 本研究的目的是提出一种新的随机游走算法,用于无偏估计图上任意函数的权重邻接矩阵。
  • methods: 该算法使用随机游走模块化函数,可以在图中实现估计。它的时间复杂度为乘数减少,可以处理更大的图。此外,该算法可以轻松分布在多台机器上,实现大规模学习。
  • results: 研究人员通过实验和理论分析表明,该算法可以提供更高质量的估计或高效、扩展性的学习。具体来说,该算法可以实现点精度估计、非同Homogeneous图ordinary differential equations、节点划分和kernel regression等任务。
    Abstract We propose a novel random walk-based algorithm for unbiased estimation of arbitrary functions of a weighted adjacency matrix, coined universal graph random features (u-GRFs). This includes many of the most popular examples of kernels defined on the nodes of a graph. Our algorithm enjoys subquadratic time complexity with respect to the number of nodes, overcoming the notoriously prohibitive cubic scaling of exact graph kernel evaluation. It can also be trivially distributed across machines, permitting learning on much larger networks. At the heart of the algorithm is a modulation function which upweights or downweights the contribution from different random walks depending on their lengths. We show that by parameterising it with a neural network we can obtain u-GRFs that give higher-quality kernel estimates or perform efficient, scalable kernel learning. We provide robust theoretical analysis and support our findings with experiments including pointwise estimation of fixed graph kernels, solving non-homogeneous graph ordinary differential equations, node clustering and kernel regression on triangular meshes.
    摘要 我们提出了一种新的基于随机游走的算法,用于对带权邻接矩阵的任意函数进行无偏估计,称为universal graph random features(u-GRFs)。这涵盖了许多最流行的、定义在图节点上的核函数。我们的算法关于节点数具有次二次时间复杂度,克服了精确图核计算众所周知的立方级开销。它还可以轻松分布到多台机器上,从而在更大的网络上进行学习。算法的核心是一个调制函数,根据随机游走的长度增加或减小其贡献。我们证明,用神经网络参数化该函数可以得到更高质量的核估计,或进行高效、可扩展的核学习。我们给出了稳健的理论分析,并通过固定图核的逐点估计、非齐次图常微分方程求解、节点聚类以及三角网格上的核回归等实验支持了我们的结论。

LIPEx – Locally Interpretable Probabilistic Explanations – To Look Beyond The True Class

  • paper_url: http://arxiv.org/abs/2310.04856
  • repo_url: None
  • paper_authors: Hongbo Zhu, Angelo Cangelosi, Procheta Sen, Anirbit Mukherjee
  • for: 这 paper 的目的是提出一种新的干扰基于的多类解释框架,LIPEx(本地可解释概率解释)。
  • methods: 这 paper 使用了一种新的方法,即通过在概率分布空间进行回归来定义解释,并通过HELLINGER距离来衡量解释的准确性。
  • results: 实验表明,LIPEx 可以不仅在本地复制 Complex 分类模型输出的概率分布,还可以提供每个特征对预测概率的解释,并且在文本和图像数据上进行了ablation 测试,显示 LIPEx 在隐藏特征 elimination 方面比其他 saliency-based 或 feature importance-based XAI 方法更加有效。
    Abstract In this work, we instantiate a novel perturbation-based multi-class explanation framework, LIPEx (Locally Interpretable Probabilistic Explanation). We demonstrate that LIPEx not only locally replicates the probability distributions output by the widely used complex classification models but also provides insight into how every feature deemed to be important affects the prediction probability for each of the possible classes. We achieve this by defining the explanation as a matrix obtained via regression with respect to the Hellinger distance in the space of probability distributions. Ablation tests on text and image data, show that LIPEx-guided removal of important features from the data causes more change in predictions for the underlying model than similar tests on other saliency-based or feature importance-based XAI methods. It is also shown that compared to LIME, LIPEx is much more data efficient in terms of the number of perturbations needed for reliable evaluation of the explanation.
    摘要 在这项工作中,我们实现了一种新的基于扰动的多类解释框架LIPEx(局部可解释概率解释)。我们表明,LIPEx不仅能在局部复现广泛使用的复杂分类模型输出的概率分布,还能揭示每个被认定为重要的特征如何影响各个可能类别的预测概率。我们通过将解释定义为在概率分布空间中以Hellinger距离为目标进行回归所得到的矩阵来实现这一点。在文本和图像数据上的消融测试显示,按LIPEx指引从数据中移除重要特征,对底层模型预测造成的改变大于其他基于显著性或特征重要性的XAI方法。此外,与LIME相比,LIPEx在数据效率上也更优,可靠评估解释所需的扰动数量要少得多。
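A minimal sketch of a LIPEx-style fit is given below: perturb the input, query the black-box model for full class-probability vectors, and fit a per-class linear explanation matrix by minimizing squared Hellinger distance. The black-box, mask-based perturbations, and optimizer settings are illustrative; kernel weighting and other details from the paper are omitted.

```python
# Fit an explanation matrix by Hellinger-distance regression over perturbations.
import torch

def hellinger_sq(p, q, eps=1e-8):
    return 0.5 * ((torch.sqrt(p + eps) - torch.sqrt(q + eps)) ** 2).sum(-1)

def lipex_explain(black_box, x, n_samples=256, steps=300, lr=0.1):
    # Binary masks over features play the role of interpretable perturbations.
    masks = (torch.rand(n_samples, x.numel()) > 0.5).float()
    targets = torch.stack([black_box(x * m) for m in masks])         # (n_samples, n_classes)
    W = torch.zeros(targets.size(1), x.numel(), requires_grad=True)  # explanation matrix
    opt = torch.optim.Adam([W], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        pred = torch.softmax(masks @ W.t(), dim=-1)
        hellinger_sq(pred, targets).mean().backward()
        opt.step()
    return W.detach()   # row c: how each feature pushes probability toward class c

# Toy usage with a fake 3-class black box.
torch.manual_seed(0)
proj = torch.randn(3, 8)
black_box = lambda v: torch.softmax(proj @ v, dim=-1)
print(lipex_explain(black_box, torch.randn(8)).shape)
```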

Epsilon non-Greedy: A Bandit Approach for Unbiased Recommendation via Uniform Data

  • paper_url: http://arxiv.org/abs/2310.04855
  • repo_url: None
  • paper_authors: S. M. F. Sani, Seyed Abbas Hosseini, Hamid R. Rabiee
  • for: 降低推荐系统中由持续训练造成的自反馈循环偏差,并为后续训练生成更好的训练数据。
  • methods: 提出一个框架,使用少量均匀收集的数据学习无偏估计器,并专注于为后续训练迭代生成改进的训练数据;将推荐视为情境多臂老虎机问题,强调探索模型理解有限的物品。
  • results: 在所提出的离线序列训练方案下进行的大量实验显示,该模型优于现有最先进的去偏方法。
    Abstract Often, recommendation systems employ continuous training, leading to a self-feedback loop bias in which the system becomes biased toward its previous recommendations. Recent studies have attempted to mitigate this bias by collecting small amounts of unbiased data. While these studies have successfully developed less biased models, they ignore the crucial fact that the recommendations generated by the model serve as the training data for subsequent training sessions. To address this issue, we propose a framework that learns an unbiased estimator using a small amount of uniformly collected data and focuses on generating improved training data for subsequent training iterations. To accomplish this, we view recommendation as a contextual multi-arm bandit problem and emphasize on exploring items that the model has a limited understanding of. We introduce a new offline sequential training schema that simulates real-world continuous training scenarios in recommendation systems, offering a more appropriate framework for studying self-feedback bias. We demonstrate the superiority of our model over state-of-the-art debiasing methods by conducting extensive experiments using the proposed training schema.

Repelling Random Walks

  • paper_url: http://arxiv.org/abs/2310.04854
  • repo_url: None
  • paper_authors: Isaac Reid, Eli Berger, Krzysztof Choromanski, Adrian Weller
  • for: To improve the efficiency of graph-based sampling so that statistical estimators become more concentrated while remaining unbiased.
  • methods: An ensemble of interacting walkers whose trajectories repel one another while each walker's marginal transition probabilities are left unmodified, so the ensemble explores the graph more efficiently; the mechanism has a trivial drop-in implementation.
  • results: The effectiveness of repelling random walks is demonstrated in a range of settings, including estimation of graph kernels, the PageRank vector, and graphlet concentrations, with detailed experimental evaluation and robust theoretical guarantees.
    Abstract We present a novel quasi-Monte Carlo mechanism to improve graph-based sampling, coined repelling random walks. By inducing correlations between the trajectories of an interacting ensemble such that their marginal transition probabilities are unmodified, we are able to explore the graph more efficiently, improving the concentration of statistical estimators whilst leaving them unbiased. The mechanism has a trivial drop-in implementation. We showcase the effectiveness of repelling random walks in a range of settings including estimation of graph kernels, the PageRank vector and graphlet concentrations. We provide detailed experimental evaluation and robust theoretical guarantees. To our knowledge, repelling random walks constitute the first rigorously studied quasi-Monte Carlo scheme correlating the directions of walkers on a graph, inviting new research in this exciting nascent domain.
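
One way to picture the mechanism is to couple walkers that currently share a node so that they are steered toward distinct neighbours, while each walker's marginal next step stays uniform over the neighbours. The cyclic-permutation coupling below is an illustrative construction in that spirit, not the paper's exact scheme.

```python
import numpy as np
from collections import defaultdict

def repelling_step(positions, adjacency, rng):
    """Advance every walker one step; walkers at the same node share one random
    permutation of that node's neighbours, so they spread out while each
    walker's marginal transition remains the usual uniform choice."""
    new_positions = np.empty_like(positions)
    at_node = defaultdict(list)
    for walker, node in enumerate(positions):
        at_node[node].append(walker)
    for node, walkers in at_node.items():
        nbrs = adjacency[node]
        perm = rng.permutation(len(nbrs))
        for k, walker in enumerate(walkers):
            new_positions[walker] = nbrs[perm[k % len(nbrs)]]
    return new_positions

# Toy usage: 3 coupled walkers on a 5-cycle, all starting from node 0.
adjacency = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}
rng = np.random.default_rng(0)
positions = np.zeros(3, dtype=int)
for _ in range(4):
    positions = repelling_step(positions, adjacency, rng)
print(positions)
```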

HyperSINDy: Deep Generative Modeling of Nonlinear Stochastic Governing Equations

  • paper_url: http://arxiv.org/abs/2310.04832
  • repo_url: None
  • paper_authors: Mozes Jacobs, Bingni W. Brunton, Steven L. Brunton, J. Nathan Kutz, Ryan V. Raut
  • for: To advance the data-driven discovery of governing differential equations, extending sparse equation discovery to stochastic dynamics.
  • methods: The HyperSINDy framework, a deep generative model of sparse governing equations in which a variational encoder approximates the distribution of observed states and derivatives and a hypernetwork maps samples from that distribution to the coefficients of a differential equation whose sparse form is learned with a trainable binary mask.
  • results: Experiments show that HyperSINDy accurately recovers ground-truth stochastic governing equations, with learned stochasticity scaling to match that of the data, and provides uncertainty quantification that scales to high-dimensional systems.
    Abstract The discovery of governing differential equations from data is an open frontier in machine learning. The sparse identification of nonlinear dynamics (SINDy) \citep{brunton_discovering_2016} framework enables data-driven discovery of interpretable models in the form of sparse, deterministic governing laws. Recent works have sought to adapt this approach to the stochastic setting, though these adaptations are severely hampered by the curse of dimensionality. On the other hand, Bayesian-inspired deep learning methods have achieved widespread success in high-dimensional probabilistic modeling via computationally efficient approximate inference techniques, suggesting the use of these techniques for efficient stochastic equation discovery. Here, we introduce HyperSINDy, a framework for modeling stochastic dynamics via a deep generative model of sparse governing equations whose parametric form is discovered from data. HyperSINDy employs a variational encoder to approximate the distribution of observed states and derivatives. A hypernetwork \citep{ha_hypernetworks_2016} transforms samples from this distribution into the coefficients of a differential equation whose sparse form is learned simultaneously using a trainable binary mask \citep{louizos_learning_2018}. Once trained, HyperSINDy generates stochastic dynamics via a differential equation whose coefficients are driven by a Gaussian white noise. In experiments, HyperSINDy accurately recovers ground truth stochastic governing equations, with learned stochasticity scaling to match that of the data. Finally, HyperSINDy provides uncertainty quantification that scales to high-dimensional systems. Taken together, HyperSINDy offers a promising framework for model discovery and uncertainty quantification in real-world systems, integrating sparse equation discovery methods with advances in statistical machine learning and deep generative modeling.
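
For context, the deterministic SINDy step that HyperSINDy generalizes is sparse regression of observed derivatives onto a library of candidate terms; a minimal sequentially-thresholded least-squares sketch is below. The polynomial library, threshold, and toy data are assumptions for illustration — HyperSINDy itself replaces the fixed coefficients with samples produced by a hypernetwork.

```python
import numpy as np

def library(x):
    """Candidate terms Theta(x) for a 1-D state: [1, x, x^2, x^3]."""
    return np.column_stack([np.ones_like(x), x, x**2, x**3])

def stlsq(theta, dxdt, threshold=0.1, n_iter=10):
    """Sequentially thresholded least squares: fit, zero small coefficients, refit."""
    xi, *_ = np.linalg.lstsq(theta, dxdt, rcond=None)
    for _ in range(n_iter):
        small = np.abs(xi) < threshold
        xi[small] = 0.0
        big = ~small
        if big.any():
            xi[big], *_ = np.linalg.lstsq(theta[:, big], dxdt, rcond=None)
    return xi

# Toy data from dx/dt = -2x + 0.5x^3 plus noise; the fit should keep only those two terms.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=400)
dxdt = -2.0 * x + 0.5 * x**3 + 0.01 * rng.normal(size=x.size)
print(stlsq(library(x), dxdt))   # expect approximately [0, -2, 0, 0.5]
```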

Critique Ability of Large Language Models

  • paper_url: http://arxiv.org/abs/2310.04815
  • repo_url: None
  • paper_authors: Liangchen Luo, Zi Lin, Yinxiao Liu, Lei Shu, Yun Zhu, Jingbo Shang, Lei Meng
  • for: To examine whether large language models (LLMs) can deliver accurate critiques across various tasks.
  • methods: A unified evaluation framework built around CriticBench, a benchmark of 3K high-quality natural-language queries and corresponding model responses annotated for correctness, covering math problem solving, code completion, and question answering.
  • results: Critique is challenging for most LLMs and typically emerges only at sufficient scale; self-critique is especially difficult, with even top-performing LLMs falling short; and critique accuracy is lowest on the questions a model is most uncertain about. A simple yet effective baseline named self-check leverages self-critique to improve task performance across models.
    Abstract Critical thinking is essential for rational decision-making and problem-solving. This skill hinges on the ability to provide precise and reasoned critiques and is a hallmark of human intelligence. In the era of large language models (LLMs), this study explores the ability of LLMs to deliver accurate critiques across various tasks. We are interested in this topic as a capable critic model could not only serve as a reliable evaluator, but also as a source of supervised signals for model tuning. Particularly, if a model can self-critique, it has the potential for autonomous self-improvement. To examine this, we introduce a unified evaluation framework for assessing the critique abilities of LLMs. We develop a benchmark called CriticBench, which comprises 3K high-quality natural language queries and corresponding model responses; and annotate the correctness of these responses. The benchmark cover tasks such as math problem-solving, code completion, and question answering. We evaluate multiple LLMs on the collected dataset and our analysis reveals several noteworthy insights: (1) Critique is generally challenging for most LLMs, and this capability often emerges only when models are sufficiently large. (2) In particular, self-critique is especially difficult. Even top-performing LLMs struggle to achieve satisfactory performance. (3) Models tend to have lower critique accuracy on problems where they are most uncertain. To this end, we introduce a simple yet effective baseline named self-check, which leverages self-critique to improve task performance for various models. We hope this study serves as an initial exploration into understanding the critique abilities of LLMs, and aims to inform future research, including the development of more proficient critic models and the application of critiques across diverse tasks.
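
A self-check style baseline can be sketched as: sample several candidate answers, ask the same model to critique each, and keep the answer it most often judges correct. The prompts, the voting rule, and the `generate` placeholder below are assumptions, not the paper's exact procedure.

```python
import collections

def generate(prompt: str) -> str:
    """Placeholder LLM call -- replace with whatever model API is available."""
    raise NotImplementedError

def self_check(question: str, n_candidates: int = 5) -> str:
    votes = collections.Counter()
    candidates = [generate(f"Question: {question}\nAnswer step by step.")
                  for _ in range(n_candidates)]
    for answer in candidates:
        critique = generate(
            f"Question: {question}\nProposed answer: {answer}\n"
            "Critique the answer. End with 'VERDICT: correct' or 'VERDICT: incorrect'."
        )
        if "VERDICT: correct" in critique:   # count answers the model itself endorses
            votes[answer] += 1
    # fall back to the first candidate if the model rejects everything
    return votes.most_common(1)[0][0] if votes else candidates[0]
```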

Applications of Littlestone dimension to query learning and to compression

  • paper_url: http://arxiv.org/abs/2310.04812
  • repo_url: None
  • paper_authors: Hunter Chase, James Freitag, Lev Reyzin
  • for: This paper gives several applications of Littlestone dimension, including extensions of the model of \cite{angluin2017power} and extensions to infinite concept classes.
  • methods: Littlestone dimension is used to extend the results of \cite{angluin2017power} on learning by equivalence queries with random counterexamples, to handle infinite concept classes with an additional source of randomness, and to study classes with extended $d$-compression schemes.
  • results: The paper proves a strong version of a conjecture of \cite{floyd1995sample} for Littlestone dimension, establishing its relationship to extended $d$-compression schemes.
    Abstract In this paper we give several applications of Littlestone dimension. The first is to the model of \cite{angluin2017power}, where we extend their results for learning by equivalence queries with random counterexamples. Second, we extend that model to infinite concept classes with an additional source of randomness. Third, we give improved results on the relationship of Littlestone dimension to classes with extended $d$-compression schemes, proving a strong version of a conjecture of \cite{floyd1995sample} for Littlestone dimension.

Accelerate Multi-Agent Reinforcement Learning in Zero-Sum Games with Subgame Curriculum Learning

  • paper_url: http://arxiv.org/abs/2310.04796
  • repo_url: None
  • paper_authors: Jiayu Chen, Zelai Xu, Yunfei Li, Chao Yu, Jiaming Song, Huazhong Yang, Fei Fang, Yu Wang, Yi Wu
  • for: To accelerate multi-agent reinforcement learning (MARL) toward Nash equilibrium (NE) in complex zero-sum games.
  • methods: A subgame curriculum learning framework with an adaptive initial state distribution, a subgame selection metric that approximates the squared distance to NE values, and a particle-based state sampler for subgame generation, realized as the Subgame Automatic Curriculum Learning (SACL) algorithm.
  • results: In the particle-world and Google Research Football environments SACL produces much stronger policies than the baselines, and in the challenging hide-and-seek quadrant environment it produces all four emergent stages using only half the samples of MAPPO with self-play.
    Abstract Learning Nash equilibrium (NE) in complex zero-sum games with multi-agent reinforcement learning (MARL) can be extremely computationally expensive. Curriculum learning is an effective way to accelerate learning, but an under-explored dimension for generating a curriculum is the difficulty-to-learn of the subgames -- games induced by starting from a specific state. In this work, we present a novel subgame curriculum learning framework for zero-sum games. It adopts an adaptive initial state distribution by resetting agents to some previously visited states where they can quickly learn to improve performance. Building upon this framework, we derive a subgame selection metric that approximates the squared distance to NE values and further adopt a particle-based state sampler for subgame generation. Integrating these techniques leads to our new algorithm, Subgame Automatic Curriculum Learning (SACL), which is a realization of the subgame curriculum learning framework. SACL can be combined with any MARL algorithm such as MAPPO. Experiments in the particle-world environment and Google Research Football environment show SACL produces much stronger policies than baselines. In the challenging hide-and-seek quadrant environment, SACL produces all four emergent stages and uses only half the samples of MAPPO with self-play. The project website is at https://sites.google.com/view/sacl-rl.

Conditional Diffusion Model for Target Speaker Extraction

  • paper_url: http://arxiv.org/abs/2310.04791
  • repo_url: None
  • paper_authors: Theodor Nguyen, Guangzhi Sun, Xianrui Zheng, Chao Zhang, Philip C Woodland
  • for: DiffSpEx, a generative target speaker extraction method based on score-based generative modelling, for extracting a target speaker from a mixture of sources.
  • methods: A continuous-time stochastic diffusion process in the complex short-time Fourier transform domain that starts from the target speaker source and converges to a Gaussian distribution centred on the mixture of sources; in the reverse-time process a parametrised score function, conditioned on an ECAPA-TDNN target speaker embedding, extracts the target speaker.
  • results: On the WSJ0-2mix dataset the method achieves an SI-SDR of 12.9 dB and a NISQA score of 3.56, and fine-tuning a pre-trained model to a specific speaker further improves performance, enabling personalised target speaker extraction.
    Abstract We propose DiffSpEx, a generative target speaker extraction method based on score-based generative modelling through stochastic differential equations. DiffSpEx deploys a continuous-time stochastic diffusion process in the complex short-time Fourier transform domain, starting from the target speaker source and converging to a Gaussian distribution centred on the mixture of sources. For the reverse-time process, a parametrised score function is conditioned on a target speaker embedding to extract the target speaker from the mixture of sources. We utilise ECAPA-TDNN target speaker embeddings and condition the score function alternately on the SDE time embedding and the target speaker embedding. The potential of DiffSpEx is demonstrated with the WSJ0-2mix dataset, achieving an SI-SDR of 12.9 dB and a NISQA score of 3.56. Moreover, we show that fine-tuning a pre-trained DiffSpEx model to a specific speaker further improves performance, enabling personalisation in target speaker extraction.

Online Corrupted User Detection and Regret Minimization

  • paper_url: http://arxiv.org/abs/2310.04768
  • repo_url: https://github.com/JizeXie/Online-Corrupted-User-Detection-and-Regret-Minimization
  • paper_authors: Zhiyong Wang, Jize Xie, Tong Yu, Shuai Li, John C. S. Lui
  • for: To learn from potentially corrupted user behaviours and identify corrupted users online in multi-user scenarios such as click fraud and fake reviews.
  • methods: A novel corruption-robust bandit algorithm, RCLUB-WCU, that exploits unknown relations among users, together with an online corrupted-user detection algorithm, OCCUD, built on RCLUB-WCU's inferred user relations.
  • results: A regret upper bound for RCLUB-WCU that asymptotically matches the lower bound in $T$ up to logarithmic factors, a theoretical guarantee for OCCUD's detection accuracy, and experiments showing superior performance over previous bandit algorithms with high detection accuracy.
    Abstract In real-world online web systems, multiple users usually arrive sequentially into the system. For applications like click fraud and fake reviews, some users can maliciously perform corrupted (disrupted) behaviors to trick the system. Therefore, it is crucial to design efficient online learning algorithms to robustly learn from potentially corrupted user behaviors and accurately identify the corrupted users in an online manner. Existing works propose bandit algorithms robust to adversarial corruption. However, these algorithms are designed for a single user, and cannot leverage the implicit social relations among multiple users for more efficient learning. Moreover, none of them consider how to detect corrupted users online in the multiple-user scenario. In this paper, we present an important online learning problem named LOCUD to learn and utilize unknown user relations from disrupted behaviors to speed up learning, and identify the corrupted users in an online setting. To robustly learn and utilize the unknown relations among potentially corrupted users, we propose a novel bandit algorithm RCLUB-WCU. To detect the corrupted users, we devise a novel online detection algorithm OCCUD based on RCLUB-WCU's inferred user relations. We prove a regret upper bound for RCLUB-WCU, which asymptotically matches the lower bound with respect to $T$ up to logarithmic factors, and matches the state-of-the-art results in degenerate cases. We also give a theoretical guarantee for the detection accuracy of OCCUD. With extensive experiments, our methods achieve superior performance over previous bandit algorithms and high corrupted user detection accuracy.

Robust Low-Rank Matrix Completion via a New Sparsity-Inducing Regularizer

  • paper_url: http://arxiv.org/abs/2310.04762
  • repo_url: None
  • paper_authors: Zhi-Yong Wang, Hing Cheung So, Abdelhak M. Zoubir
  • for: This paper proposes a new loss function, hybrid ordinary-Welsch (HOW), and an associated sparsity-inducing regularizer for robust low-rank matrix completion.
  • methods: The regularizer derived from HOW is shown to be quasiconvex with a convex Moreau envelope whose proximity operator has a closed form, and an efficient algorithm based on the alternating direction method of multipliers is developed.
  • results: Experiments on synthetic and real-world data show that, compared with nonconvex regularizers such as the $\ell_p$-norm, the proposed regularizer achieves better restoration performance in matrix completion.
    Abstract This paper presents a novel loss function referred to as hybrid ordinary-Welsch (HOW) and a new sparsity-inducing regularizer associated with HOW. We theoretically show that the regularizer is quasiconvex and that the corresponding Moreau envelope is convex. Moreover, the closed-form solution to its Moreau envelope, namely, the proximity operator, is derived. Compared with nonconvex regularizers like the lp-norm with $0<p<1$

Unit Commitment Predictor With a Performance Guarantee: A Support Vector Machine Classifier

  • paper_url: http://arxiv.org/abs/2310.08601
  • repo_url: None
  • paper_authors: Farzaneh Pourahmadi, Jalal Kazempour
  • for: To give system operators a pragmatic way to solve large-scale unit commitment problems within a limited computation time.
  • methods: Linear and kernelized support vector machine classifiers learn and predict the on/off commitment decisions of conventional units so that the solver can be warm-started; with proper regularization the classifiers carry an out-of-sample performance guarantee, converting to distributionally robust classifiers. The unit commitment problem itself is solved as a mixed-integer second-order cone program.
  • results: On the IEEE 6-bus and 118-bus test systems, the properly regularized kernelized SVM outperforms the other classifiers and reduces the computational time by a factor of 1.7; under a tight time limit, the warm-started unit commitment problem can still be solved to optimality even when the cold-started one remains far from the optimal solution.
    Abstract The system operators usually need to solve large-scale unit commitment problems within limited time frame for computation. This paper provides a pragmatic solution, showing how by learning and predicting the on/off commitment decisions of conventional units, there is a potential for system operators to warm start their solver and speed up their computation significantly. For the prediction, we train linear and kernelized support vector machine classifiers, providing an out-of-sample performance guarantee if properly regularized, converting to distributionally robust classifiers. For the unit commitment problem, we solve a mixed-integer second-order cone problem. Our results based on the IEEE 6-bus and 118-bus test systems show that the kernelized SVM with proper regularization outperforms other classifiers, reducing the computational time by a factor of 1.7. In addition, if there is a tight computational limit, while the unit commitment problem without warm start is far away from the optimal solution, its warmly started version can be solved to optimality within the time limit.
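
The warm-start idea can be sketched as: fit one kernelized SVM per generating unit on historical (load, on/off) pairs, then hand the predicted schedule to the MIP solver as an initial point. The single-feature toy data and thresholds below are illustrative assumptions, not the paper's features or test systems.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_days, n_hours, n_units = 300, 24, 3
load = rng.uniform(50, 200, size=(n_days, n_hours))              # historical hourly load
thresholds = np.array([80.0, 120.0, 160.0])                       # toy commitment rule
status = (load[..., None] > thresholds).astype(int)               # (days, hours, units) on/off

X = load.reshape(-1, 1)                                           # one feature per hour: the load
classifiers = []
for g in range(n_units):
    clf = SVC(kernel="rbf", C=10.0, gamma="scale")                # regularization controlled via C
    clf.fit(X, status[..., g].reshape(-1))
    classifiers.append(clf)

tomorrow = rng.uniform(50, 200, size=n_hours)                     # forecast load profile
warm_start = np.stack([clf.predict(tomorrow.reshape(-1, 1)) for clf in classifiers], axis=1)
print(warm_start)   # 24 x 3 binary matrix, passed to the MIP solver as an initial solution
```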

Digital Twin Assisted Deep Reinforcement Learning for Online Optimization of Network Slicing Admission Control

  • paper_url: http://arxiv.org/abs/2310.09299
  • repo_url: None
  • paper_authors: Zhenyu Tao, Wei Xu, Xiaohu You
  • for: To address the initial instability of deep reinforcement learning (DRL) models used for network slicing admission control in 5G and beyond networks.
  • methods: A digital twin (DT) assisted DRL approach: the admission decision-making process is formulated as a semi-Markov decision process, simplified into an equivalent discrete-time Markov decision process, and a DT built through supervised learning assists the training phase of the DRL model.
  • results: During initial training, the DT-assisted DRL model increases resource utilization by over 40% compared with a directly trained state-of-the-art Dueling-DQN and by over 20% compared with the directly trained DRL model, while preserving the model's ability to optimize long-term rewards.
    Abstract The proliferation of diverse network services in 5G and beyond networks has led to the emergence of network slicing technologies. Among these, admission control plays a crucial role in achieving specific optimization goals through the selective acceptance of service requests. Although Deep Reinforcement Learning (DRL) forms the foundation in many admission control approaches for its effectiveness and flexibility, the initial instability of DRL models hinders their practical deployment in real-world networks. In this work, we propose a digital twin (DT) assisted DRL solution to address this issue. Specifically, we first formulate the admission decision-making process as a semi-Markov decision process, which is subsequently simplified into an equivalent discrete-time Markov decision process to facilitate the implementation of DRL methods. The DT is established through supervised learning and employed to assist the training phase of the DRL model. Extensive simulations show that the DT-assisted DRL model increased resource utilization by over 40\% compared to the directly trained state-of-the-art Dueling-DQN and over 20\% compared to our directly trained DRL model during initial training. This improvement is achieved while preserving the model's capacity to optimize the long-term rewards.

Parameter Efficient Multi-task Model Fusion with Partial Linearization

  • paper_url: http://arxiv.org/abs/2310.04742
  • repo_url: None
  • paper_authors: Anke Tang, Li Shen, Yong Luo, Yibing Zhan, Han Hu, Bo Du, Yixin Chen, Dacheng Tao
  • for: To make multi-task model fusion more parameter- and data-efficient, so that fine-tuned models from different tasks can be combined more effectively.
  • methods: A novel approach that partially linearizes only the adapter modules and applies task arithmetic over the linearized adapters, retaining the benefits of fusing linearized models while keeping fine-tuning and inference efficient.
  • results: Experiments show the method fuses multiple tasks into a single model more effectively than standard adapter tuning and task arithmetic alone, and it continues to outperform standard parameter-efficient fine-tuning techniques as the number of tasks grows.
    Abstract Large pre-trained models have enabled significant advances in machine learning and served as foundation components. Model fusion methods, such as task arithmetic, have been proven to be powerful and scalable to incorporate fine-tuned weights from different tasks into a multi-task model. However, efficiently fine-tuning large pre-trained models on multiple downstream tasks remains challenging, leading to inefficient multi-task model fusion. In this work, we propose a novel method to improve multi-task fusion for parameter-efficient fine-tuning techniques like LoRA fine-tuning. Specifically, our approach partially linearizes only the adapter modules and applies task arithmetic over the linearized adapters. This allows us to leverage the the advantages of model fusion over linearized fine-tuning, while still performing fine-tuning and inference efficiently. We demonstrate that our partial linearization technique enables a more effective fusion of multiple tasks into a single model, outperforming standard adapter tuning and task arithmetic alone. Experimental results demonstrate the capabilities of our proposed partial linearization technique to effectively construct unified multi-task models via the fusion of fine-tuned task vectors. We evaluate performance over an increasing number of tasks and find that our approach outperforms standard parameter-efficient fine-tuning techniques. The results highlight the benefits of partial linearization for scalable and efficient multi-task model fusion.
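
The task-arithmetic step the paper builds on can be sketched as: form a task vector (fine-tuned weights minus the shared base) per task and add a scaled sum of them to the base. In the paper this arithmetic is applied to (partially linearized) adapter modules; plain per-tensor dictionaries stand in for them here as an illustration.

```python
import torch

def task_vector(base: dict, finetuned: dict) -> dict:
    """Per-tensor difference between a fine-tuned checkpoint and the base model."""
    return {k: finetuned[k] - base[k] for k in base}

def fuse(base: dict, task_vectors: list, scale: float = 0.3) -> dict:
    """Multi-task model = base + scale * sum of task vectors."""
    fused = {k: v.clone() for k, v in base.items()}
    for tv in task_vectors:
        for k in fused:
            fused[k] += scale * tv[k]
    return fused

# Toy usage with two "tasks" fine-tuned from the same 2x2 adapter weight.
base = {"adapter.weight": torch.zeros(2, 2)}
task_a = {"adapter.weight": torch.tensor([[1.0, 0.0], [0.0, 0.0]])}
task_b = {"adapter.weight": torch.tensor([[0.0, 0.0], [0.0, 2.0]])}
multi = fuse(base, [task_vector(base, task_a), task_vector(base, task_b)])
print(multi["adapter.weight"])
```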

Subspace Identification for Multi-Source Domain Adaptation

  • paper_url: http://arxiv.org/abs/2310.04723
  • repo_url: None
  • paper_authors: Zijian Li, Ruichu Cai, Guangyi Chen, Boyang Sun, Zhifeng Hao, Kun Zhang
  • for: To relax the strict assumptions of multi-source domain adaptation (MSDA) methods so they remain reliable and effective under realistic domain shifts.
  • methods: A subspace identification theory that guarantees the disentanglement of domain-invariant and domain-specific variables under less restrictive constraints, instantiated as a Subspace Identification Guarantee (SIG) model based on variational inference and augmented with class-aware conditional alignment to handle target shifts.
  • results: Experiments on several benchmark datasets show that the SIG model outperforms existing MSDA techniques.
    Abstract Multi-source domain adaptation (MSDA) methods aim to transfer knowledge from multiple labeled source domains to an unlabeled target domain. Although current methods achieve target joint distribution identifiability by enforcing minimal changes across domains, they often necessitate stringent conditions, such as an adequate number of domains, monotonic transformation of latent variables, and invariant label distributions. These requirements are challenging to satisfy in real-world applications. To mitigate the need for these strict assumptions, we propose a subspace identification theory that guarantees the disentanglement of domain-invariant and domain-specific variables under less restrictive constraints regarding domain numbers and transformation properties, thereby facilitating domain adaptation by minimizing the impact of domain shifts on invariant variables. Based on this theory, we develop a Subspace Identification Guarantee (SIG) model that leverages variational inference. Furthermore, the SIG model incorporates class-aware conditional alignment to accommodate target shifts where label distributions change with the domains. Experimental results demonstrate that our SIG model outperforms existing MSDA techniques on various benchmark datasets, highlighting its effectiveness in real-world applications.

Offline Imitation Learning with Variational Counterfactual Reasoning

  • paper_url: http://arxiv.org/abs/2310.04706
  • repo_url: None
  • paper_authors: Bowei He, Zexu Sun, Jinxin Liu, Shuai Zhang, Xu Chen, Chen Ma
  • for: To improve the generalization of offline imitation learning (IL) agents that must learn an expert behaviour policy from scarce expert data and suboptimal, reward-free demonstrations.
  • methods: A framework named OILCA that uses an identifiable variational autoencoder to generate counterfactual samples for data augmentation, together with a theoretical analysis of counterfactual identification and the resulting generalization improvement.
  • results: The method significantly outperforms various baselines on the \textsc{DeepMind Control Suite} benchmark for in-distribution robustness and the \textsc{CausalWorld} benchmark for out-of-distribution generalization.
    Abstract In offline Imitation Learning (IL), an agent aims to learn an optimal expert behavior policy without additional online environment interactions. However, in many real-world scenarios, such as robotics manipulation, the offline dataset is collected from suboptimal behaviors without rewards. Due to the scarce expert data, the agents usually suffer from simply memorizing poor trajectories and are vulnerable to the variations in the environments, lacking the capability of generalizing to new environments. To effectively remove spurious features that would otherwise bias the agent and hinder generalization, we propose a framework named \underline{O}ffline \underline{I}mitation \underline{L}earning with \underline{C}ounterfactual data \underline{A}ugmentation (OILCA). In particular, we leverage the identifiable variational autoencoder to generate \textit{counterfactual} samples. We theoretically analyze the counterfactual identification and the improvement of generalization. Moreover, we conduct extensive experiments to demonstrate that our approach significantly outperforms various baselines on both \textsc{DeepMind Control Suite} benchmark for in-distribution robustness and \textsc{CausalWorld} benchmark for out-of-distribution generalization.

Twin Graph-based Anomaly Detection via Attentive Multi-Modal Learning for Microservice System

  • paper_url: http://arxiv.org/abs/2310.04701
  • repo_url: https://github.com/alipay/microservice_system_twin_graph_based_anomaly_detection
  • paper_authors: Jun Huang, Yang Yang, Hang Yu, Jianguo Li, Xiao Zheng
  • for: This paper proposes a semi-supervised graph-based method for detecting anomalies in microservice systems.
  • methods: Features from three data modalities (metrics, logs, and traces) are integrated via a microservice system twin (MST) graph, and a transformer-based network with spatial and temporal attention models the correlations between modalities and the temporal dependencies between data points.
  • results: Experiments show that the method detects anomalies in microservice systems accurately and automatically in real time.
    Abstract Microservice architecture has sprung up over recent years for managing enterprise applications, due to its ability to independently deploy and scale services. Despite its benefits, ensuring the reliability and safety of a microservice system remains highly challenging. Existing anomaly detection algorithms based on a single data modality (i.e., metrics, logs, or traces) fail to fully account for the complex correlations and interactions between different modalities, leading to false negatives and false alarms, whereas incorporating more data modalities can offer opportunities for further performance gain. As a fresh attempt, we propose in this paper a semi-supervised graph-based anomaly detection method, MSTGAD, which seamlessly integrates all available data modalities via attentive multi-modal learning. First, we extract and normalize features from the three modalities, and further integrate them using a graph, namely MST (microservice system twin) graph, where each node represents a service instance and the edge indicates the scheduling relationship between different service instances. The MST graph provides a virtual representation of the status and scheduling relationships among service instances of a real-world microservice system. Second, we construct a transformer-based neural network with both spatial and temporal attention mechanisms to model the inter-correlations between different modalities and temporal dependencies between the data points. This enables us to detect anomalies automatically and accurately in real-time. The source code of MSTGAD is publicly available at https://github.com/alipay/microservice_system_twin_graph_based_anomaly_detection.

Tight Rates in Supervised Outlier Transfer Learning

  • paper_url: http://arxiv.org/abs/2310.04686
  • repo_url: None
  • paper_authors: Mohammadreza M. Kalan, Samory Kpotufe
  • for: To understand when and how knowledge can be transferred to a target outlier detection task from related but imperfect outlier data.
  • methods: The classical Neyman-Pearson classification framework, which formalizes supervised outlier detection, augmented with access to related but imperfect outlier data; a notion of discrepancy extending existing measures from balanced classification characterizes the problem.
  • results: Information-theoretic limits for outlier transfer are established and shown to be achievable by adaptive procedures; unlike in balanced classification, seemingly very dissimilar sources can provide substantial information about a target, enabling fast transfer.
    Abstract A critical barrier to learning an accurate decision rule for outlier detection is the scarcity of outlier data. As such, practitioners often turn to the use of similar but imperfect outlier data from which they might transfer information to the target outlier detection task. Despite the recent empirical success of transfer learning approaches in outlier detection, a fundamental understanding of when and how knowledge can be transferred from a source to a target outlier detection task remains elusive. In this work, we adopt the traditional framework of Neyman-Pearson classification -- which formalizes supervised outlier detection -- with the added assumption that one has access to some related but imperfect outlier data. Our main results are as follows: We first determine the information-theoretic limits of the problem under a measure of discrepancy that extends some existing notions from traditional balanced classification; interestingly, unlike in balanced classification, seemingly very dissimilar sources can provide much information about a target, thus resulting in fast transfer. We then show that, in principle, these information-theoretic limits are achievable by adaptive procedures, i.e., procedures with no a priori information on the discrepancy between source and target outlier distributions.

Surgical Gym: A high-performance GPU-based platform for reinforcement learning with surgical robots

  • paper_url: http://arxiv.org/abs/2310.04676
  • repo_url: https://github.com/samuelschmidgall/surgicalgym
  • paper_authors: Samuel Schmidgall, Axel Krieger, Jason Eshraghian
  • for: To advance robot-assisted surgery toward greater autonomy, with the potential to reduce variability in surgical outcomes and complication rates.
  • methods: Surgical Gym, an open-source high-performance platform for surgical robot learning in which both the physics simulation and the deep reinforcement learning run directly on the GPU, making simulated training data far more accessible.
  • results: Training with Surgical Gym is 100-5000x faster than with previous surgical learning platforms.
    Abstract Recent advances in robot-assisted surgery have resulted in progressively more precise, efficient, and minimally invasive procedures, sparking a new era of robotic surgical intervention. This enables doctors, in collaborative interaction with robots, to perform traditional or minimally invasive surgeries with improved outcomes through smaller incisions. Recent efforts are working toward making robotic surgery more autonomous which has the potential to reduce variability of surgical outcomes and reduce complication rates. Deep reinforcement learning methodologies offer scalable solutions for surgical automation, but their effectiveness relies on extensive data acquisition due to the absence of prior knowledge in successfully accomplishing tasks. Due to the intensive nature of simulated data collection, previous works have focused on making existing algorithms more efficient. In this work, we focus on making the simulator more efficient, making training data much more accessible than previously possible. We introduce Surgical Gym, an open-source high performance platform for surgical robot learning where both the physics simulation and reinforcement learning occur directly on the GPU. We demonstrate between 100-5000x faster training times compared with previous surgical learning platforms. The code is available at: https://github.com/SamuelSchmidgall/SurgicalGym.

Modeling non-uniform uncertainty in Reaction Prediction via Boosting and Dropout

  • paper_url: http://arxiv.org/abs/2310.04674
  • repo_url: None
  • paper_authors: Taicheng Guo, Changsheng Ma, Xiuying Chen, Bozhao Nan, Kehan Guo, Shichao Pei, Nitesh V. Chawla, Olaf Wiest, Xiangliang Zhang
  • for: To improve reaction prediction by explicitly modelling the non-uniform uncertainty that stems from the stochastic reaction process.
  • methods: Starting from the conditional VAE formulation of reaction prediction, the latent variable is removed and uncertainty is instead modelled by boosting an ensemble of diverse models and applying dropout, with a ranking method that unions the two sets of predictions and prioritizes the most plausible products.
  • results: On the large-scale USPTO-MIT reaction prediction benchmark, the proposed method models non-uniform uncertainty better than the baselines.
    Abstract Reaction prediction has been recognized as a critical task in synthetic chemistry, where the goal is to predict the outcome of a reaction based on the given reactants. With the widespread adoption of generative models, the Variational Autoencoder(VAE) framework has typically been employed to tackle challenges in reaction prediction, where the reactants are encoded as a condition for the decoder, which then generates the product. Despite effectiveness, these conditional VAE (CVAE) models still fail to adequately account for the inherent uncertainty in reaction prediction, which primarily stems from the stochastic reaction process. The principal limitations are twofold. Firstly, in these CVAE models, the prior is independent of the reactants, leading to a default wide and assumed uniform distribution variance of the generated product. Secondly, reactants with analogous molecular representations are presumed to undergo similar electronic transition processes, thereby producing similar products. This hinders the ability to model diverse reaction mechanisms effectively. Since the variance in outcomes is inherently non-uniform, we are thus motivated to develop a framework that generates reaction products with non-uniform uncertainty. Firstly, we eliminate the latent variable in previous CVAE models to mitigate uncontrol-label noise. Instead, we introduce randomness into product generation via boosting to ensemble diverse models and cover the range of potential outcomes, and through dropout to secure models with minor variations. Additionally, we design a ranking method to union the predictions from boosting and dropout, prioritizing the most plausible products. Experimental results on the largest reaction prediction benchmark USPTO-MIT show the superior performance of our proposed method in modeling the non-uniform uncertainty compared to baselines.

Oracle Efficient Algorithms for Groupwise Regret

  • paper_url: http://arxiv.org/abs/2310.04652
  • repo_url: None
  • paper_authors: Krishna Acharya, Eshwar Ram Arunachaleswaran, Sampath Kannan, Aaron Roth, Juba Ziani
  • for: Online prediction in which an individual $x_t$ arrives at each time step $t$ and its label must be predicted; each individual belongs to possibly intersecting groups defined by features such as age, sex, and race, and the goal is regret guarantees that hold simultaneously on the subsequence of every single group.
  • methods: A simple modification of the sleeping-experts technique reduces the problem to the well-understood task of obtaining diminishing external regret without group considerations; the resulting algorithm runs in time linear in the number of groups and is oracle-efficient in the hypothesis class.
  • results: The algorithm attains groupwise regret guarantees comparable to prior work while being far more efficient, and extensive experiments on synthetic data, the Medical costs dataset, and the Adult income dataset show substantial error improvements uniformly across groups compared with standard online linear regression that has no groupwise guarantees.
    Abstract We study the problem of online prediction, in which at each time step $t$, an individual $x_t$ arrives, whose label we must predict. Each individual is associated with various groups, defined based on their features such as age, sex, race etc., which may intersect. Our goal is to make predictions that have regret guarantees not just overall but also simultaneously on each sub-sequence comprised of the members of any single group. Previous work such as [Blum & Lykouris] and [Lee et al] provide attractive regret guarantees for these problems; however, these are computationally intractable on large model classes. We show that a simple modification of the sleeping experts technique of [Blum & Lykouris] yields an efficient reduction to the well-understood problem of obtaining diminishing external regret absent group considerations. Our approach gives similar regret guarantees compared to [Blum & Lykouris]; however, we run in time linear in the number of groups, and are oracle-efficient in the hypothesis class. This in particular implies that our algorithm is efficient whenever the number of groups is polynomially bounded and the external-regret problem can be solved efficiently, an improvement on [Blum & Lykouris]'s stronger condition that the model class must be small. Our approach can handle online linear regression and online combinatorial optimization problems like online shortest paths. Beyond providing theoretical regret bounds, we evaluate this algorithm with an extensive set of experiments on synthetic data and on two real data sets -- Medical costs and the Adult income dataset, both instantiated with intersecting groups defined in terms of race, sex, and other demographic characteristics. We find that uniformly across groups, our algorithm gives substantial error improvements compared to running a standard online linear regression algorithm with no groupwise regret guarantees.
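
The sleeping-experts flavour of the reduction can be illustrated as follows: keep one base learner per group, let only the learners of groups containing the current individual be awake, average their predictions under multiplicative weights, and penalize each awake learner by its own loss. The running-mean base learners and squared loss below are assumptions for illustration, not the paper's oracle-efficient algorithm.

```python
import numpy as np

class GroupMean:
    """Base learner for one group: running mean of the labels seen in that group."""
    def __init__(self):
        self.total, self.count = 0.0, 0
    def predict(self):
        return self.total / self.count if self.count else 0.5
    def update(self, y):
        self.total += y
        self.count += 1

def groupwise_predict(stream, n_groups, eta=0.5):
    experts = [GroupMean() for _ in range(n_groups)]
    weights = np.ones(n_groups)
    predictions = []
    for groups_of_x, y in stream:            # groups_of_x: indices of groups containing x_t
        awake = np.array(groups_of_x)
        preds = np.array([experts[g].predict() for g in awake])
        w = weights[awake] / weights[awake].sum()
        predictions.append(float(w @ preds))
        for g, p in zip(awake, preds):       # multiplicative update on awake experts only
            weights[g] *= np.exp(-eta * (p - y) ** 2)
            experts[g].update(y)
    return predictions

# Toy stream: group 0 contains everyone, group 1 is a high-label subgroup.
stream = [([0, 1], 1.0), ([0], 0.0), ([0, 1], 1.0), ([0], 0.0), ([0, 1], 1.0)]
print(groupwise_predict(stream, n_groups=2))
```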

NPEFF: Non-Negative Per-Example Fisher Factorization

  • paper_url: http://arxiv.org/abs/2310.04649
  • repo_url: https://github.com/mmatena/npeff_ref
  • paper_authors: Michael Matena, Colin Raffel
  • for: To interpret the predictions of deep learning models via NPEFF, an interpretability method applicable to any end-to-end differentiable model.
  • methods: NPEFF rests on the principle that processing of a characteristic shared across examples involves a specific subset of model parameters; each example's Fisher information matrix is decomposed as a non-negative sum of components, which take the form of non-negative vectors (diagonal Fisher) or rank-1 positive semi-definite matrices (low-rank Fisher), with a new, highly scalable algorithm for the latter.
  • results: Experiments on language and vision models show that NPEFF recovers components with interpretable tunings whose parameter-space directions genuinely reflect the model's processing, including the strategies of a TRACR-compiled model, and that it can help uncover and correct flawed heuristics; the code is publicly released.
    Abstract As deep learning models are deployed in more and more settings, it becomes increasingly important to be able to understand why they produce a given prediction, but interpretation of these models remains a challenge. In this paper, we introduce a novel interpretability method called NPEFF that is readily applicable to any end-to-end differentiable model. It operates on the principle that processing of a characteristic shared across different examples involves a specific subset of model parameters. We perform NPEFF by decomposing each example's Fisher information matrix as a non-negative sum of components. These components take the form of either non-negative vectors or rank-1 positive semi-definite matrices depending on whether we are using diagonal or low-rank Fisher representations, respectively. For the latter form, we introduce a novel and highly scalable algorithm. We demonstrate that components recovered by NPEFF have interpretable tunings through experiments on language and vision models. Using unique properties of NPEFF's parameter-space representations, we ran extensive experiments to verify that the connections between directions in parameters space and examples recovered by NPEFF actually reflect the model's processing. We further demonstrate NPEFF's ability to uncover the actual processing strategies used by a TRACR-compiled model. We further explore a potential application of NPEFF in uncovering and correcting flawed heuristics used by a model. We release our code to facilitate research using NPEFF.
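
For the diagonal-Fisher variant, the decomposition can be sketched by stacking each example's squared gradient (a non-negative vector) into a matrix and factoring it with non-negative matrix factorization. The tiny linear model, the use of the predicted label, and sklearn's NMF are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
import torch
from sklearn.decomposition import NMF

torch.manual_seed(0)
model = torch.nn.Linear(4, 3)                        # stand-in classifier
xs = torch.randn(32, 4)

fishers = []
for x in xs:
    model.zero_grad()
    logits = model(x.unsqueeze(0))
    pred = logits.argmax(dim=-1)
    loss = torch.nn.functional.cross_entropy(logits, pred)   # -log p of the predicted class
    loss.backward()
    grad = torch.cat([p.grad.flatten() for p in model.parameters()])
    fishers.append((grad ** 2).numpy())              # diagonal per-example Fisher (non-negative)

F = np.stack(fishers)                                 # (n_examples, n_parameters)
nmf = NMF(n_components=4, init="nndsvda", max_iter=500)
coeffs = nmf.fit_transform(F)                         # per-example loading on each component
components = nmf.components_                          # shared non-negative parameter patterns
print(coeffs.shape, components.shape)
```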

eess.IV - 2023-10-07

Hardware-Algorithm Co-design Enabling Processing-in-Pixel-in-Memory (P2M) for Neuromorphic Vision Sensors

  • paper_url: http://arxiv.org/abs/2310.16844
  • repo_url: None
  • paper_authors: Md Abdullah-Al Kaiser, Akhilesh R. Jaiswal
  • for: To relieve the energy and bandwidth bottlenecks of resource-constrained edge devices running computer vision workloads.
  • methods: A hardware-algorithm co-design analysis of near-sensor, in-sensor, and in-pixel processing, focusing on in-pixel asynchronous multiply-accumulate (MAC) operations for neuromorphic vision sensors such as dynamic vision sensors (DVS), where charge is accumulated on a passive capacitor in a CMOS implementation.
  • results: The analysis quantifies the trade-off imposed by the capacitor's limited charge retention time on algorithmic accuracy, bandwidth, energy, and training efficiency, and presents techniques that improve the leakage performance of the in-pixel analog MAC operations while preserving the area and energy benefits.
    Abstract The high volume of data transmission between the edge sensor and the cloud processor leads to energy and throughput bottlenecks for resource-constrained edge devices focused on computer vision. Hence, researchers are investigating different approaches (e.g., near-sensor processing, in-sensor processing, in-pixel processing) by executing computations closer to the sensor to reduce the transmission bandwidth. Specifically, in-pixel processing for neuromorphic vision sensors (e.g., dynamic vision sensors (DVS)) involves incorporating asynchronous multiply-accumulate (MAC) operations within the pixel array, resulting in improved energy efficiency. In a CMOS implementation, low overhead energy-efficient analog MAC accumulates charges on a passive capacitor; however, the capacitor's limited charge retention time affects the algorithmic integration time choices, impacting the algorithmic accuracy, bandwidth, energy, and training efficiency. Consequently, this results in a design trade-off on the hardware aspect-creating a need for a low-leakage compute unit while maintaining the area and energy benefits. In this work, we present a holistic analysis of the hardware-algorithm co-design trade-off based on the limited integration time posed by the hardware and techniques to improve the leakage performance of the in-pixel analog MAC operations.

eess.SP - 2023-10-07

OTFS based Joint Radar and Communication: Signal Analysis using the Ambiguity Function

  • paper_url: http://arxiv.org/abs/2310.04947
  • repo_url: None
  • paper_authors: Shalanika Dayarathna, Peter Smith, Rajitha Senanayake, Jamie Evans
  • for: To study the suitability of orthogonal time frequency space (OTFS) modulation for joint radar and communication systems.
  • methods: The ambiguity function (AF) of the OTFS waveform is derived, the effect of data modulation on radar sensing performance is analysed, and the radar global accuracy is characterized.
  • results: Accurate approximations of the mean and variance of the AF with respect to the distribution of the modulated data are obtained (its distribution is approximated by a Rice distribution), and the global radar performance of the OTFS waveform is evaluated against the OFDM waveform.
    Abstract Orthogonal time frequency space (OTFS) modulation has recently been identified as a suitable waveform for joint radar and communication systems. Focusing on the effect of data modulation on the radar sensing performance, we derive the ambiguity function (AF) of the OTFS waveform and characterize the radar global accuracy. We evaluate the behavior of the AF with respect to the distribution of the modulated data and derive an accurate approximation for the mean and variance of the AF, thus, approximating its distribution by a Rice distribution. Finally, we evaluate the global radar performance of the OTFS waveform with the OFDM waveform.
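
The ambiguity function being analysed is the correlation of a waveform with delay- and Doppler-shifted copies of itself; a discrete, cyclic version is sketched below with a random QPSK payload standing in for OTFS-modulated data (an illustrative assumption).

```python
import numpy as np

def ambiguity_function(s: np.ndarray) -> np.ndarray:
    """Return |AF[delay, doppler]| for a length-N waveform, using cyclic shifts."""
    N = len(s)
    af = np.empty((N, N), dtype=complex)
    n = np.arange(N)
    for tau in range(N):
        shifted = np.roll(s, tau)
        for nu in range(N):
            af[tau, nu] = np.sum(s * np.conj(shifted) * np.exp(-2j * np.pi * nu * n / N))
    return np.abs(af) / N

rng = np.random.default_rng(0)
symbols = (rng.choice([-1, 1], 64) + 1j * rng.choice([-1, 1], 64)) / np.sqrt(2)  # QPSK payload
af = ambiguity_function(symbols)
print(af[0, 0], af.max())   # the zero-delay/zero-Doppler peak dominates; sidelobes depend on the data
```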

Linear Least Squares Estimation of Fiber-Longitudinal Optical Power Profile

  • paper_url: http://arxiv.org/abs/2310.04936
  • repo_url: None
  • paper_authors: Takeo Sasai, Minami Takahashi, Masanori Nakamura, Etsushi Yamazaki, Yoshiaki Kisaka
  • for: This paper presents a linear least squares method for fiber-longitudinal power profile estimation (PPE), which estimates the optical signal power distribution along a fiber-optic link at a coherent receiver.
  • methods: The longitudinal power profile is estimated by linear least squares, which attains the global optimum of the least-squares problem and thereby closely matches the true power profile while locating loss anomalies with high spatial resolution.
  • results: Experiments show accurate PPE with an RMS error of 0.18 dB from OTDR, successful identification of a loss anomaly as small as 0.77 dB, feasibility under a WDM condition at optimal launch power, and a quantitative discussion of the fundamental limit on stable estimation and spatial resolution via the condition number of a nonlinear perturbation matrix.
    Abstract This paper presents a linear least squares method for fiber-longitudinal power profile estimation (PPE), which estimates an optical signal power distribution throughout a fiber-optic link at a coherent receiver. The method finds the global optimum in least square estimation of longitudinal power profiles, thus closely matching true optical power profiles and locating loss anomalies in a link with high spatial resolution. Experimental results show that the method achieves accurate PPE with an RMS error from OTDR of 0.18 dB. Consequently, it successfully identifies a loss anomaly as small as 0.77 dB, demonstrating the potential of a coherent receiver in locating even splice and connector losses. The method is also evaluated under a WDM condition with optimal system fiber launch power, highlighting its feasibility for use in operations. Furthermore, a fundamental limit for stable estimation and spatial resolution of least-squares-based PPE is quantitatively discussed in relation to the ill-posedness of PPE by evaluating the condition number of a nonlinear perturbation matrix.

A Grouping-based Scheduler for Efficient Channel Utilization under Age of Information Constraints

  • paper_url: http://arxiv.org/abs/2310.04817
  • repo_url: None
  • paper_authors: Lehan Wang, Jingzhou Sun, Yuxuan Sun, Sheng Zhou, Zhisheng Niu
  • for: 这个论文是为了解决一个复杂的大规模状态更新系统中的一个问题,即一个融合中心从多个来源收集状态信息,每个来源都有其自己的年龄信息(AoI)约束。
  • methods: 该论文提出了一种分组方法,即将来源分为不同的分组,以解决这个复杂的问题。具体来说,首先将来源按照AoI约束进行分组,然后为每个分组设计优化的协调器。
  • results: 仿真结果显示,与近期工作相比,所提出的两步分组算法(TGA)显著降低了信道占用,且当来源数量较多时仅比推导出的下界高0.42%。
    Abstract We consider a status information updating system where a fusion center collects the status information from a large number of sources and each of them has its own age of information (AoI) constraints. A novel grouping-based scheduler is proposed to solve this complex large-scale problem by dividing the sources into different scheduling groups. The problem is then transformed into deriving the optimal grouping scheme. A two-step grouping algorithm (TGA) is proposed: 1) Given AoI constraints, we first identify the sources with harmonic AoI constraints, then design a fast grouping method and an optimal scheduler for these sources. Under harmonic AoI constraints, each constraint is divisible by the smallest one and the sum of reciprocals of the constraints with the same value is divisible by the reciprocal of the smallest one. 2) For the other sources without such a special property, we pack the sources which can be scheduled together with minimum update rates into the same group. Simulations show the channel usage of the proposed TGA is significantly reduced as compared to a recent work and is 0.42% larger than a derived lower bound when the number of sources is large.
    摘要 我们考虑一个状态信息更新系统:融合中心从大量来源收集状态信息,每个来源都有自己的信息年龄(AoI)约束。我们提出一种基于分组的调度器,将来源划分到不同的调度组,从而把这一复杂的大规模问题转化为求解最优分组方案。我们提出两步分组算法(TGA):1)在给定AoI约束下,先识别具有调和(harmonic)AoI约束的来源,并为其设计快速分组方法与最优调度器;在调和约束下,每个约束都能被最小约束整除,且取值相同的约束的倒数之和能被最小约束的倒数整除。2)对不具备该性质的其余来源,将能够以最小更新速率一起调度的来源打包到同一组。仿真结果表明,与近期工作相比,所提TGA显著降低了信道占用,且当来源数量较多时仅比推导出的下界高0.42%。
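
The harmonic-constraint condition used in step 1 of TGA is easy to check programmatically. The sketch below encodes my reading of the abstract's two conditions (divisibility by the smallest constraint, and the reciprocal-sum condition); it is illustrative, not code from the paper.

```python
from collections import Counter
from fractions import Fraction

def is_harmonic(constraints):
    """Check the harmonic-AoI condition described in the abstract (my reading):
    (1) every constraint is divisible by the smallest one, and
    (2) for each distinct value v, the sum of reciprocals of the constraints
        equal to v (i.e. count/v) is divisible by 1/min(constraints)."""
    m = min(constraints)
    if any(c % m != 0 for c in constraints):
        return False
    for v, count in Counter(constraints).items():
        ratio = Fraction(count, v) / Fraction(1, m)   # = count * m / v
        if ratio.denominator != 1:
            return False
    return True

print(is_harmonic([2, 4, 4, 8, 8, 8, 8]))   # True: counts line up with 4/2 and 8/2
print(is_harmonic([2, 4, 8]))               # False: a single source with constraint 4 is not enough
```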

Age of Information Guaranteed Scheduling for Asynchronous Status Updates in Collaborative Perception

  • paper_url: http://arxiv.org/abs/2310.04813
  • repo_url: None
  • paper_authors: Lehan Wang, Jingzhou Sun, Yuxuan Sun, Sheng Zhou, Zhisheng Niu
  • for: The paper is written for collaborative perception (CP) systems where a fusion center monitors various regions using multiple sources, and the center has different age of information (AoI) constraints for different regions.
  • methods: The paper proposes an algorithm called scheduling for CP with asynchronous status updates (SCPA) to minimize the number of required channels subject to AoI constraints with asynchronous status updates.
  • results: According to numerical results, the number of channels required by SCPA can reach only 12% more than a derived lower bound.
    Abstract We consider collaborative perception (CP) systems where a fusion center monitors various regions by multiple sources. The center has different age of information (AoI) constraints for different regions. Multi-view sensing data for a region generated by sources can be fused by the center for a reliable representation of the region. To ensure accurate perception, differences between generation time of asynchronous status updates for CP fusion should not exceed a certain threshold. An algorithm named scheduling for CP with asynchronous status updates (SCPA) is proposed to minimize the number of required channels and subject to AoI constraints with asynchronous status updates. SCPA first identifies a set of sources that can satisfy the constraints with minimum updating rates. It then chooses scheduling intervals and offsets for the sources such that the number of required channels is optimized. According to numerical results, the number of channels required by SCPA can reach only 12% more than a derived lower bound.
    摘要 我们考虑了协同感知(CP)系统,其中统计中心监控多个区域,并且从多个来源获取多元观察数据。中心具有不同的资讯年龄(AoI)限制 для不同的区域。多元感知数据 для一个区域由来源生成,可以在中心进行融合,以获得区域的可靠表现。确保正确感知,协同感知融合中的不同时间生成的 asynchronous status updates 差异应小于一定阈值。一个名为协同感知 avec asynchronous status updates 的算法(SCPA)被提出,以最小化需要的通道数量,并且遵循 AoI 限制。SCPA 首先 identific 一群可以满足限制的来源,然后选择这些来源的调度间隔和偏移量,以便优化通道数量。根据数据显示,SCPA 可以对需要的通道数量进行最小化,与一个 derive 的下限相差只有12%。

Score-based Diffusion Models With Self-supervised Learning For Accelerated 3D Multi-contrast Cardiac Magnetic Resonance Imaging

  • paper_url: http://arxiv.org/abs/2310.04669
  • repo_url: None
  • paper_authors: Yuanyuan Liu, Zhuo-Xu Cui, Congcong Liu, Hairong Zheng, Haifeng Wang, Yihang Zhou, Yanjie Zhu
  • for: 这种研究旨在加速3D-MC-CMR成像过程,以提高其广泛应用的可行性。
  • methods: 研究人员提出了一种基于自我监督学习的新方法,使用得分基于扩散模型来加速3D-MC-CMR成像。这种方法首先建立了欠采样k空间测量和MR图像之间的映射,然后使用自我监督 Bayesian 重建网络来恢复MR图像。最后,研究人员开发了一种三维分布的分数基于扩散模型,以捕捉3D-MC-CMR图像的自然分布。
  • results: 实验结果表明,该方法比传统的压缩感知和现有的自我监督深度学习MRI重建方法更高效。它还可以在高达14倍的加速率下获得高质量的T1和T1rho参数地图,与参照地图相似。
    Abstract Long scan time significantly hinders the widespread applications of three-dimensional multi-contrast cardiac magnetic resonance (3D-MC-CMR) imaging. This study aims to accelerate 3D-MC-CMR acquisition by a novel method based on score-based diffusion models with self-supervised learning. Specifically, we first establish a mapping between the undersampled k-space measurements and the MR images, utilizing a self-supervised Bayesian reconstruction network. Secondly, we develop a joint score-based diffusion model on 3D-MC-CMR images to capture their inherent distribution. The 3D-MC-CMR images are finally reconstructed using the conditioned Langevin Markov chain Monte Carlo sampling. This approach enables accurate reconstruction without fully sampled training data. Its performance was tested on the dataset acquired by a 3D joint myocardial T1 and T1rho mapping sequence. The T1 and T1rho maps were estimated via a dictionary matching method from the reconstructed images. Experimental results show that the proposed method outperforms traditional compressed sensing and existing self-supervised deep learning MRI reconstruction methods. It also achieves high quality T1 and T1rho parametric maps close to the reference maps obtained by traditional mapping sequences, even at a high acceleration rate of 14.
    摘要 长时间扫描减少了三维多contrast室内Magnetic Resonance成像(3D-MC-CMR)的广泛应用。本研究目的是加速3D-MC-CMR获取的速度。我们采用了一种基于得分分布模型的自我超vised学习方法。首先,我们使用一种自我超vised Bayesian重建网络将不完全样本空间测量映射到MR图像中。其次,我们开发了一种三维分布Score-based扩散模型,以捕捉3D-MC-CMR图像的内在分布。最后,我们使用 conditioned Langenvin Markov chain Monte Carlo采样来重建图像。这种方法可以准确重建图像,不需要完全的样本数据。我们对一个3D联合肌肉T1和T1rho映射序列上获取的数据进行了实验测试。结果表明,我们的方法比传统的压缩感知和现有的自我超vised深度学习MRI重建方法更好。它还可以在高速度减少14%的情况下获得高质量的T1和T1rho参数图像,与传统映射序列中的参考图像几乎相同。
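
The dictionary matching step used for the T1 and T1rho maps is typically a maximum-correlation lookup against simulated signal evolutions. The sketch below is generic: the mono-exponential signal model and parameter grid are stand-ins, not the paper's joint T1/T1rho sequence model.

```python
import numpy as np

def dictionary_match(signals, dictionary, params):
    """Assign each measured signal evolution to the dictionary atom with the
    highest normalized inner product, returning the matched parameter values.

    signals:    (n_voxels, n_timepoints) measured fingerprints
    dictionary: (n_atoms, n_timepoints) simulated evolutions
    params:     (n_atoms,) parameter value (e.g. T1) for each atom
    """
    d = dictionary / np.linalg.norm(dictionary, axis=1, keepdims=True)
    s = signals / np.linalg.norm(signals, axis=1, keepdims=True)
    best = np.argmax(np.abs(s @ d.T), axis=1)   # maximum-correlation match
    return params[best]

# Toy example with a mono-exponential recovery model (a stand-in signal model).
t = np.linspace(0.1, 3.0, 30)                       # seconds
T1_grid = np.linspace(0.3, 2.0, 200)                # candidate T1 values
dictionary = 1.0 - np.exp(-t[None, :] / T1_grid[:, None])
true_T1 = np.array([0.8, 1.4])
signals = 1.0 - np.exp(-t[None, :] / true_T1[:, None])
print(dictionary_match(signals, dictionary, T1_grid))   # approximately [0.8, 1.4]
```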

Space Observation by the Australia Telescope Compact Array: Performance Characterization using GPS Satellite Observation

  • paper_url: http://arxiv.org/abs/2310.04653
  • repo_url: None
  • paper_authors: Hamed Nosrati, Stephanie Smith, Douglas B. Hayman
  • for: 使澳大利亚望远镜紧凑阵列(ATCA)能够用于空间态势感知(SSA)应用
  • methods: 基于干扰数据的系统模型,实现距离和方向估算
  • results: 与最新的二线元素(TLE)进行比较,显示距离和方向信息都得到了明显改善
    Abstract In order to operationalize the Australia Telescope Compact Array (ATCA) for space situational awareness (SSA) applications, we develop a system model for range and direction of arrival (DOA) estimation based on the interferometric data. We employ the observational data collected from global positioning system (GPS) satellites to evaluate the developed model and demonstrate that, compared to a priori location propagated from the most recent two-line element (TLE), both range and direction information are improved significantly.
    摘要 为了使澳大利亚望远镜数组(ATCA)用于空间定位意识(SSA)应用,我们开发了基于互相干涉数据的系统模型,以便估算距离和方向信息。我们使用全球定位系统(GPS)卫星的观测数据来评估我们开发的模型,并证明在比之前两行元素(TLE)的位置进行质量提高后,距离和方向信息都得到了显著改善。

cs.SD - 2023-10-06

DPM-TSE: A Diffusion Probabilistic Model for Target Sound Extraction

  • paper_url: http://arxiv.org/abs/2310.04567
  • repo_url: None
  • paper_authors: Jiarui Hai, Helin Wang, Dongchao Yang, Karan Thakkar, Najim Dehak, Mounya Elhilali
  • for: target sound extraction (TSE)
  • methods: diffusion probabilistic modeling (DPM)
  • results: cleaner target renderings and improved separability from unwanted sounds, with significant improvement in perceived quality
    Abstract Common target sound extraction (TSE) approaches primarily relied on discriminative approaches in order to separate the target sound while minimizing interference from the unwanted sources, with varying success in separating the target from the background. This study introduces DPM-TSE, a first generative method based on diffusion probabilistic modeling (DPM) for target sound extraction, to achieve both cleaner target renderings as well as improved separability from unwanted sounds. The technique also tackles common background noise issues with DPM by introducing a correction method for noise schedules and sample steps. This approach is evaluated using both objective and subjective quality metrics on the FSD Kaggle 2018 dataset. The results show that DPM-TSE has a significant improvement in perceived quality in terms of target extraction and purity.
    摘要 常见的目标声音提取(TSE)方法主要依靠判别式方法来分离目标声音并尽量减少无关声源的干扰,其分离效果参差不齐。本研究提出DPM-TSE,这是首个基于扩散概率模型(DPM)的目标声音提取生成式方法,既能得到更干净的目标声音,又能提高与无关声音的可分性。该方法还通过对噪声调度和采样步骤引入修正,解决了DPM常见的背景噪声问题。我们在FSD Kaggle 2018数据集上使用客观与主观质量指标进行评估,结果表明DPM-TSE在目标提取与纯净度方面的感知质量有显著提升。

Analysis on the Influence of Synchronization Error on Fixed-filter Active Noise Control

  • paper_url: http://arxiv.org/abs/2310.04249
  • repo_url: None
  • paper_authors: Guo Yu
  • for: This study investigates the synchronization error of digital active noise control (ANC) systems.
  • methods: The study adopts the fixed-filter strategy, a viable alternative to traditional adaptive algorithms that sidesteps their computational complexity and instability, at a potential cost in noise-reduction efficacy.
  • results: The paper provides a theoretical investigation of the synchronization error of the digital ANC system.
    Abstract The efficacy of active noise control technology in mitigating urban noise, particularly in relation to low-frequency components, has been well-established. In the realm of traditional academic research, adaptive algorithms, such as the filtered reference least mean square method, are extensively employed to achieve real-time noise reduction in many applications. Nevertheless, the utilization of this technology in commercial goods is often hindered by its significant computing complexity and inherent instability. In this particular scenario, the adoption of the fixed-filter strategy emerges as a viable alternative for addressing these challenges, albeit with a potential trade-off in terms of noise reduction efficacy. This work aims to conduct a theoretical investigation into the synchronization error of the digital Active Noise Control (ANC) system. Keywords: Fixed-filter, Active noise control, Multichannel active noise control.
    摘要 主动噪声控制技术在缓解城市噪声(尤其是低频分量)方面的有效性已得到充分证明。在传统学术研究中,滤波参考最小均方(filtered reference least mean square)等自适应算法被广泛用于多种应用中的实时降噪。然而,该技术在商业产品中的应用常受制于较高的计算复杂度和固有的不稳定性。在这种情况下,固定滤波器策略成为应对这些挑战的可行替代方案,尽管可能以牺牲部分降噪效果为代价。本工作旨在对数字主动噪声控制(ANC)系统的同步误差进行理论研究。关键词:固定滤波器,主动噪声控制,多通道主动噪声控制。
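
For context, the filtered-reference LMS (FxLMS) update named in the abstract is the adaptive baseline that a fixed-filter controller replaces. Below is a minimal single-channel FxLMS sketch with synthetic primary and secondary paths and an assumed-perfect secondary-path estimate; it illustrates the standard algorithm, not the paper's synchronization-error analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 20000
x = rng.standard_normal(N)                            # reference noise signal
primary = np.array([0.0, 0.0, 0.0, 0.6, 0.3, 0.1])    # primary path P(z) (synthetic)
secondary = np.array([0.0, 0.0, 0.8, 0.2])            # secondary path S(z) (synthetic)
s_hat = secondary.copy()                              # assume a perfect estimate of S(z)

d = np.convolve(x, primary)[:N]                       # disturbance at the error microphone
x_filt = np.convolve(x, s_hat)[:N]                    # filtered reference x'(n)

w = np.zeros(8)                                       # adaptive control filter W(z)
mu = 0.01                                             # step size
x_buf = np.zeros_like(w)
fx_buf = np.zeros_like(w)
y_buf = np.zeros(len(secondary))
err = np.zeros(N)

for n in range(N):
    x_buf = np.roll(x_buf, 1); x_buf[0] = x[n]
    y = w @ x_buf                                     # anti-noise sample
    y_buf = np.roll(y_buf, 1); y_buf[0] = y
    e = d[n] - secondary @ y_buf                      # residual picked up by the error mic
    fx_buf = np.roll(fx_buf, 1); fx_buf[0] = x_filt[n]
    w += mu * e * fx_buf                              # FxLMS weight update
    err[n] = e

print("mean-square error, first vs last 1000 samples:",
      np.mean(err[:1000] ** 2), np.mean(err[-1000:] ** 2))
```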

U-Style: Cascading U-nets with Multi-level Speaker and Style Modeling for Zero-Shot Voice Cloning

  • paper_url: http://arxiv.org/abs/2310.04004
  • repo_url: None
  • paper_authors: Tao Li, Zhichao Wang, Xinfa Zhu, Jian Cong, Qiao Tian, Yuping Wang, Lei Xie
  • for: zero-shot speaker cloning, synthesize speech for any target speaker unseen during TTS system building
  • methods: employ Grad-TTS as the backbone, cascade speaker- and style-specific encoders between text encoder and diffusion decoder, use signal perturbation to explicitly decompose into speaker- and style-specific modeling parts
  • results: significantly surpass state-of-the-art methods in unseen speaker cloning regarding naturalness and speaker similarity, achieve flexible combinations of desired speaker timbre and style in zero-shot voice cloning
    Abstract Zero-shot speaker cloning aims to synthesize speech for any target speaker unseen during TTS system building, given only a single speech reference of the speaker at hand. Although more practical in real applications, the current zero-shot methods still produce speech with undesirable naturalness and speaker similarity. Moreover, endowing the target speaker with arbitrary speaking styles in the zero-shot setup has not been considered. This is because the unique challenge of zero-shot speaker and style cloning is to learn the disentangled speaker and style representations from only short references representing an arbitrary speaker and an arbitrary style. To address this challenge, we propose U-Style, which employs Grad-TTS as the backbone, particularly cascading a speaker-specific encoder and a style-specific encoder between the text encoder and the diffusion decoder. Thus, leveraging signal perturbation, U-Style is explicitly decomposed into speaker- and style-specific modeling parts, achieving better speaker and style disentanglement. To improve unseen speaker and style modeling ability, these two encoders conduct multi-level speaker and style modeling by skip-connected U-nets, incorporating the representation extraction and information reconstruction process. Besides, to improve the naturalness of synthetic speech, we adopt mean-based instance normalization and style adaptive layer normalization in these encoders to perform representation extraction and condition adaptation, respectively. Experiments show that U-Style significantly surpasses the state-of-the-art methods in unseen speaker cloning regarding naturalness and speaker similarity. Notably, U-Style can transfer the style from an unseen source speaker to another unseen target speaker, achieving flexible combinations of desired speaker timbre and style in zero-shot voice cloning.

Music Recommendation Based on Audio Fingerprint

  • paper_url: http://arxiv.org/abs/2310.17655
  • repo_url: None
  • paper_authors: Diego Saldaña Ulloa
  • for: 用于建立一种音乐推荐过程中的更加稳健的指纹。
  • methods: 将不同的音频特征结合使用,以获得一个高维向量。然后,使用PCA方法选择95%的主成分,以减少值的数量。最后,计算每个指纹与整个数据集的相似性矩阵。
  • results: 使用这些PCA-指纹,实现了89%的成功推荐率(推荐的歌曲的类别与目标歌曲的类别匹配),基于200首个人音乐库中的歌曲,每首歌曲被标注为相应的歌手们的类别。
    Abstract This work combined different audio features to obtain a more robust fingerprint to be used in a music recommendation process. The combination of these methods resulted in a high-dimensional vector. To reduce the number of values, PCA was applied to the set of resulting fingerprints, selecting the number of principal components that corresponded to an explained variance of $95\%$. Finally, with these PCA-fingerprints, the similarity matrix of each fingerprint with the entire data set was calculated. The process was applied to 200 songs from a personal music library; the songs were tagged with the artists' corresponding genres. The recommendations (fingerprints of songs with the closest similarity) were rated successful if the recommended songs' genre matched the target songs' genre. With this procedure, it was possible to obtain an accuracy of $89\%$ (successful recommendations out of total recommendation requests).
    摘要 这个工作将不同的音频特征结合起来,以获得更加鲜明的音乐推荐指标。这些方法的组合导致了一个高维度的向量。为了减少值的数量,对这些指标进行了PCA处理,选择了Explained variance的95%。最后,使用这些PCA指标,计算了每个指标与整个数据集的相似性矩阵。这个过程采用了200首个人音乐库中的歌曲,这些歌曲被标注为艺术家的相应类别。推荐(与整个数据集最相似的歌曲指标)被评估为成功,如果推荐的歌曲的类别与目标歌曲的类别匹配。通过这种方式,可以获得89%的准确率(成功推荐请求数量 / 总推荐请求数量)。
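
The pipeline described (concatenate audio features per song, keep the principal components explaining 95% of the variance, then build a song-to-song similarity matrix) maps directly onto scikit-learn. In the sketch below the feature matrix is a random stand-in and cosine similarity is an assumption; the abstract does not commit to a specific feature extractor or similarity measure.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

def recommend(fingerprints, query_index, k=5):
    """fingerprints: (n_songs, n_features) concatenated audio features per song."""
    # Keep enough principal components to explain 95% of the variance,
    # as described in the abstract.
    reduced = PCA(n_components=0.95).fit_transform(fingerprints)
    sim = cosine_similarity(reduced)              # full similarity matrix over the library
    order = np.argsort(-sim[query_index])
    return [i for i in order if i != query_index][:k]

# Toy usage with random stand-in features; real fingerprints would come from the
# audio (e.g. spectral and rhythmic descriptors concatenated into one vector per song).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 512))
print(recommend(X, query_index=0))
```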

Layer-Adapted Implicit Distribution Alignment Networks for Cross-Corpus Speech Emotion Recognition

  • paper_url: http://arxiv.org/abs/2310.03992
  • repo_url: None
  • paper_authors: Yan Zhao, Yuan Zong, Jincen Wang, Hailun Lian, Cheng Lu, Li Zhao, Wenming Zheng
  • for: The paper proposes a new unsupervised domain adaptation method called LIDAN to address the challenge of cross-corpus speech emotion recognition.
  • methods: LIDAN extends the previous ICASSP work, DIDAN, by introducing a novel regularization term called layer-adapted implicit distribution alignment (LIDA) that considers emotion labels at different levels of granularity.
  • results: LIDAN surpasses recent state-of-the-art explicit unsupervised DA methods in tackling cross-corpus SER tasks, as demonstrated by extensive experiments on the EmoDB, eNTERFACE, and CASIA corpora.
    Abstract In this paper, we propose a new unsupervised domain adaptation (DA) method called layer-adapted implicit distribution alignment networks (LIDAN) to address the challenge of cross-corpus speech emotion recognition (SER). LIDAN extends our previous ICASSP work, deep implicit distribution alignment networks (DIDAN), whose key contribution lies in the introduction of a novel regularization term called implicit distribution alignment (IDA). This term allows DIDAN trained on source (training) speech samples to remain applicable to predicting emotion labels for target (testing) speech samples, regardless of corpus variance in cross-corpus SER. To further enhance this method, we extend IDA to layer-adapted IDA (LIDA), resulting in LIDAN. This layer-adpated extention consists of three modified IDA terms that consider emotion labels at different levels of granularity. These terms are strategically arranged within different fully connected layers in LIDAN, aligning with the increasing emotion-discriminative abilities with respect to the layer depth. This arrangement enables LIDAN to more effectively learn emotion-discriminative and corpus-invariant features for SER across various corpora compared to DIDAN. It is also worthy to mention that unlike most existing methods that rely on estimating statistical moments to describe pre-assumed explicit distributions, both IDA and LIDA take a different approach. They utilize an idea of target sample reconstruction to directly bridge the feature distribution gap without making assumptions about their distribution type. As a result, DIDAN and LIDAN can be viewed as implicit cross-corpus SER methods. To evaluate LIDAN, we conducted extensive cross-corpus SER experiments on EmoDB, eNTERFACE, and CASIA corpora. The experimental results demonstrate that LIDAN surpasses recent state-of-the-art explicit unsupervised DA methods in tackling cross-corpus SER tasks.
    摘要 在这篇论文中,我们提出了一种新的无监督领域适应(DA)方法,即层 adapted implicit distribution alignment networks(LIDAN),以解决跨 Corpora 的语音情感识别(SER)挑战。LIDAN 是我们之前的 ICASSP 工作的扩展,深度隐式分布对接网络(DIDAN),其关键贡献在于引入了一个新的正则化项 called implicit distribution alignment(IDA)。这个项使得 DIDAN 在 source 语音样本训练后可以有效地预测 testing 语音样本的情感标签,不管跨 Corpora 的语音样本变化。为了进一步改进这种方法,我们延伸 IDA 到层 adapted IDA(LIDA),得到 LIDAN。这个层 adapted 扩展包括三个修改后的 IDA 项,这些项在不同的全连接层中适应不同的情感细分水平。这种适应安排使得 LIDAN 可以更好地学习语音样本中的情感特征,并且可以更好地适应不同 Corpora 的语音样本。值得一提的是,不同于大多数现有方法,DIDAN 和 LIDAN 不需要 estimating 统计 moments 来描述预设的显式分布,而是直接使用 target 样本重建的思想,从而bridge 特征分布差距。因此,DIDAN 和 LIDAN 可以视为隐式跨 Corpora SER 方法。为了评估 LIDAN,我们在 EmoDB、eNTERFACE 和 CASIA corpora 上进行了广泛的 cross-Corpus SER 实验。实验结果表明,LIDAN 超越了最近的显式无监督 DA 方法,在跨 Corpora SER 任务中表现出色。

Zero-Shot Emotion Transfer For Cross-Lingual Speech Synthesis

  • paper_url: http://arxiv.org/abs/2310.03963
  • repo_url: None
  • paper_authors: Yuke Li, Xinfa Zhu, Yi Lei, Hai Li, Junhui Liu, Danming Xie, Lei Xie
  • for: 该研究旨在实现跨语言语音合成中的零样本情感迁移,即把源语言任意参考语音中的情感迁移到目标语言的合成语音上。
  • methods: 这个研究使用了 DelightfulTTS нейрон网络架构,并导入了特定设计的模组来模型不同语言的语言特有的调变特征和语言共享的情感表现。 Specifically, 使用了 non-autoregressive predictive coding (NPC) 模组来学习语言特有的 speech 调变,并从 HuBERT 预训练模型中提取了具有强一致能力的共享情感表现。 此外,还使用了层次情感模型来捕捉不同语言之间的更全面的情感表现。
  • results: 实验结果表明,所提框架能够在没有情感训练数据的情况下,为单语言目标说话人合成带情感的双语语音。
    Abstract Zero-shot emotion transfer in cross-lingual speech synthesis aims to transfer emotion from an arbitrary speech reference in the source language to the synthetic speech in the target language. Building such a system faces challenges of unnatural foreign accents and difficulty in modeling the shared emotional expressions of different languages. Building on the DelightfulTTS neural architecture, this paper addresses these challenges by introducing specifically-designed modules to model the language-specific prosody features and language-shared emotional expressions separately. Specifically, the language-specific speech prosody is learned by a non-autoregressive predictive coding (NPC) module to improve the naturalness of the synthetic cross-lingual speech. The shared emotional expression between different languages is extracted from a pre-trained self-supervised model HuBERT with strong generalization capabilities. We further use hierarchical emotion modeling to capture more comprehensive emotions across different languages. Experimental results demonstrate the proposed framework's effectiveness in synthesizing bi-lingual emotional speech for the monolingual target speaker without emotional training data.
    摘要 zero-shot 情感传递在跨语言speech sintesis中目标是将来源语言中的任意speech作为参考,将情感传递到目标语言的synthetic speech中。建立这种系统面临着不自然的外语口音和不同语言之间共享的情感表达模型化的挑战。基于DelightfulTTS神经网络架构,本文通过特制的模块来分立语言特有的态度特征和共享的情感表达,以提高跨语言speech的自然性。具体来说,使用非autoregressive predictive coding(NPC)模块来学习语言特有的speech态度,以提高跨语言speech的自然性。同时,使用层次情感模型来捕捉不同语言之间的共享情感。实验结果表明我们提出的框架能够在没有情感培训数据的情况下,为单语言target speakerSynthesize bi-lingual emotional speech。

eess.AS - 2023-10-06

Optimal model-based beamforming and independent steering for spherical loudspeaker arrays

  • paper_url: http://arxiv.org/abs/2310.04202
  • repo_url: None
  • paper_authors: Boaz Rafaely, Dima Khaykin
  • for: 研究方向性喇声广播的方法,使用圆形喇声器阵列控制声波的三维空间方向性。
  • methods: 使用物理模型基于的优化框架,在圆形喇声器阵列中实现独立执行。
  • results: 实验证明了理论框架的可靠性,并且在圆形喇声器阵列中实现了独立执行。
    Abstract Spherical loudspeaker arrays have been recently studied for directional sound radiation, where the compact arrangement of the loudspeaker units around a sphere facilitated the control of sound radiation in three-dimensional space. Directivity of sound radiation, or beamforming, was achieved by driving each loudspeaker unit independently, where the design of beamforming weights was typically achieved by numerical optimization with reference to a given desired beam pattern. This is in contrast to the methods already developed for microphone arrays in general and spherical microphone arrays in particular, where beamformer weights are designed to satisfy a wider range of objectives, related to directivity, robustness, and side-lobe level, for example. This paper presents the development of a physical-model-based, optimal beamforming framework for spherical loudspeaker arrays, similar to the framework already developed for spherical microphone arrays, facilitating efficient beamforming in the spherical harmonics domain, with independent steering. In particular, it is shown that from a beamforming perspective, the spherical loudspeaker array is similar to the spherical microphone array with microphones arranged around a rigid sphere. Experimental investigation validates the theoretical framework of beamformer design.
    摘要 圆形 loudspeaker 阵列在近期研究中被用于指向性声波发射,其中圆形 loudspeaker 单元的紧凑排布使得三维空间中声波发射的控制变得更加容易。通过独立驱动每个 loudspeaker 单元,实现了声波发射的指向性,也就是 beamforming。与现有的 Microphone 阵列和圆形 Microphone 阵列的方法不同,这里的 beamforming 权重设计通常通过数字优化来实现,以满足更加宽泛的目标,包括指向性、Robustness 和侧射强度等。本文介绍了一种基于物理模型的、优化 beamforming 框架 для圆形 loudspeaker 阵列,与圆形 Microphone 阵列的框架类似,可以有效地在圆函数频谱中进行 beamforming,并且可以独立控制声波发射的方向。特别是,从 beamforming 的视角来看,圆形 loudspeaker 阵列与圆形 Microphone 阵列的声波发射方式类似。实验室调查 validate 了这种理论框架。

Zones of quiet in a broadband diffuse sound field

  • paper_url: http://arxiv.org/abs/2310.04191
  • repo_url: None
  • paper_authors: Boaz Rafaely
  • for: 本研究探讨了宽带扩散声场中的安静区域,并利用最新的宽带扩散声场空间-时间相关性结果建立理论框架,用于研究局部主动声控制系统(如主动头枕)在控制宽带噪声时所受的声学限制。
  • methods: 本研究先回顾空间-时间相关性,随后推导了次级声源近场与远场中扩散声场的安静区域,并通过仿真比较纯音与宽带激励下的安静区域。
  • results: 研究结果表明,在低通滤波后的杂音场中,安静区域的大小与中心频率相关,并且在一定程度上可以通过对各种杂音场进行分析和计算来预测安静区域的大小。
    Abstract The zones of quiet in pure-tone diffuse sound fields have been studied extensively in the past, both theoretically and experimentally, with the well known result of the 10 dB attenuation extending to about a tenth of a wavelength. Recent results on the spatial-temporal correlation of broadband diffuse sound fields are used in this study to develop a theoretical framework for predicting the extension of the zones of quiet in broadband diffuse sound fields. This can be used to study the acoustic limitations imposed on local active sound control systems such as an active headrest when controlling broadband noise. Spatial-temporal correlation is first revised, after which derivations of the diffuse field zones of quiet in the near-field and the far-field of the secondary source are presented. The theoretical analysis is supported by simulation examples comparing the zones of quiet for diffuse fields excited by tonal and broadband signals. It is shown that as a first approximation the zone of quiet of a low-pass filtered noise is comparable to that of a pure-tone with a frequency equal to the center frequency of the broadband noise bandwidth.
    摘要 在过去,混响频率场中的幽静区域已经得到了广泛的研究,both theoretically和experimentally,以得到知名的10dB抑制范围延伸约为一个波长的一半。在这种研究中,我们使用了最近的广band混响场的空间时间相关性研究,开发了一种用于预测混响场中幽静区域的理论框架。这可以用来研究控制广band噪声的地方活动声控系统,如活动头rest。首先,我们修改了空间时间相关性,然后提出了混响场中幽静区域的近场和远场 derivations。 theoretical分析得到了通过对比幽静区域的混响场 excited by tonal和广band信号的simulation例子。结果显示,作为一个初步的approximation,混响场中幽静区域的zone of quiet与一个中心频率为混响场宽频率范围的低通滤波器噪声的zone of quiet几乎相同。
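
A quick numeric check of the tenth-of-a-wavelength rule of thumb quoted above can be made with the classical pure-tone diffuse-field expression for the residual around a cancellation point, <|e(r)|^2>/<|p|^2> ~ 2(1 - sinc(kr)), valid away from the secondary source near field. Treating that relation as an assumption for illustration:

```python
import numpy as np

# Residual level (re. the uncontrolled diffuse field) around a point of perfect
# cancellation, using the assumed pure-tone relation 2 * (1 - sinc(k r)).
wavelength = 1.0
k = 2 * np.pi / wavelength
r = np.linspace(1e-4, 0.3 * wavelength, 1000)
residual_db = 10 * np.log10(2 * (1 - np.sinc(k * r / np.pi)))  # np.sinc(x) = sin(pi x)/(pi x)

r10 = r[np.argmax(residual_db > -10)]   # first radius where attenuation falls below 10 dB
print(f"10 dB zone of quiet extends to roughly {r10 / wavelength:.2f} wavelengths")
```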

Spatial sampling and beamforming for spherical microphone arrays

  • paper_url: http://arxiv.org/abs/2310.04169
  • repo_url: None
  • paper_authors: Boaz Rafaely
  • for: 这篇论文主要写于圆形麦克风阵的声学记录、语音通信和室内噪声分析等领域。
  • methods: 论文总结了最近的空间采样方法,包括圆形麦克风阵的各种配置,从单一的固定圆形到自由位置的麦克风。同时还介绍了各种射频方法,包括延迟和总和法和道尔芬-切比雪夫法,以及更高级的优化方法,通常在圆形傅里叶域中进行。
  • results: 论文回顾了最近的圆形麦克风阵 beamforming 方法的进展,包括延迟和总和法、道尔芬-切比雪夫法以及更高级的优化方法。
    Abstract Spherical microphone arrays have been recently studied for spatial sound recording, speech communication, and sound field analysis for room acoustics and noise control. Complementary theoretical studies presented progress in spatial sampling and beamforming methods. This paper reviews recent results in spatial sampling that facilitate a wide range of spherical array configurations, from a single rigid sphere to free positioning of microphones. The paper then presents an overview of beamforming methods recently presented for spherical arrays, from the widely used delay-and-sum and Dolph-Chebyshev, to the more advanced optimal methods, typically performed in the spherical harmonics domain.
    摘要 圆形微型麦克风数组在声学记录、语音通信和室内声学雷达控制中得到了最近的研究。相关理论研究提出了在圆形麦克风数组中的空间抽样和扩散方法的进步。本文将介绍最近在圆形麦克风数组中的空间抽样技术,从单一固定圆形麦克风到自由位置的麦克风。然后将介绍圆形麦克风数组中的扩散方法,从通用的延迟和总和到更高级的优化方法,通常在圆形傅里叶域内进行。
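
The delay-and-sum beamformer mentioned above is easy to sketch for a single frequency bin. The example below assumes an open-sphere, far-field plane-wave steering model with synthetic microphone positions; a rigid sphere would add a scattering term, and the paper's optimal methods work in the spherical harmonics domain instead.

```python
import numpy as np

def steering_vector(mic_pos, direction, k):
    """Far-field steering vector for a plane wave arriving from unit vector
    `direction` (pointing from the array toward the source)."""
    return np.exp(1j * k * mic_pos @ direction)

def delay_and_sum(pressures, mic_pos, look_dir, k):
    """Conventional delay-and-sum beamformer output for one frequency bin."""
    v = steering_vector(mic_pos, look_dir, k)
    w = v / len(v)                      # uniform weighting
    return np.vdot(w, pressures)        # w^H p

# Toy example: 32 microphones on a sphere of radius 4.2 cm (open-sphere assumption).
rng = np.random.default_rng(0)
mic_pos = rng.standard_normal((32, 3))
mic_pos = 0.042 * mic_pos / np.linalg.norm(mic_pos, axis=1, keepdims=True)

f, c = 2000.0, 343.0
k = 2 * np.pi * f / c
src = np.array([1.0, 0.0, 0.0])                 # true source direction
p = steering_vector(mic_pos, src, k)            # noiseless plane-wave pressures

for look in (src, np.array([0.0, 1.0, 0.0])):
    print(look, abs(delay_and_sum(p, mic_pos, look, k)))   # on-axis gain is 1
```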

A privacy-preserving method using secret key for convolutional neural network-based speech classification

  • paper_url: http://arxiv.org/abs/2310.04035
  • repo_url: None
  • paper_authors: Shoko Niwa, Sayaka Shiota, Hitoshi Kiya
  • for: 本研究旨在提出一种隐私保护方法,用于 convolutional neural network(CNN)基于语音分类任务中。相比于图像分类领域中的隐私保护研究,语音分类领域尚未得到足够的关注。本研究提供一种基于随机矩阵的加密方法,以保护语音数据的隐私。
  • methods: 本研究使用了一种基于随机矩阵的加密方法,其中使用了一个可逆的随机矩阵来生成加密后的语音数据。加密后的语音数据可以通过使用一个可逆的随机矩阵来解密,并且可以完全复用原始数据。在实验中,本研究使用了自主学习前端系统,并在语音识别(ASR)和语音认证(ASV)任务中进行了实验。
  • results: 实验结果表明,使用了本研究提出的加密方法后,语音数据仍然可以完全复用原始数据,并且对于恢复攻击有很好的鲁棒性。此外,本研究还评估了加密后语音数据的难度恢复原始信息。
    Abstract In this paper, we propose a privacy-preserving method with a secret key for convolutional neural network (CNN)-based speech classification tasks. Recently, many methods related to privacy preservation have been developed in image classification research fields. In contrast, in speech classification research fields, little research has considered these risks. To promote research on privacy preservation for speech classification, we provide an encryption method with a secret key in CNN-based speech classification systems. The encryption method is based on a random matrix with an invertible inverse. The encrypted speech data with a correct key can be accepted by a model with an encrypted kernel generated using an inverse matrix of a random matrix. Whereas the encrypted speech data is strongly distorted, the classification tasks can be correctly performed when a correct key is provided. Additionally, in this paper, we evaluate the difficulty of reconstructing the original information from the encrypted spectrograms and waveforms. In our experiments, the proposed encryption methods are performed in automatic speech recognition~(ASR) and automatic speaker verification~(ASV) tasks. The results show that the encrypted data can be used completely the same as the original data when a correct secret key is provided in the transformer-based ASR and x-vector-based ASV with self-supervised front-end systems. The robustness of the encrypted data against reconstruction attacks is also illustrated.
    摘要 在这篇论文中,我们提出了一种保持隐私的方法,用于在卷积神经网络(CNN)基于的语音分类任务中。在图像分类研究领域中,最近已经有许多隐私保护方法的研究。然而,在语音分类研究领域,很少有研究者考虑到这些风险。为了促进语音分类领域中的隐私保护研究,我们提供了一种使用随机矩阵的加密方法。这种加密方法基于一个可逆的随机矩阵。具有正确密钥的加密语音数据可以通过一个使用逆矩阵生成的加密神经网络进行接受。然而,加密语音数据具有强烈的扭曲,但是在正确密钥提供下,分类任务仍然可以正确完成。此外,在这篇论文中,我们评估了加密后的原始信息重建的困难度。在我们的实验中,我们使用自动语音识别(ASR)和自动说话人验证(ASV)任务中的转换器基于ASR和x-vector基于ASV自适应前端系统进行实现。结果显示,当正确密钥提供时,加密数据可以完全 Replace original data,并且在转换器基于ASR和x-vector基于ASV自适应前端系统中,加密数据的稳定性也得到了证明。
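
A minimal sketch of the kind of secret-key transform described, an invertible random matrix applied to each feature frame and reversible only with the key, is shown below. The block size, the log-mel frame representation, and the way the key is drawn are assumptions for illustration; the paper additionally folds the inverse matrix into an "encrypted kernel" inside the model, which is not reproduced here.

```python
import numpy as np

def make_key(dim, seed):
    """Secret key: a random invertible matrix (resampled in the rare ill-conditioned case)."""
    rng = np.random.default_rng(seed)
    while True:
        K = rng.standard_normal((dim, dim))
        if np.linalg.cond(K) < 1e6:
            return K

def encrypt(frames, key):
    """Apply the key to every feature frame, e.g. each spectrogram column."""
    return frames @ key.T

def decrypt(frames_enc, key):
    return frames_enc @ np.linalg.inv(key).T

# Toy usage on a fake log-mel "spectrogram" of shape (time, mel_bins).
rng = np.random.default_rng(0)
spec = rng.standard_normal((100, 80))
key = make_key(80, seed=42)

enc = encrypt(spec, key)
dec = decrypt(enc, key)
wrong = decrypt(enc, make_key(80, seed=7))

print("recovered with correct key:", np.allclose(dec, spec))
print("relative distortion with wrong key:", np.linalg.norm(wrong - spec) / np.linalg.norm(spec))
```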

cs.CV - 2023-10-06

An Algorithm to Train Unrestricted Sequential Discrete Morphological Neural Networks

  • paper_url: http://arxiv.org/abs/2310.04584
  • repo_url: None
  • paper_authors: Diego Marcondes, Mariana Feldman, Junior Barrera
  • for: 这个论文是为了描述一种基于深度学习的数学 morphology(MM)操作 insertion into convolutional neural networks(CNN)的方法,以及这种方法的应用在二值图像转换中。
  • methods: 这种方法使用的是一种新的离散 morphological neural networks(DMNN),它可以表示特定的 W-操作类型,并使用机器学习算法来学习这些操作的参数。
  • results: 这种方法可以在二值图像转换中提供更高的性能,并且可以在不具备域知识的情况下进行应用。
    Abstract With the advent of deep learning, there have been attempts to insert mathematical morphology (MM) operators into convolutional neural networks (CNN), and the most successful endeavor to date has been the morphological neural networks (MNN). Although MNN have performed better than CNN in solving some problems, they inherit their black-box nature. Furthermore, in the case of binary images, they are approximations, which loose the Boolean lattice structure of MM operators and, thus, it is not possible to represent a specific class of W-operators with desired properties. In a recent work, we proposed the Discrete Morphological Neural Networks (DMNN) for binary image transformation to represent specific classes of W-operators and estimate them via machine learning. We also proposed a stochastic lattice gradient descent algorithm (SLGDA) to learn the parameters of Canonical Discrete Morphological Neural Networks (CDMNN), whose architecture is composed only of operators that can be decomposed as the supremum, infimum, and complement of erosions and dilations. In this paper, we propose an algorithm to learn unrestricted sequential DMNN (USDMNN), whose architecture is given by the composition of general W-operators. We consider the representation of a W-operator by its characteristic Boolean function, and then learn it via a SLGDA in the Boolean lattice of functions. Although both the CDMNN and USDMNN have the Boolean lattice structure, USDMNN are not as dependent on prior information about the problem at hand, and may be more suitable in instances in which the practitioner does not have strong domain knowledge. We illustrate the algorithm in a practical example.
    摘要 Deep learning 技术的出现,有人尝试插入数学形态(MM)运算到卷积神经网络(CNN)中,最成功的尝试是形态神经网络(MNN)。 although MNN 在解决一些问题上表现比 CNN 更好,但它们继承了黑盒模式,无法表示特定的 W-运算器。 在二进制图像的情况下,MNN 是一种近似方法,不能保持 Boolean 网格结构,因此无法表示特定的 W-运算器。在我们的最近工作中,我们提出了逻辑分割神经网络(DMNN)来解决这个问题。 DMNN 可以表示特定的 W-运算器,并通过机器学习来参数化。 我们还提出了一种Stochastic Lattice Gradient Descent Algorithm(SLGDA)来学习 Canonical Discrete Morphological Neural Networks(CDMNN)的参数,其架构由 supremum、infimum 和扩散、减小的操作组成。在这篇论文中,我们提出了一种算法来学习不受限制的顺序 DMNN(USDMNN)。 USDMNN 的架构由 general W-运算器组成。 我们认为 W-运算器的特征 Boolean 函数可以表示它,然后通过 SLGDA 在 Boolean 网格中学习它。 虽然 CDMNN 和 USDMNN 都具有 Boolean 网格结构,但 USDMNN 不受具体问题的先验知识的限制,可能更适合在具体问题上使用。 我们在实践中 illustrate 了这种算法。

Universal Humanoid Motion Representations for Physics-Based Control

  • paper_url: http://arxiv.org/abs/2310.04582
  • repo_url: None
  • paper_authors: Zhengyi Luo, Jinkun Cao, Josh Merel, Alexander Winkler, Jing Huang, Kris Kitani, Weipeng Xu
  • for: 这种新的运动表示法可以用于physics-based humanoid控制,它可以涵盖各种人工智能控制任务中的各种运动样式。
  • methods: 这种运动表示法使用了一种含有变量信息瓶颈的encoder-decoder结构,并通过学习自然人类运动数据来塑造运动表示。同时,它还使用了一种优化的采样策略来提高模型表达能力和采样效率。
  • results: 通过使用这种运动表示法,研究人员可以解决各种生成任务(如攻击和地形越过)和运动跟踪任务使用VR控制器。这种运动表示法可以生成长时间、稳定、多样化的人类运动,并且可以在各种复杂任务中表现出自然和现实的人类行为。
    Abstract We present a universal motion representation that encompasses a comprehensive range of motor skills for physics-based humanoid control. Due to the high-dimensionality of humanoid control as well as the inherent difficulties in reinforcement learning, prior methods have focused on learning skill embeddings for a narrow range of movement styles (e.g. locomotion, game characters) from specialized motion datasets. This limited scope hampers its applicability in complex tasks. Our work closes this gap, significantly increasing the coverage of motion representation space. To achieve this, we first learn a motion imitator that can imitate all of human motion from a large, unstructured motion dataset. We then create our motion representation by distilling skills directly from the imitator. This is achieved using an encoder-decoder structure with a variational information bottleneck. Additionally, we jointly learn a prior conditioned on proprioception (humanoid's own pose and velocities) to improve model expressiveness and sampling efficiency for downstream tasks. Sampling from the prior, we can generate long, stable, and diverse human motions. Using this latent space for hierarchical RL, we show that our policies solve tasks using natural and realistic human behavior. We demonstrate the effectiveness of our motion representation by solving generative tasks (e.g. strike, terrain traversal) and motion tracking using VR controllers.
    摘要 我们提出了一种涵盖广泛人形机器人控制的通用运动表示方法。由于人形机器人控制的维度较高以及学习奖励学习的自然难度,先前的方法通常是从专门的运动数据集中学习一些特定的运动风格(如行走、游戏角色)的技能嵌入。这限制了其应用在复杂任务中。我们的工作将这个差距减少,显著扩大运动表示空间的覆盖率。为了实现这一点,我们首先学习了一个可以模仿所有人类运动的运动模仿器。然后,我们通过变量信息瓶颈的encoder-decoder结构来创建我们的运动表示。此外,我们同时学习了一个受过优化的先天条件,以提高模型表达力和下游任务的采样效率。从这个幽Defaults中,我们可以生成长、稳定、多样化的人类运动。使用这个潜在空间进行层次RL,我们展示了我们的策略可以通过自然和现实的人类行为解决任务。我们通过生成任务(如击打、地形穿越)和使用VR控制器进行运动跟踪来证明了我们的运动表示的效果。

VTON-IT: Virtual Try-On using Image Translation

  • paper_url: http://arxiv.org/abs/2310.04558
  • repo_url: https://github.com/shuntos/viton-it
  • paper_authors: Santosh Adhikari, Bishnu Bhusal, Prashant Ghimire, Anil Shrestha
  • for: 这项研究旨在提供一种基于生成对抗网络的虚拟试穿服务(Virtual Try-On),以帮助用户在线上快速适应不同的服装。
  • methods: 该研究使用 semantic segmentation 和基于生成对抗网络的图像翻译网络,以生成高品质的虚拟试穿图像。
  • results: 研究发现,该方法可以生成高分辨率的自然图像,并保留人体图像中的细节 texture。
    Abstract Virtual Try-On (trying clothes virtually) is a promising application of the Generative Adversarial Network (GAN). However, it is an arduous task to transfer the desired clothing item onto the corresponding regions of a human body because of varying body size, pose, and occlusions like hair and overlapped clothes. In this paper, we try to produce photo-realistic translated images through semantic segmentation and a generative adversarial architecture-based image translation network. We present a novel image-based Virtual Try-On application VTON-IT that takes an RGB image, segments desired body part, and overlays target cloth over the segmented body region. Most state-of-the-art GAN-based Virtual Try-On applications produce unaligned pixelated synthesis images on real-life test images. However, our approach generates high-resolution natural images with detailed textures on such variant images.
    摘要 虚拟试穿(虚拟尝试服装)是生成对抗网络(GAN)的应用之一,但是将想要的服装 Item onto 人体部位的任务是困难的,因为人体大小、姿势和遮盾物如头发和覆盖的衣服会导致困难。在这篇论文中,我们尝试通过 semantic segmentation 和基于生成对抗网络的图像翻译网络来生成高品质的译文图像。我们提出了一个名为 VTON-IT 的新型虚拟试穿应用,它从 RGB 图像中提取欲要的体部,并将目标衣服覆盖到提取的体部上。大多数现状的 GAN-based Virtual Try-On 应用程序在实际测试图像上生成不一致的Pixelated Synthesis图像,但我们的方法可以在 variant 图像上生成高分辨率的自然图像,具有细腻的文件。

MeSa: Masked, Geometric, and Supervised Pre-training for Monocular Depth Estimation

  • paper_url: http://arxiv.org/abs/2310.04551
  • repo_url: None
  • paper_authors: Muhammad Osama Khan, Junbang Liang, Chun-Kai Wang, Shan Yang, Yu Lou
  • for: The paper aims to improve the performance of monocular depth estimation models by proposing a comprehensive framework called MeSa, which leverages the complementary strengths of masked, geometric, and supervised pre-training.
  • methods: The paper uses a combination of pre-training techniques (masked, geometric, and supervised) to improve the representations of the later layers of the model, and applies the layer-wise analysis technique CKA to evaluate the effectiveness of the pre-training strategy.
  • results: The paper demonstrates performance improvements in both the in-distribution and out-of-distribution settings on the NYUv2 and IBims-1 datasets compared to the SOTA SSL method, surpassing the masked pre-training SSL method by a substantial margin of 17.1% on the RMSE and establishing a new state of the art for monocular depth estimation on the challenging NYUv2 dataset.
    Abstract Pre-training has been an important ingredient in developing strong monocular depth estimation models in recent years. For instance, self-supervised learning (SSL) is particularly effective by alleviating the need for large datasets with dense ground-truth depth maps. However, despite these improvements, our study reveals that the later layers of the SOTA SSL method are actually suboptimal. By examining the layer-wise representations, we demonstrate significant changes in these later layers during fine-tuning, indicating the ineffectiveness of their pre-trained features for depth estimation. To address these limitations, we propose MeSa, a comprehensive framework that leverages the complementary strengths of masked, geometric, and supervised pre-training. Hence, MeSa benefits from not only general-purpose representations learnt via masked pre training but also specialized depth-specific features acquired via geometric and supervised pre-training. Our CKA layer-wise analysis confirms that our pre-training strategy indeed produces improved representations for the later layers, overcoming the drawbacks of the SOTA SSL method. Furthermore, via experiments on the NYUv2 and IBims-1 datasets, we demonstrate that these enhanced representations translate to performance improvements in both the in-distribution and out-of-distribution settings. We also investigate the influence of the pre-training dataset and demonstrate the efficacy of pre-training on LSUN, which yields significantly better pre-trained representations. Overall, our approach surpasses the masked pre-training SSL method by a substantial margin of 17.1% on the RMSE. Moreover, even without utilizing any recently proposed techniques, MeSa also outperforms the most recent methods and establishes a new state-of-the-art for monocular depth estimation on the challenging NYUv2 dataset.
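
The layer-wise CKA analysis referred to above can be reproduced with the standard linear-CKA formula of Kornblith et al.; the sketch below compares two sets of layer activations on the same inputs (random stand-ins here), with values near 1 indicating that a layer's representation barely changed, e.g. between pre-training and fine-tuning.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between two activation matrices
    of shape (n_samples, n_features_x) and (n_samples, n_features_y)."""
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

# Toy usage: compare pre-trained vs. fine-tuned features of one layer.
rng = np.random.default_rng(0)
feat_pretrained = rng.standard_normal((512, 768))
feat_finetuned = feat_pretrained + 0.5 * rng.standard_normal((512, 768))
print(linear_cka(feat_pretrained, feat_finetuned))   # close to 1 => layer barely changed
```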

Iris Liveness Detection Competition (LivDet-Iris) – The 2023 Edition

  • paper_url: http://arxiv.org/abs/2310.04541
  • repo_url: None
  • paper_authors: Patrick Tinsley, Sandip Purnapatra, Mahsa Mitcheff, Aidan Boyd, Colton Crum, Kevin Bowyer, Patrick Flynn, Stephanie Schuckers, Adam Czajka, Meiling Fang, Naser Damer, Xingyu Liu, Caiyong Wang, Xianyun Sun, Zhaohua Chang, Xinyue Li, Guangzhe Zhao, Juan Tapia, Christoph Busch, Carlos Aravena, Daniel Schulz
  • for: 这个研究报告描述了2023年度的’’LivDet’’系列眼睛展示攻击检测(PAD)竞赛的结果。
  • methods: 这次竞赛新增了由生成对抗网络(GAN)生成的虹膜图像作为一类攻击载体(PAI),并以人类检测PAI的准确率作为参考基准进行了评估。
  • results: Clarkson University和 Notre Dame大学提供了竞赛中使用的图像集,包括7种不同的PAI类型的样本,以及基eline PAD算法。 Fraunhofer IGD、北京市市政工程学院和 Höchschule Darmstadt提交了共8个PAD算法的结果。 根据不同的PAI类型,分析了准确率结果,并与人工准确率进行比较。 总的来说,Fraunhofer IGD的算法(使用注意力基于像素精度网络)获得了最佳权重准确率(37.31%的平均分类错误率),而北京市市政工程学院的算法(使用等权重)获得了平均分类率(22.15%)。这些结果表明,眼睛PAD仍然是一个具有挑战性的问题。
    Abstract This paper describes the results of the 2023 edition of the ''LivDet'' series of iris presentation attack detection (PAD) competitions. New elements in this fifth competition include (1) GAN-generated iris images as a category of presentation attack instruments (PAI), and (2) an evaluation of human accuracy at detecting PAI as a reference benchmark. Clarkson University and the University of Notre Dame contributed image datasets for the competition, composed of samples representing seven different PAI categories, as well as baseline PAD algorithms. Fraunhofer IGD, Beijing University of Civil Engineering and Architecture, and Hochschule Darmstadt contributed results for a total of eight PAD algorithms to the competition. Accuracy results are analyzed by different PAI types, and compared to human accuracy. Overall, the Fraunhofer IGD algorithm, using an attention-based pixel-wise binary supervision network, showed the best-weighted accuracy results (average classification error rate of 37.31%), while the Beijing University of Civil Engineering and Architecture's algorithm won when equal weights for each PAI were given (average classification rate of 22.15%). These results suggest that iris PAD is still a challenging problem.

The Unreasonable Effectiveness of Linear Prediction as a Perceptual Metric

  • paper_url: http://arxiv.org/abs/2310.05986
  • repo_url: https://github.com/dsevero/linear-autoregressive-similarity-index
  • paper_authors: Daniel Severo, Lucas Theis, Johannes Ballé
  • for: 这个论文旨在构建一种基于推理时的可视系统嵌入,不需要训练数据或深度神经网络特征。
  • methods: 该方法使用了一个权重最小二乘(WLS)问题,定义在像素级别,并在推理时解决,以捕捉全像和局部图像特征。
  • results: 实验表明,使用这种方法可以与基于学习的深度特征方法(如LPIPS和PIM)竞争,而且与手动设计的方法(如MS-SSIM)具有相似的计算成本。
    Abstract We show how perceptual embeddings of the visual system can be constructed at inference-time with no training data or deep neural network features. Our perceptual embeddings are solutions to a weighted least squares (WLS) problem, defined at the pixel-level, and solved at inference-time, that can capture global and local image characteristics. The distance in embedding space is used to define a perceptual similarity metric which we call LASI: Linear Autoregressive Similarity Index. Experiments on full-reference image quality assessment datasets show LASI performs competitively with learned deep feature based methods like LPIPS (Zhang et al., 2018) and PIM (Bhardwaj et al., 2020), at a similar computational cost to hand-crafted methods such as MS-SSIM (Wang et al., 2003). We found that increasing the dimensionality of the embedding space consistently reduces the WLS loss while increasing performance on perceptual tasks, at the cost of increasing the computational complexity. LASI is fully differentiable, scales cubically with the number of embedding dimensions, and can be parallelized at the pixel-level. A Maximum Differentiation (MAD) competition (Wang & Simoncelli, 2008) between LASI and LPIPS shows that both methods are capable of finding failure points for the other, suggesting these metrics can be combined.
    摘要 我们展示了如何在推理时构建视觉系统的感知嵌入,无需训练数据或深度神经网络特征。我们的感知嵌入是解定 weights 最小二乘(WLS)问题的解,定义在像素级别,并在推理时解决,可以捕捉全局和本地图像特征。在嵌入空间中的距离被用来定义一个感知相似度指标,我们称之为线性感知相似度指标(LASI)。我们的实验表明,LASI在全参照图像质量评估数据集上与基于深度学习特征的方法如LPIPS(Zhang et al., 2018)和PIM(Bhardwaj et al., 2020)竞争,同时与手工设计的方法如MS-SSIM(Wang et al., 2003)相似的计算成本。我们发现,增加嵌入空间维度可以逐渐降低WLS损失,同时提高感知任务的性能,但是会增加计算复杂度。LASI是完全导数的,可以在像素级别并行化,并且呈 кубические增长与嵌入维度相关。在MAD竞赛(Wang & Simoncelli, 2008)中,LASI和LPIPS之间进行了竞争,表明这两个指标可以相互找到失败点,这些指标可以结合使用。
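
As a rough picture of an inference-time, pixel-level least-squares embedding, the sketch below fits linear autoregressive coefficients that predict each pixel from a causal neighbourhood and uses the distance between coefficient vectors as the similarity score. It is a simplified, unweighted stand-in for the paper's weighted least-squares formulation; the neighbourhood shape and the absence of weights are my assumptions.

```python
import numpy as np

def lasi_like_embedding(img, patch=3):
    """Illustrative embedding: least-squares linear autoregressive coefficients
    that predict each pixel from the rows above and the pixels to its left."""
    H, W = img.shape
    feats, targets = [], []
    for i in range(patch, H):
        for j in range(patch, W - patch):
            above = img[i - patch:i, j - patch:j + patch + 1].ravel()
            left = img[i, j - patch:j]
            feats.append(np.concatenate([above, left]))
            targets.append(img[i, j])
    A, b = np.asarray(feats), np.asarray(targets)
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coef

def embedding_distance(img_a, img_b):
    """Perceptual distance as Euclidean distance between the two embeddings."""
    return np.linalg.norm(lasi_like_embedding(img_a) - lasi_like_embedding(img_b))

rng = np.random.default_rng(0)
ref = rng.random((64, 64))
print(embedding_distance(ref, ref + 0.01 * rng.standard_normal((64, 64))))
print(embedding_distance(ref, rng.random((64, 64))))
```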

URLOST: Unsupervised Representation Learning without Stationarity or Topology

  • paper_url: http://arxiv.org/abs/2310.04496
  • repo_url: None
  • paper_authors: Zeyu Yun, Juexiao Zhang, Bruno Olshausen, Yann LeCun, Yubei Chen
  • for: 该论文旨在开发一种可以从高维数据中学习有意义表示的无监督学习方法,不受数据模式性和结构的限制。
  • methods: 该模型结合学习自组织层、密度调整spectral clustering和伪讯采样层。
  • results: 与现有无监督学习方法相比,该模型在多种数据模式下可以学习有意义表示,并且在许多数据集上表现出色。
    Abstract Unsupervised representation learning has seen tremendous progress but is constrained by its reliance on data modality-specific stationarity and topology, a limitation not found in biological intelligence systems. For instance, human vision processes visual signals derived from irregular and non-stationary sampling lattices yet accurately perceives the geometry of the world. We introduce a novel framework that learns from high-dimensional data lacking stationarity and topology. Our model combines a learnable self-organizing layer, density adjusted spectral clustering, and masked autoencoders. We evaluate its effectiveness on simulated biological vision data, neural recordings from the primary visual cortex, and gene expression datasets. Compared to state-of-the-art unsupervised learning methods like SimCLR and MAE, our model excels at learning meaningful representations across diverse modalities without depending on stationarity or topology. It also outperforms other methods not dependent on these factors, setting a new benchmark in the field. This work represents a step toward unsupervised learning methods that can generalize across diverse high-dimensional data modalities.
    摘要 无监督表征学学习在很大程度上得到了进步,但它受到数据类型特有的静止和结构的限制,这与生物智能系统不同。例如,人类视觉处理视觉信号来自不规则和不静止的抽样网络,然而准确地感知世界的几何结构。我们提出了一种新的框架,可以从缺乏静止和结构的高维数据学习有意义的表示。我们的模型结合可学习的自组织层、适应率调整的спектраль clustering和掩码自适应器。我们对用于生物视觉数据、视觉核心区域神经记录和基因表达数据进行评估,与现状的无监督学习方法SimCLR和MAE相比,我们的模型在不同的数据模式下学习有意义的表示,不依赖于静止和结构。它还超过了其他不依赖于这些因素的方法,创造了新的benchmark在这个领域。这种工作代表了无监督学习方法在多种高维数据模式下的总结。

Alice Benchmarks: Connecting Real World Object Re-Identification with the Synthetic

  • paper_url: http://arxiv.org/abs/2310.04416
  • repo_url: None
  • paper_authors: Xiaoxiao Sun, Yue Yao, Shengjin Wang, Hongdong Li, Liang Zheng
  • for: 本研究的目的是提供一个大规模的实验数据集和评估协议,以便研究从合成数据学习摄像头识别(re-ID)领域的新方法。
  • methods: 本研究使用了现有的PersonX和VehicleX作为合成源领域,并收集了两个具有挑战性的实际世界目标数据集:AlicePerson和AliceVehicle。
  • results: 本研究提供了一个大规模的实验数据集和评估协议,以便研究从合成数据学习摄像头识别领域的新方法。在本研究中,我们还提供了一个线上服务器,让社区可以便捷地评估和比较不同的方法。
    Abstract For object re-identification (re-ID), learning from synthetic data has become a promising strategy to cheaply acquire large-scale annotated datasets and effective models, with few privacy concerns. Many interesting research problems arise from this strategy, e.g., how to reduce the domain gap between synthetic source and real-world target. To facilitate developing more new approaches in learning from synthetic data, we introduce the Alice benchmarks, large-scale datasets providing benchmarks as well as evaluation protocols to the research community. Within the Alice benchmarks, two object re-ID tasks are offered: person and vehicle re-ID. We collected and annotated two challenging real-world target datasets: AlicePerson and AliceVehicle, captured under various illuminations, image resolutions, etc. As an important feature of our real target, the clusterability of its training set is not manually guaranteed to make it closer to a real domain adaptation test scenario. Correspondingly, we reuse existing PersonX and VehicleX as synthetic source domains. The primary goal is to train models from synthetic data that can work effectively in the real world. In this paper, we detail the settings of Alice benchmarks, provide an analysis of existing commonly-used domain adaptation methods, and discuss some interesting future directions. An online server will be set up for the community to evaluate methods conveniently and fairly.
    摘要 对于对象重识别(re-ID),从合成数据中学习已成为一种可以低成本获得大规模标注数据和有效模型、且隐私顾虑较少的有前景的策略。这种策略引发了许多有趣的研究问题,例如如何减少合成源域和实际世界目标域之间的领域差异。为了推动基于合成数据学习的新方法,我们介绍了Alice基准,这是一个大规模数据集,同时为研究社区提供了评估协议和评价标准。在Alice基准中,我们提供了两个对象重识别任务:行人重识别和车辆重识别。我们收集并标注了两个具有挑战性的实际世界目标数据集:AlicePerson和AliceVehicle,它们在不同的照明、图像分辨率等条件下采集。作为实际目标数据的一个重要特点,其训练集的可聚类性并未经人工保证,以便更接近真实的领域适应测试场景。相应地,我们复用现有的PersonX和VehicleX作为合成源域。我们的主要目标是用合成数据训练出能在实际世界中有效工作的模型。在这篇论文中,我们详细介绍了Alice基准的设置,分析了现有常用的领域适应方法,并讨论了一些有趣的未来方向。我们还将设立一个在线服务器,以便社区便捷、公平地评估各种方法。

CIFAR-10-Warehouse: Broad and More Realistic Testbeds in Model Generalization Analysis

  • paper_url: http://arxiv.org/abs/2310.04414
  • repo_url: https://github.com/sxzrt/CIFAR-10-W
  • paper_authors: Xiaoxiao Sun, Xingjian Leng, Zijian Wang, Yang Yang, Zi Huang, Liang Zheng
  • for: 本研究目的是探讨机器学习模型在不同的未知环境中表现的研究问题。
  • methods: 本文引入了CIFAR-10-Warehouse测试环境,包含180个由搜索图像引擎和扩散模型不同方式生成的数据集。
  • results: 对CIFAR-10-W进行了广泛的 benchmarking 和比较实验,并显示了这些任务中的新和 interessante 结论。
    Abstract Analyzing model performance in various unseen environments is a critical research problem in the machine learning community. To study this problem, it is important to construct a testbed with out-of-distribution test sets that have broad coverage of environmental discrepancies. However, existing testbeds typically either have a small number of domains or are synthesized by image corruptions, hindering algorithm design that demonstrates real-world effectiveness. In this paper, we introduce CIFAR-10-Warehouse, consisting of 180 datasets collected by prompting image search engines and diffusion models in various ways. Generally sized between 300 and 8,000 images, the datasets contain natural images, cartoons, certain colors, or objects that do not naturally appear. With CIFAR-10-W, we aim to enhance the evaluation and deepen the understanding of two generalization tasks: domain generalization and model accuracy prediction in various out-of-distribution environments. We conduct extensive benchmarking and comparison experiments and show that CIFAR-10-W offers new and interesting insights inherent to these tasks. We also discuss other fields that would benefit from CIFAR-10-W.
    摘要 研究机器学习模型在不同环境下的性能分析是机器学习社区中的一个关键问题。为了研究这个问题,构建一个包含多个环境差异的测试环境是非常重要的。然而,现有的测试环境通常只有一小部分的领域,或者通过图像损害Synthesized,这限制了算法设计的实际效果。在这篇论文中,我们介绍了CIFAR-10-Warehouse,包含180个数据集,通过图像搜索引擎和扩散模型的不同方法收集。这些数据集的大小通常在300和8,000个图像之间,包含自然图像、动漫、特定颜色或者不自然出现的对象。通过CIFAR-10-W,我们想要提高评估和深入理解两个总结任务:领域总结和模型在多个不同环境下的准确率预测。我们进行了广泛的比较和实验,并显示了CIFAR-10-W提供了新和有趣的总结预测和模型性能评估的视角。此外,我们还讨论了其他领域可以从CIFAR-10-W中受益。

FedConv: Enhancing Convolutional Neural Networks for Handling Data Heterogeneity in Federated Learning

  • paper_url: http://arxiv.org/abs/2310.04412
  • repo_url: https://github.com/ucsc-vlaa/fedconv
  • paper_authors: Peiran Xu, Zeyu Wang, Jieru Mei, Liangqiong Qu, Alan Yuille, Cihang Xie, Yuyin Zhou
  • for: 这篇论文旨在探讨 Federated Learning (FL) 中不同设备上的数据不一致性问题,以及如何使用不同的架构元素来改善 FL 的性能。
  • methods: 该论文采用了一系列的实验研究,以探讨不同架构元素(如激活函数和归一化层)对 FL 性能的影响。
  • results: 研究发现,通过灵活地修改架构元素,纯 CNN 可以在处理不同数据客户端的 FL 中达到与 ViT 相当或甚至超过其 robustness 水平。此外,该方法可以与现有 FL 技术结合使用,并在多种 FL 标准准样上达到状态 arts 解决方案。
    Abstract Federated learning (FL) is an emerging paradigm in machine learning, where a shared model is collaboratively learned using data from multiple devices to mitigate the risk of data leakage. While recent studies posit that Vision Transformer (ViT) outperforms Convolutional Neural Networks (CNNs) in addressing data heterogeneity in FL, the specific architectural components that underpin this advantage have yet to be elucidated. In this paper, we systematically investigate the impact of different architectural elements, such as activation functions and normalization layers, on the performance within heterogeneous FL. Through rigorous empirical analyses, we are able to offer the first-of-its-kind general guidance on micro-architecture design principles for heterogeneous FL. Intriguingly, our findings indicate that with strategic architectural modifications, pure CNNs can achieve a level of robustness that either matches or even exceeds that of ViTs when handling heterogeneous data clients in FL. Additionally, our approach is compatible with existing FL techniques and delivers state-of-the-art solutions across a broad spectrum of FL benchmarks. The code is publicly available at https://github.com/UCSC-VLAA/FedConv
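
FedConv is studied within the standard federated averaging setting; the sketch below is a generic single-round FedAvg aggregation over heterogeneous clients (plain PyTorch scaffolding for context, not the architectural modifications the paper investigates).

```python
import copy
import torch
import torch.nn as nn

def fedavg_round(global_model, client_loaders, lr=0.01, local_epochs=1):
    """One communication round of federated averaging over heterogeneous clients."""
    client_states, client_sizes = [], []
    for loader in client_loaders:
        local = copy.deepcopy(global_model)          # each client starts from the global model
        opt = torch.optim.SGD(local.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(local_epochs):
            for x, y in loader:
                opt.zero_grad()
                loss_fn(local(x), y).backward()
                opt.step()
        client_states.append(local.state_dict())
        client_sizes.append(len(loader.dataset))
    # Weighted average of client weights, proportional to local dataset size.
    total = sum(client_sizes)
    new_state = {
        k: sum(s[k].float() * (n / total) for s, n in zip(client_states, client_sizes))
        for k in client_states[0]
    }
    global_model.load_state_dict(new_state)
    return global_model
```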

Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

  • paper_url: http://arxiv.org/abs/2310.04378
  • repo_url: https://github.com/luosiallen/latent-consistency-model
  • paper_authors: Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, Hang Zhao
  • for: 本研究旨在提高潜在扩散模型(LDM)的渐进生成速度,通过借鉴一致模型(song et al.)提出了潜在一致模型(LCM),可以快速地进行推理,并且可以在任何已经训练过LDM的基础上进行快速的高质量插值。
  • methods: 本研究使用了一种新的方法——潜在一致模型(LCM),它是通过解决一个扩展的概率流ODE(PF-ODE)来直接预测潜在空间中的解决方案,从而消除了许多迭代过程,使得渐进生成速度加快。
  • results: 根据LAION-5B-Aesthetics dataset的评估结果,LCMs可以在几步中实现高质量的文本到图像生成,并且与已有的潜在扩散模型(LDM)相比,LCMs可以快速地进行推理,从而提高渐进生成的速度。
    Abstract Latent Diffusion models (LDMs) have achieved remarkable results in synthesizing high-resolution images. However, the iterative sampling process is computationally intensive and leads to slow generation. Inspired by Consistency Models (song et al.), we propose Latent Consistency Models (LCMs), enabling swift inference with minimal steps on any pre-trained LDMs, including Stable Diffusion (rombach et al). Viewing the guided reverse diffusion process as solving an augmented probability flow ODE (PF-ODE), LCMs are designed to directly predict the solution of such ODE in latent space, mitigating the need for numerous iterations and allowing rapid, high-fidelity sampling. Efficiently distilled from pre-trained classifier-free guided diffusion models, a high-quality 768 x 768 2~4-step LCM takes only 32 A100 GPU hours for training. Furthermore, we introduce Latent Consistency Fine-tuning (LCF), a novel method that is tailored for fine-tuning LCMs on customized image datasets. Evaluation on the LAION-5B-Aesthetics dataset demonstrates that LCMs achieve state-of-the-art text-to-image generation performance with few-step inference. Project Page: https://latent-consistency-models.github.io/
    摘要 潜在扩散模型(LDMs)在高分辨率图像合成上取得了显著成果,但其迭代采样过程计算量大、生成速度慢。受一致性模型(Song et al.)启发,我们提出潜在一致性模型(LCMs),能够在任何预训练LDM(包括Stable Diffusion, Rombach et al.)上以极少步数快速推理。通过将带引导的反向扩散过程视为求解一个增广的概率流ODE(PF-ODE),LCM被设计为直接在潜在空间中预测该ODE的解,从而免去大量迭代,实现快速、高保真的采样。LCM可以从预训练的无分类器引导扩散模型中高效蒸馏得到:训练一个高质量的768×768、2~4步LCM仅需约32个A100 GPU小时。此外,我们提出潜在一致性微调(LCF),一种专为在自定义图像数据集上微调LCM而设计的新方法。在LAION-5B-Aesthetics数据集上的评估表明,LCM能以极少步数推理实现最先进的文本到图像生成性能。项目页面:https://latent-consistency-models.github.io/
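
    For readers unfamiliar with consistency distillation, the sketch below summarizes the generic consistency-model objective that LCMs adapt to latent space; the notation is ours, following Song et al.'s consistency models, and it is not lifted from the paper.

```latex
% Consistency-distillation sketch (notation ours, following Song et al.):
% f_theta maps any latent z_t on a guided PF-ODE trajectory back to the trajectory origin.
\begin{aligned}
  & f_\theta(\mathbf{z}_t, c, t) = f_\theta(\mathbf{z}_{t'}, c, t')
      \quad \text{for all } t, t' \text{ on the same PF-ODE trajectory,} \\
  & f_\theta(\mathbf{z}_\epsilon, c, \epsilon) = \mathbf{z}_\epsilon
      \quad \text{(boundary condition),} \\
  & \mathcal{L}(\theta) = \mathbb{E}\!\left[\, d\!\left(
        f_\theta(\mathbf{z}_{t_{n+1}}, c, t_{n+1}),\;
        f_{\theta^-}(\hat{\mathbf{z}}_{t_n}, c, t_n)\right) \right]
\end{aligned}
```

    Here the target latent is obtained from the current one by a single ODE-solver step through the pre-trained guided diffusion model, the second network is an EMA copy of the first, and d is a distance in latent space; predicting the ODE solution directly is what removes the long iterative sampling loop.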

SwimXYZ: A large-scale dataset of synthetic swimming motions and videos

  • paper_url: http://arxiv.org/abs/2310.04360
  • repo_url: None
  • paper_authors: Fiche Guénolé, Sevestre Vincent, Gonzalez-Barral Camila, Leglaive Simon, Séguier Renaud
  • for: 本研究旨在提供一个可靠的Synthetic dataset of swimming motions和视频,以便进行人体动作分析和评估。
  • methods: 研究人员构建了一个合成的游泳动作数据集,包含340万帧带有二维和三维关节真值标注的图像,以及240个以SMPL参数格式表示的游泳动作序列。
  • results: 研究人员通过对SwimXYZ dataset进行分析和应用,实现了人体动作的分类和2D姿态估计。
    Abstract Technologies play an increasingly important role in sports and become a real competitive advantage for the athletes who benefit from it. Among them, the use of motion capture is developing in various sports to optimize sporting gestures. Unfortunately, traditional motion capture systems are expensive and constraining. Recently developed computer vision-based approaches also struggle in certain sports, like swimming, due to the aquatic environment. One of the reasons for the gap in performance is the lack of labeled datasets with swimming videos. In an attempt to address this issue, we introduce SwimXYZ, a synthetic dataset of swimming motions and videos. SwimXYZ contains 3.4 million frames annotated with ground truth 2D and 3D joints, as well as 240 sequences of swimming motions in the SMPL parameters format. In addition to making this dataset publicly available, we present use cases for SwimXYZ in swimming stroke clustering and 2D pose estimation.
    摘要 科技在体育中扮演着越来越重要的角色,成为运动员获得优势的重要工具。其中,基于动作捕捉的技术在不同体育运动中增加了优势。然而,传统的动作捕捉系统费用高且限制性强。近些年来,基于计算机视觉的方法也在某些运动中表现不佳,如游泳,因为水下环境会导致计算机视觉的准确率下降。一个导致这种差距的原因是游泳动作的标注数据缺乏。为解决这个问题,我们介绍了SwimXYZ,一个人工生成的游泳动作数据集。SwimXYZ包含340万帧的标注2D和3D关节点,以及240个游泳动作序列在SMPL参数格式下。此外,我们还公布了SwimXYZ数据集,并提供了游泳动作分组和2D姿态估计的应用场景。

Distributed Deep Joint Source-Channel Coding with Decoder-Only Side Information

  • paper_url: http://arxiv.org/abs/2310.04311
  • repo_url: None
  • paper_authors: Selim F. Yilmaz, Ezgi Ozyilkan, Deniz Gunduz, Elza Erkip
  • for: 这篇论文研究在噪声无线信道上进行低延迟图像传输,且相关的侧信息仅在接收端可用(Wyner-Ziv场景)。
  • methods: 作者采用数据驱动的联合信源-信道编码(JSCC)方法,该方法在实际的有限码长情形下已被证明优于传统的分离式方法,并能随信道质量平滑退化。作者在接收端提出了一种新的神经网络架构,在解码器的多个阶段融合仅解码端可用的侧信息。
  • results: 实验结果表明,所提方法能够有效融合侧信息,在所有信道噪声水平下、针对文中考虑的多种失真指标均有性能提升,尤其是在低信噪比(SNR)和小带宽比(BR)的情形下。作者还公开了源代码,以便后续研究和结果复现。
    Abstract We consider low-latency image transmission over a noisy wireless channel when correlated side information is present only at the receiver side (the Wyner-Ziv scenario). In particular, we are interested in developing practical schemes using a data-driven joint source-channel coding (JSCC) approach, which has been previously shown to outperform conventional separation-based approaches in the practical finite blocklength regimes, and to provide graceful degradation with channel quality. We propose a novel neural network architecture that incorporates the decoder-only side information at multiple stages at the receiver side. Our results demonstrate that the proposed method succeeds in integrating the side information, yielding improved performance at all channel noise levels in terms of the various distortion criteria considered here, especially at low channel signal-to-noise ratios (SNRs) and small bandwidth ratios (BRs). We also provide the source code of the proposed method to enable further research and reproducibility of the results.
    摘要 我们考虑在噪声无线信道上进行低延迟图像传输,其中相关的侧信息仅在接收端可用(Wyner-Ziv场景)。我们重点研究基于数据驱动的联合信源-信道编码(JSCC)的实用方案,该方法在实际的有限码长情形下已被证明优于传统的分离式方案,并能随信道质量平滑退化。我们提出了一种新的神经网络架构,在接收端的多个阶段融合仅解码器可用的侧信息。结果表明,所提方法能够有效融合侧信息,在所有信道噪声水平下、针对文中考虑的多种失真指标均有性能提升,尤其是在低信噪比(SNR)和小带宽比(BR)下。我们还提供了源代码,以便后续研究和结果复现。
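
    For intuition, here is a minimal sketch of how receiver-only side information can be fused at several decoder stages, in the spirit of the abstract; the convolutional blocks, the AWGN channel, and all layer sizes are illustrative stand-ins, not the paper's architecture.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps the source image to channel symbols (the number of output channels sets the bandwidth ratio)."""
    def __init__(self, out_ch=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, out_ch, 5, stride=2, padding=2),
        )
    def forward(self, x):
        z = self.net(x)
        # Illustrative per-sample energy normalization standing in for a power constraint.
        return z / z.flatten(1).norm(dim=1).view(-1, 1, 1, 1)

class SideInfoDecoder(nn.Module):
    """Decoder that fuses decoder-only side information at two stages (Wyner-Ziv style)."""
    def __init__(self, in_ch=16):
        super().__init__()
        self.side_feat1 = nn.Conv2d(3, 64, 5, stride=4, padding=2)   # side info at the bottleneck scale
        self.up1 = nn.Sequential(nn.ConvTranspose2d(in_ch + 64, 64, 4, stride=2, padding=1), nn.ReLU())
        self.side_feat2 = nn.Conv2d(3, 32, 5, stride=2, padding=2)   # side info at an intermediate scale
        self.up2 = nn.ConvTranspose2d(64 + 32, 3, 4, stride=2, padding=1)
    def forward(self, y, side):
        h = torch.cat([y, self.side_feat1(side)], dim=1)   # stage-1 fusion
        h = self.up1(h)
        h = torch.cat([h, self.side_feat2(side)], dim=1)   # stage-2 fusion
        return self.up2(h)

def awgn(z, snr_db):
    """Additive Gaussian noise at a nominal SNR (toy channel model)."""
    sigma = (10 ** (-snr_db / 10)) ** 0.5
    return z + sigma * torch.randn_like(z)

# One training step: the correlated side image is available only at the receiver.
enc, dec = Encoder(), SideInfoDecoder()
x = torch.rand(4, 3, 64, 64)                               # source images
side = (x + 0.1 * torch.randn_like(x)).clamp(0, 1)         # toy correlated side information
x_hat = dec(awgn(enc(x), snr_db=10), side)
loss = nn.functional.mse_loss(x_hat, x)
loss.backward()
```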

Convergent ADMM Plug and Play PET Image Reconstruction

  • paper_url: http://arxiv.org/abs/2310.04299
  • repo_url: None
  • paper_authors: Florent Sureau, Mahdi Latreche, Marion Savanier, Claude Comtat
  • for: 这个论文旨在研究基于模型基本的变量重建法和独立学习深度神经网络算法的混合PET重建方法。
  • methods: 该方法使用ADMM插件和玩家框架,并在学习过程中添加了一个约束来保证网络参数的稳定性。
  • results: 实验表明,当不在学习过程中 enforcing该约束时,ADMM算法不会 converges。而在 enforcing该约束时,方法实际上可以达到意义的稳定点。
    Abstract In this work, we investigate hybrid PET reconstruction algorithms based on coupling a model-based variational reconstruction and the application of a separately learnt Deep Neural Network operator (DNN) in an ADMM Plug and Play framework. Following recent results in optimization, fixed point convergence of the scheme can be achieved by enforcing an additional constraint on network parameters during learning. We propose such an ADMM algorithm and show in a realistic [18F]-FDG synthetic brain exam that the proposed scheme indeed lead experimentally to convergence to a meaningful fixed point. When the proposed constraint is not enforced during learning of the DNN, the proposed ADMM algorithm was observed experimentally not to converge.
    摘要 在这个工作中,我们研究了基于模型基于变量重建和独立学习深度神经网络操作符(DNN)的混合PET重建算法,并在ADMM插件和撤退框架中应用。根据最近的优化结果,我们提出了一种强制在学习过程中加入网络参数的约束,以实现Fixed point convergence的方案。我们提出的ADMM算法,并在一个实际的 [18F]-FDG Synthetic brain exam中展示了,该方案实际上可以导致实际上达到一个意义的Fixed point。当不加入学习DNN的约束时,我们发现了ADMM算法在学习过程中不会 converge。
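
    The scheme above builds on plug-and-play ADMM; a generic sketch of that iteration is shown below with a simple linear forward operator and a placeholder shrinkage "denoiser" standing in for the learned DNN. The PET Poisson likelihood and the paper's additional constraint on the network parameters (in the PnP literature such constraints usually aim at making the denoiser nonexpansive so that a fixed point exists) are not reproduced here.

```python
import numpy as np

def pnp_admm(A, y, denoise, rho=1.0, n_iter=50):
    """Generic plug-and-play ADMM for min_x 0.5*||Ax - y||^2 + R(x),
    where the prox of R is replaced by a (possibly learned) denoiser."""
    m, n = A.shape
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)   # u is the scaled dual variable
    AtA, Aty = A.T @ A, A.T @ y
    H = np.linalg.inv(AtA + rho * np.eye(n))          # direct solve; small-scale demo only
    for _ in range(n_iter):
        x = H @ (Aty + rho * (z - u))                 # data-fidelity (x-)update
        z = denoise(x + u)                            # prior step: plugged-in denoiser
        u = u + x - z                                 # dual update
    return x

# Toy usage: a soft-threshold denoiser stands in for the learned network.
rng = np.random.default_rng(0)
A = rng.normal(size=(80, 40))
x_true = np.zeros(40); x_true[:5] = 1.0
y = A @ x_true + 0.01 * rng.normal(size=80)
x_rec = pnp_admm(A, y, denoise=lambda v: np.sign(v) * np.maximum(np.abs(v) - 0.05, 0.0))
print(np.linalg.norm(x_rec - x_true) / np.linalg.norm(x_true))   # relative reconstruction error
```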

Graph learning in robotics: a survey

  • paper_url: http://arxiv.org/abs/2310.04294
  • repo_url: None
  • paper_authors: Francesca Pistilli, Giuseppe Averta
  • for: 本文旨在探讨深度神经网络在 роботех学中的应用,以便充分发挥其潜力。
  • methods: 本文总结了图像基于模型的基本知识,包括其建立、训练过程和应用。同时,它还讨论了在实际应用中遇到的最新进展和挑战,例如感知、决策和控制的集成。
  • results: 本文提供了各种机器人应用,如身体和接触模型、机器人操作、动作识别、舰队动力规划等,以及这些应用中图像学习的可能性和局限性。这篇文章旨在为读者提供图像学习在机器人领域的全面了解,并提出未来研究的可能性。
    Abstract Deep neural networks for graphs have emerged as a powerful tool for learning on complex non-euclidean data, which is becoming increasingly common for a variety of different applications. Yet, although their potential has been widely recognised in the machine learning community, graph learning is largely unexplored for downstream tasks such as robotics applications. To fully unlock their potential, hence, we propose a review of graph neural architectures from a robotics perspective. The paper covers the fundamentals of graph-based models, including their architecture, training procedures, and applications. It also discusses recent advancements and challenges that arise in applied settings, related for example to the integration of perception, decision-making, and control. Finally, the paper provides an extensive review of various robotic applications that benefit from learning on graph structures, such as bodies and contacts modelling, robotic manipulation, action recognition, fleet motion planning, and many more. This survey aims to provide readers with a thorough understanding of the capabilities and limitations of graph neural architectures in robotics, and to highlight potential avenues for future research.
    摘要 深度神经网络 для图有效地处理复杂非欧几何数据,这种数据在各种应用中日益普遍。然而,虽然机器学习社区对其潜力广泛认可,但图学习在机器人应用中尚未得到广泛探索。为充分发挥其潜力,我们提出了对图神经建筑从机器人视角进行评审的评论。文章覆盖了图基于模型的基础知识,包括建筑、训练过程和应用。它还讨论了在实践中的进展和挑战,如感知、决策和控制的集成。最后,文章提供了许多机器人应用,例如身体和接触模型、机器人操作、动作识别、队伍运策划等,它们均可以从图学习中受益。这篇评论的目标是为读者提供图神经建筑在机器人领域的能力和局限性的全面了解,并高亮未来研究的可能性。

Compositional Servoing by Recombining Demonstrations

  • paper_url: http://arxiv.org/abs/2310.04271
  • repo_url: None
  • paper_authors: Max Argus, Abhijeet Nayak, Martin Büchner, Silvio Galesso, Abhinav Valada, Thomas Brox
  • for: 本文旨在提高视觉服务器的任务转移能力和多任务能力,并使其更加稳定和高精度。
  • methods: 本文使用图 traversal 方法来解决视觉服务器任务,并通过分解和重新组合示例来实现多任务能力。
  • results: 实验和实践结果表明,我们的方法可以提高任务相关的成功率,并且在高精度场景下达到更高的稳定性和效率。
    Abstract Learning-based manipulation policies from image inputs often show weak task transfer capabilities. In contrast, visual servoing methods allow efficient task transfer in high-precision scenarios while requiring only a few demonstrations. In this work, we present a framework that formulates the visual servoing task as graph traversal. Our method not only extends the robustness of visual servoing, but also enables multitask capability based on a few task-specific demonstrations. We construct demonstration graphs by splitting existing demonstrations and recombining them. In order to traverse the demonstration graph in the inference case, we utilize a similarity function that helps select the best demonstration for a specific task. This enables us to compute the shortest path through the graph. Ultimately, we show that recombining demonstrations leads to higher task-respective success. We present extensive simulation and real-world experimental results that demonstrate the efficacy of our approach.
    摘要 学习基于图像输入的掌控策略经常表现出任务传递能力不足。相比之下,视觉服务方法可以在高精度场景中实现高效的任务传递,只需要几个示范。在这种工作中,我们提出了一种将视觉服务任务转换为图论探索的框架。我们的方法不仅扩展了视觉服务的稳定性,还允许基于几个任务特定示范的多任务能力。我们将示范图分解成多个示范,并将它们重新组合在一起。为在推理情况下探索示范图,我们利用一个相似性函数来选择最佳示范。这使得我们可以计算最短路径。最终,我们发现将示范重新组合可以获得更高的任务特定成功率。我们在 simulate 和实际实验中展示了我们的方法的有效性。
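
    The graph-traversal idea in the abstract can be pictured with a small sketch: split demonstrations become nodes, feasible transitions become edges weighted by a similarity function, and the system servos along the shortest path. The similarity function, the directed-edge convention, and the toy data below are placeholders, not the paper's components.

```python
import heapq

def shortest_demo_path(segments, edges, similarity, start, goal):
    """Find a low-cost chain of demonstration segments from `start` to `goal`.
    segments: dict node_id -> segment descriptor (e.g. an image embedding).
    edges: iterable of (u, v) pairs meaning segment v may follow segment u.
    similarity(a, b): higher = easier transition; converted into an edge cost."""
    graph = {u: [] for u in segments}
    for u, v in edges:
        graph[u].append((v, 1.0 / (1e-6 + similarity(segments[u], segments[v]))))
    dist, prev = {start: 0.0}, {}
    pq = [(0.0, start)]
    while pq:                                   # plain Dijkstra over demonstration segments
        d, u = heapq.heappop(pq)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, c in graph[u]:
            nd = d + c
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    path, node = [goal], goal                   # reconstruct the chain to servo through
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1]

# Toy usage: segments are 1-D "embeddings", similarity decays with distance.
segs = {i: float(i) for i in range(5)}
edges = [(0, 1), (1, 2), (0, 3), (3, 4), (2, 4)]
sim = lambda a, b: 1.0 / (1.0 + abs(a - b))
print(shortest_demo_path(segs, edges, sim, start=0, goal=4))
```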

Collaborative Camouflaged Object Detection: A Large-Scale Dataset and Benchmark

  • paper_url: http://arxiv.org/abs/2310.04253
  • repo_url: https://github.com/zc199823/bbnet--cocod
  • paper_authors: Cong Zhang, Hongbo Bi, Tian-Zhu Xiang, Ranwan Wu, Jinghui Tong, Xiufang Wang
  • for: 这个论文是为了研究一种新的隐身物体检测任务(CoCOD),该任务的目标是从一组相关图像中同时检测具有同样属性的隐身物体。
  • methods: 作者提出了一种基eline模型,名为 bilateral-branch network (BBNet),该模型在单个图像和图像组之间进行协同探索和综合隐身征 clue,以实现准确的隐身物体检测。
  • results: 作者在提出的 CoCOD8K dataset上进行了广泛的实验,并与 18 种现有模型进行比较。结果表明,提出的方法和模型在 CoCOD 任务中表现出了显著的优异性。
    Abstract In this paper, we provide a comprehensive study on a new task called collaborative camouflaged object detection (CoCOD), which aims to simultaneously detect camouflaged objects with the same properties from a group of relevant images. To this end, we meticulously construct the first large-scale dataset, termed CoCOD8K, which consists of 8,528 high-quality and elaborately selected images with object mask annotations, covering 5 superclasses and 70 subclasses. The dataset spans a wide range of natural and artificial camouflage scenes with diverse object appearances and backgrounds, making it a very challenging dataset for CoCOD. Besides, we propose the first baseline model for CoCOD, named bilateral-branch network (BBNet), which explores and aggregates co-camouflaged cues within a single image and between images within a group, respectively, for accurate camouflaged object detection in given images. This is implemented by an inter-image collaborative feature exploration (CFE) module, an intra-image object feature search (OFS) module, and a local-global refinement (LGR) module. We benchmark 18 state-of-the-art models, including 12 COD algorithms and 6 CoSOD algorithms, on the proposed CoCOD8K dataset under 5 widely used evaluation metrics. Extensive experiments demonstrate the effectiveness of the proposed method and the significantly superior performance compared to other competitors. We hope that our proposed dataset and model will boost growth in the COD community. The dataset, model, and results will be available at: https://github.com/zc199823/BBNet--CoCOD.
    摘要 在这篇论文中,我们对一项新任务,即协同掩饰物体检测(CoCOD)进行了全面研究,该任务的目标是从一组相关图像中同时检测具有相同属性的掩饰物体。为此,我们精心构建了首个大规模数据集 CoCOD8K,包含 8,528 张高质量、精心挑选并带有对象掩码标注的图像,涵盖 5 个超类和 70 个子类。该数据集覆盖了多种自然与人工掩饰场景,物体外观和背景多样,对 CoCOD 极具挑战性。此外,我们提出了首个 CoCOD 基线模型 bilateral-branch network(BBNet),分别在图像组之间和单幅图像内部探索并聚合协同掩饰线索,以实现对给定图像中掩饰物体的准确检测;这通过图像间协同特征探索(CFE)模块、图像内对象特征搜索(OFS)模块以及局部-全局细化(LGR)模块实现。我们在 CoCOD8K 上以 5 个常用评价指标对 18 个最先进模型(包括 12 个 COD 算法和 6 个 CoSOD 算法)进行了基准测试。大量实验证明了所提方法的有效性,其性能显著优于其他竞争方法。我们希望所提出的数据集和模型能够推动 COD 社区的发展。数据集、模型和结果将在 https://github.com/zc199823/BBNet--CoCOD 公开。

Semantic segmentation of longitudinal thermal images for identification of hot and cool spots in urban areas

  • paper_url: http://arxiv.org/abs/2310.04247
  • repo_url: None
  • paper_authors: Vasantha Ramani, Pandarasamy Arjunan, Kameshwar Poolla, Clayton Miller
  • for: This paper aims to analyze thermal images collected at the neighborhood scale to identify hot and cool spots in urban areas, with the goal of helping urban planners develop strategies to mitigate the urban heat island (UHI) effect, improve building energy efficiency, and maximize outdoor thermal comfort.
  • methods: The authors use state-of-the-art deep learning models to segment various urban features such as buildings, vegetation, sky, and roads from thermal images. They train the models using a subset of the thermal image dataset and compare the performance of different models, including U-Net, DeepLabV3, DeepLabV3+, FPN, and PSPNet.
  • results: The U-Net segmentation model with a `resnet34' CNN backbone achieves the highest mIoU score of 0.99 on the test dataset, and the masks generated using the segmentation models accurately extract the temperature from thermal images, closely matching the temperature extracted using ground-truth masks. The masks are then used to identify hot and cool spots in the urban features at various instances of time.
    Abstract This work presents the analysis of semantically segmented, longitudinally, and spatially rich thermal images collected at the neighborhood scale to identify hot and cool spots in urban areas. An infrared observatory was operated over a few months to collect thermal images of different types of buildings on the educational campus of the National University of Singapore. A subset of the thermal image dataset was used to train state-of-the-art deep learning models to segment various urban features such as buildings, vegetation, sky, and roads. It was observed that the U-Net segmentation model with `resnet34' CNN backbone has the highest mIoU score of 0.99 on the test dataset, compared to other models such as DeepLabV3, DeeplabV3+, FPN, and PSPnet. The masks generated using the segmentation models were then used to extract the temperature from thermal images and correct for differences in the emissivity of various urban features. Further, various statistical measure of the temperature extracted using the predicted segmentation masks is shown to closely match the temperature extracted using the ground truth masks. Finally, the masks were used to identify hot and cool spots in the urban feature at various instances of time. This forms one of the very few studies demonstrating the automated analysis of thermal images, which can be of potential use to urban planners for devising mitigation strategies for reducing the urban heat island (UHI) effect, improving building energy efficiency, and maximizing outdoor thermal comfort.
    摘要 本文分析了在街区尺度上采集的、经过语义分割的、时间与空间信息丰富的热成像图像,以识别城市区域中的热点与冷点。研究团队在新加坡国立大学校园运行了数月的红外观测站,采集不同类型建筑的热成像图像,并使用其中一部分数据训练多种最先进的深度学习模型,对建筑、植被、天空和道路等城市要素进行分割。结果显示,采用 resnet34 骨干网络的 U-Net 分割模型在测试集上取得了最高的 mIoU(0.99),优于 DeepLabV3、DeepLabV3+、FPN 和 PSPNet 等模型。随后利用分割掩码从热成像中提取温度,并校正不同城市要素发射率的差异;基于预测掩码提取的温度统计量与基于真值掩码的结果高度吻合。最后,利用这些掩码识别了不同时刻城市要素中的热点与冷点。该研究是少数展示热成像自动分析的工作之一,可为城市规划者制定缓解城市热岛效应、提升建筑能效和改善室外热舒适度的策略提供参考。

Enhancing the Authenticity of Rendered Portraits with Identity-Consistent Transfer Learning

  • paper_url: http://arxiv.org/abs/2310.04194
  • repo_url: None
  • paper_authors: Luyuan Wang, Yiqian Wu, Yongliang Yang, Chen Liu, Xiaogang Jin
  • for: 该论文旨在提高计算机图形学中的虚拟人像生成质量,并减少’’uncanny valley’’效应。
  • methods: 该论文使用了传输学习来学习一个映射,从虚拟人像的特征空间传递到真实人像的特征空间。
  • results: 该论文通过对 DRFHQ 数据集进行精心适应,使用 StyleGAN2 生成器,实现了提高虚拟人像的真实感和减少’’uncanny valley’’效应。
    Abstract Despite rapid advances in computer graphics, creating high-quality photo-realistic virtual portraits is prohibitively expensive. Furthermore, the well-know ''uncanny valley'' effect in rendered portraits has a significant impact on the user experience, especially when the depiction closely resembles a human likeness, where any minor artifacts can evoke feelings of eeriness and repulsiveness. In this paper, we present a novel photo-realistic portrait generation framework that can effectively mitigate the ''uncanny valley'' effect and improve the overall authenticity of rendered portraits. Our key idea is to employ transfer learning to learn an identity-consistent mapping from the latent space of rendered portraits to that of real portraits. During the inference stage, the input portrait of an avatar can be directly transferred to a realistic portrait by changing its appearance style while maintaining the facial identity. To this end, we collect a new dataset, Daz-Rendered-Faces-HQ (DRFHQ), that is specifically designed for rendering-style portraits. We leverage this dataset to fine-tune the StyleGAN2 generator, using our carefully crafted framework, which helps to preserve the geometric and color features relevant to facial identity. We evaluate our framework using portraits with diverse gender, age, and race variations. Qualitative and quantitative evaluations and ablation studies show the advantages of our method compared to state-of-the-art approaches.
    摘要 尽管计算机图形技术得到了快速的进步,但创建高质量的图像化人脸仍然是非常昂贵的。此外,在rendered portrait中的“uncanny valley”效应也对用户体验产生了显著的影响,特别是当描述的人脸非常真实时,任何小误差都可能引起 eeriness和repulsiveness的感受。在这篇论文中,我们提出了一种新的图像化人脸框架,可以有效减少“uncanny valley”效应,提高渲染人脸的 authenticity。我们的关键思想是通过转移学习学习一个人脸的概率空间中的mapping,以便在渲染人脸的过程中保持人脸的facial identity。在推理阶段,输入的人脸可以直接被转换为真实的人脸,只需要改变其外观风格,而不会失去人脸的特征。为了实现这一目标,我们收集了一个新的数据集,DRFHQ(Daz-Rendered-Faces-HQ),这个数据集专门用于渲染风格的人脸。我们利用这个数据集来精心调整StyleGAN2生成器,使其保持人脸的几何和颜色特征。我们对这种方法进行了质量和量化的评估,以及减少方法的ablation study,以证明我们的方法与当前的方法相比有优势。

Bridging the Gap between Human Motion and Action Semantics via Kinematic Phrases

  • paper_url: http://arxiv.org/abs/2310.04189
  • repo_url: None
  • paper_authors: Xinpeng Liu, Yong-Lu Li, Ailing Zeng, Zizheng Zhou, Yang You, Cewu Lu
  • for: 这篇论文的目的是建立一个可靠的动作 Semantic 和动作之间的映射关系,但是这是一个复杂的多对多问题。
  • methods: 我们提出了 Kinematic Phrases (KP) 作为一种 mediator,使得可以统一动作知识库并建立动作理解系统。KP 可以自动将动作转换为文本描述,无需主观偏见,这也附生出了一种新的自动动作生成比赛指标——Kinematic Prompt Generation (KPG)。
  • results: 在广泛的实验中,我们的方法表现出了超过其他方法的优势。
    Abstract The goal of motion understanding is to establish a reliable mapping between motion and action semantics, while it is a challenging many-to-many problem. An abstract action semantic (i.e., walk forwards) could be conveyed by perceptually diverse motions (walk with arms up or swinging), while a motion could carry different semantics w.r.t. its context and intention. This makes an elegant mapping between them difficult. Previous attempts adopted direct-mapping paradigms with limited reliability. Also, current automatic metrics fail to provide reliable assessments of the consistency between motions and action semantics. We identify the source of these problems as the significant gap between the two modalities. To alleviate this gap, we propose Kinematic Phrases (KP) that take the objective kinematic facts of human motion with proper abstraction, interpretability, and generality characteristics. Based on KP as a mediator, we can unify a motion knowledge base and build a motion understanding system. Meanwhile, KP can be automatically converted from motions and to text descriptions with no subjective bias, inspiring Kinematic Prompt Generation (KPG) as a novel automatic motion generation benchmark. In extensive experiments, our approach shows superiority over other methods. Our code and data would be made publicly available at https://foruck.github.io/KP.
    摘要 动作理解的目标是建立动作与动作语义之间的可靠映射,而这是一个具有挑战性的多对多问题。一个抽象的动作语义(例如向前走)可以由感知上多样的动作来表达(如抬手走或摆臂走),而同一个动作在不同的上下文和意图下也可能承载不同的语义,这使得二者之间难以建立优雅的映射。以往的尝试采用直接映射的范式,可靠性有限;同时,现有的自动指标也无法可靠地评估动作与动作语义之间的一致性。我们认为问题的根源在于这两种模态之间的巨大差距。为缓解这一差距,我们提出了运动学短语(Kinematic Phrases, KP),它以适当的抽象性、可解释性和通用性刻画人体运动的客观运动学事实。以 KP 为中介,我们可以统一动作知识库并构建动作理解系统。此外,KP 可以自动地从动作转换而来,并转换为无主观偏差的文本描述,由此引出一种新的自动动作生成基准,即运动学提示生成(Kinematic Prompt Generation, KPG)。在大量实验中,我们的方法表现优于其他方法。代码和数据将在 https://foruck.github.io/KP 公开。

Whole Slide Multiple Instance Learning for Predicting Axillary Lymph Node Metastasis

  • paper_url: http://arxiv.org/abs/2310.04187
  • repo_url: https://github.com/glejdis/whole-slide-mil-for-predicting-axillary-lymph-node-metastasis
  • paper_authors: Glejdis Shkëmbi, Johanna P. Müller, Zhe Li, Katharina Breininger, Peter Schüffler, Bernhard Kainz
  • for: 本研究旨在开发一种基于深度学习(深度学习)的分类管道,用于从数字核心针刺样本(CNB)图像中提取临床信息,比现有方法减少一步。
  • methods: 本研究使用了一个公共可用的数据集,包含1058名患者的数据,以评估不同基线状态的深度学习模型在分类肿瘤 метастаisis的状况基于CNB图像。此外,还进行了一项广泛的数据扩充研究。
  • results: 研究发现,使用不同的数据扩充技术可以提高模型的性能,并且手动肿瘤 segmentation 和注释步骤进行了评估。
    Abstract Breast cancer is a major concern for women's health globally, with axillary lymph node (ALN) metastasis identification being critical for prognosis evaluation and treatment guidance. This paper presents a deep learning (DL) classification pipeline for quantifying clinical information from digital core-needle biopsy (CNB) images, with one step less than existing methods. A publicly available dataset of 1058 patients was used to evaluate the performance of different baseline state-of-the-art (SOTA) DL models in classifying ALN metastatic status based on CNB images. An extensive ablation study of various data augmentation techniques was also conducted. Finally, the manual tumor segmentation and annotation step performed by the pathologists was assessed.
    摘要 乳腺癌是全球女性健康的重大问题,腋窝淋巴结(ALN)转移的识别对预后评估和治疗指导至关重要。本文提出了一种深度学习(DL)分类流程,用于从数字化粗针穿刺活检(CNB)图像中量化临床信息,比现有方法少一个步骤。我们使用一个包含1058名患者的公开数据集,评估了多种最先进的基线DL模型基于CNB图像分类ALN转移状态的性能,并对多种数据增强技术进行了广泛的消融研究。最后,还评估了由病理医生执行的手动肿瘤分割与标注步骤。

DiffPrompter: Differentiable Implicit Visual Prompts for Semantic-Segmentation in Adverse Conditions

  • paper_url: http://arxiv.org/abs/2310.04181
  • repo_url: None
  • paper_authors: Sanket Kalwar, Mihir Ungarala, Shruti Jain, Aaron Monis, Krishna Reddy Konda, Sourav Garg, K Madhava Krishna
  • for: 这篇论文目的是提高自动驾驶系统在不良天气情况下的Semantic segmentation能力。
  • methods: 这篇论文提出了一种新的可微分的视觉和latent prompting机制,可以扩展现有的adaptor在基础模型中的学习能力。我们的提案的 $\nabla$HFC图像处理封页在不良天气情况下表现出色,而传统方法往往无法应对。
  • results: 我们的方法可以将visual prompts和latent prompts联合训练,实现了在Out-of-distribution情况下的表现优化。我们的可微分的视觉提醒可以充分利用平行和串行架构,实现更好地提高object segmentation任务的性能。经过了一系列的实验和评估,我们提供了实践证据支持我们的方法的有效性。
    Abstract Semantic segmentation in adverse weather scenarios is a critical task for autonomous driving systems. While foundation models have shown promise, the need for specialized adaptors becomes evident for handling more challenging scenarios. We introduce DiffPrompter, a novel differentiable visual and latent prompting mechanism aimed at expanding the learning capabilities of existing adaptors in foundation models. Our proposed $\nabla$HFC image processing block excels particularly in adverse weather conditions, where conventional methods often fall short. Furthermore, we investigate the advantages of jointly training visual and latent prompts, demonstrating that this combined approach significantly enhances performance in out-of-distribution scenarios. Our differentiable visual prompts leverage parallel and series architectures to generate prompts, effectively improving object segmentation tasks in adverse conditions. Through a comprehensive series of experiments and evaluations, we provide empirical evidence to support the efficacy of our approach. Project page at https://diffprompter.github.io.
    摘要 “严阵天气下的 semantic segmentation 是自动驾驶系统中的一个重要任务。 Foundation model 已经显示了承认的能力,但对于更加具体的enario 需要特殊的 adaptor 来扩展学习能力。我们介绍 DiffPrompter,一种新的可 differentiable 的visual 和 latent 启发机制,用于扩展现有 adaptor 的学习能力。我们的 proposed $\nabla$HFC 图像处理封页在恶劣天气下表现特别出色, conventional 方法通常在这些情况下失败。此外,我们调查了同时训练 visual 和 latent 启发的共同优点,并证明这种结合方法可以在非常规情况下明显提高性能。我们的可 differentiable 的 visual 启发使用并行和串行架构来生成启发,实际地改善了对于恶劣天气的object segmentation任务。通过了一系列的实验和评估,我们提供了实践证据支持我们的方法的有效性。Project page 为 https://diffprompter.github.io。”

Degradation-Aware Self-Attention Based Transformer for Blind Image Super-Resolution

  • paper_url: http://arxiv.org/abs/2310.04180
  • repo_url: https://github.com/i2-multimedia-lab/dsat
  • paper_authors: Qingguo Liu, Pan Gao, Kang Han, Ningzhong Liu, Wei Xiang
  • for: 提出了一种基于Transformer的盲超分辨率网络模型,以适应各种不确定噪声的环境。
  • methods: 该模型 integrates CNN和Transformer两种组件,首先使用CNN模ulated by degradation information来EXTRACT LOCAL FEATURES,然后employs degradation-aware Transformer来EXTRACT GLOBAL SEMANTIC FEATURES。
  • results: 对多个流行的大规模 benchmark dataset进行测试,实现了与现有方法相比的最佳性能,包括Urban100 dataset的PSNR提高0.94 dB和26.62 dB。Source code可以在https://github.com/I2-Multimedia-Lab/DSAT/tree/main中获取。
    Abstract Compared to CNN-based methods, Transformer-based methods achieve impressive image restoration outcomes due to their abilities to model remote dependencies. However, how to apply Transformer-based methods to the field of blind super-resolution (SR) and further make an SR network adaptive to degradation information is still an open problem. In this paper, we propose a new degradation-aware self-attention-based Transformer model, where we incorporate contrastive learning into the Transformer network for learning the degradation representations of input images with unknown noise. In particular, we integrate both CNN and Transformer components into the SR network, where we first use the CNN modulated by the degradation information to extract local features, and then employ the degradation-aware Transformer to extract global semantic features. We apply our proposed model to several popular large-scale benchmark datasets for testing, and achieve the state-of-the-art performance compared to existing methods. In particular, our method yields a PSNR of 32.43 dB on the Urban100 dataset at $\times$2 scale, 0.94 dB higher than DASR, and 26.62 dB on the Urban100 dataset at $\times$4 scale, 0.26 dB improvement over KDSR, setting a new benchmark in this area. Source code is available at: https://github.com/I2-Multimedia-Lab/DSAT/tree/main.
    摘要 Comparing to CNN-based methods, Transformer-based methods achieve impressive image restoration outcomes due to their ability to model remote dependencies. However, how to apply Transformer-based methods to the field of blind super-resolution (SR) and further make an SR network adaptive to degradation information is still an open problem. In this paper, we propose a new degradation-aware self-attention-based Transformer model, where we incorporate contrastive learning into the Transformer network for learning the degradation representations of input images with unknown noise. In particular, we integrate both CNN and Transformer components into the SR network, where we first use the CNN modulated by the degradation information to extract local features, and then employ the degradation-aware Transformer to extract global semantic features. We apply our proposed model to several popular large-scale benchmark datasets for testing, and achieve the state-of-the-art performance compared to existing methods. In particular, our method yields a PSNR of 32.43 dB on the Urban100 dataset at $\times$2 scale, 0.94 dB higher than DASR, and 26.62 dB on the Urban100 dataset at $\times$4 scale, 0.26 dB improvement over KDSR, setting a new benchmark in this area. 源代码可以在 GitHub 上获取:https://github.com/I2-Multimedia-Lab/DSAT/tree/main.

Entropic Score metric: Decoupling Topology and Size in Training-free NAS

  • paper_url: http://arxiv.org/abs/2310.04179
  • repo_url: None
  • paper_authors: Niccolò Cavagnero, Luca Robbiano, Francesca Pistilli, Barbara Caputo, Giuseppe Averta
  • for: 这个研究旨在提高适用于边缘应用的高性能卷积神经网络设计,特别是面临资源受限的实际应用情况下。
  • methods: 本研究提出了一个新的训练自由度量表(Entropic Score),用于估算神经网络的表达能力,以及一种循环搜索算法来独立地搜索神经网络的结构和大小。
  • results: 本研究获得了在 less than 1 GPU 小时内完全设计高性能的 Hybrid Transformers 模型,并在 ImageNet 类别任务上获得了最高精度和最快速的 NAS 方法。
    Abstract Neural Networks design is a complex and often daunting task, particularly for resource-constrained scenarios typical of mobile-sized models. Neural Architecture Search is a promising approach to automate this process, but existing competitive methods require large training time and computational resources to generate accurate models. To overcome these limits, this paper contributes with: i) a novel training-free metric, named Entropic Score, to estimate model expressivity through the aggregated element-wise entropy of its activations; ii) a cyclic search algorithm to separately yet synergistically search model size and topology. Entropic Score shows remarkable ability in searching for the topology of the network, and a proper combination with LogSynflow, to search for model size, yields superior capability to completely design high-performance Hybrid Transformers for edge applications in less than 1 GPU hour, resulting in the fastest and most accurate NAS method for ImageNet classification.
    摘要 neural networks 设计是一个复杂和具有挑战性的任务,特别是在移动设备上进行训练的小型模型 scenario 下。 neuronal architecture search 是一种有前途的方法,可以自动化这个过程,但现有的竞争性方法具有大量训练时间和计算资源,以生成高精度模型。为了突破这些限制,这篇论文做出了以下贡献:1. 一种新的训练时间无关的指标, named Entropic Score,可以通过汇集 activations 的元素级 entropy 来估算模型表达能力。2. 一种循环搜索算法,可以分别 yet synergistically 搜索模型的结构和大小。 Entropic Score 表现出了remarkable 的能力来搜索模型的结构,而与 LogSynflow 的组合可以在 less than 1 GPU 小时内完全设计高性能的 Hybrid Transformers 模型,并在 ImageNet 预测中达到最快和最准确的 NAS 方法。
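
    The exact Entropic Score is defined in the paper; as a loose illustration of the general recipe of a training-free, activation-entropy proxy, the sketch below pushes random batches through an untrained candidate network, estimates a histogram entropy of its activations, and aggregates that into a single score. The histogram estimator, the hooked layer type, and the random inputs are our assumed ingredients, not the paper's metric.

```python
import numpy as np
import torch
import torch.nn as nn

def entropy_bits(samples, n_bins=32):
    """Histogram-based entropy estimate of a 1-D sample (illustrative only)."""
    hist, _ = np.histogram(samples, bins=n_bins)
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def entropic_score(model, input_shape=(8, 3, 32, 32), n_batches=4):
    """Training-free expressivity proxy: aggregate activation entropy over
    forward passes of a randomly initialized network on random inputs."""
    acts = []
    hooks = [m.register_forward_hook(lambda _m, _i, o: acts.append(o.detach()))
             for m in model.modules() if isinstance(m, nn.ReLU)]
    score = 0.0
    with torch.no_grad():
        for _ in range(n_batches):
            acts.clear()
            model(torch.randn(*input_shape))
            score += sum(entropy_bits(a.flatten().numpy()) for a in acts)
    for h in hooks:
        h.remove()
    return score / n_batches

# Example: score a small untrained CNN candidate (topology search would compare many such scores).
candidate = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
print(entropic_score(candidate))
```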

Improving Neural Radiance Field using Near-Surface Sampling with Point Cloud Generation

  • paper_url: http://arxiv.org/abs/2310.04152
  • repo_url: None
  • paper_authors: Hye Bin Yoo, Hyun Min Han, Sung Soo Hwang, Il Yong Chun
  • for: 提高NeRF的渲染质量和减少训练时间
  • methods: 采用近表面抽象法,使用训练集中的深度图像来估算3D对象的表面,并且在这个表面附近进行采样。同时,该方法还提出了一种3D点云生成方法和一种简单的修正方法来获取novel view中的深度信息。
  • results: 实验结果显示,提出的近表面采样NeRF框架可以显著提高NeRF的渲染质量,并且可以减少NeRF模型的训练时间。
    Abstract Neural radiance field (NeRF) is an emerging view synthesis method that samples points in a three-dimensional (3D) space and estimates their existence and color probabilities. The disadvantage of NeRF is that it requires a long training time since it samples many 3D points. In addition, if one samples points from occluded regions or in the space where an object is unlikely to exist, the rendering quality of NeRF can be degraded. These issues can be solved by estimating the geometry of 3D scene. This paper proposes a near-surface sampling framework to improve the rendering quality of NeRF. To this end, the proposed method estimates the surface of a 3D object using depth images of the training set and sampling is performed around there only. To obtain depth information on a novel view, the paper proposes a 3D point cloud generation method and a simple refining method for projected depth from a point cloud. Experimental results show that the proposed near-surface sampling NeRF framework can significantly improve the rendering quality, compared to the original NeRF and a state-of-the-art depth-based NeRF method. In addition, one can significantly accelerate the training time of a NeRF model with the proposed near-surface sampling framework.
    摘要 神经辐射场(NeRF)是一种崛起的视图合成方法,它在三维空间中随机 sampling 点并估算它们的存在和颜色概率。NeRF 的缺点是它需要训练时间很长,因为它需要随机 sampling 大量的三维点。此外,如果从遮盖区域或不可能存在的空间中随机 sampling 点,NeRF 的渲染质量将受到降低。这些问题可以通过估算三维场景的geometry来解决。这篇论文提议一种靠近表面 sampling 框架,以改善 NeRF 的渲染质量。为此,提议方法使用训练集的深度图像来估算三维 объек 的表面,然后在那里进行随机 sampling。要在新视图中获取深度信息,论文提议一种三维点云生成方法和一种简单的修正方法。实验结果表明,提议的靠近表面 sampling NeRF 框架可以 significatively 改善 NeRF 的渲染质量,相比于原始 NeRF 和一种状态流行的深度基于 NeRF 方法。此外,可以通过提议的靠近表面 sampling 框架快速加速 NeRF 模型的训练时间。
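
    A minimal sketch of the near-surface sampling step described above, assuming a per-ray surface depth estimate is available from the projected point cloud; the band width, the fallback range, and the NaN convention for rays without an estimate are our illustrative choices.

```python
import numpy as np

def near_surface_samples(depth, n_samples=32, band=0.05, near=0.1, far=4.0):
    """Per-ray sample distances for volume rendering.
    depth: (R,) estimated surface depth per ray (np.nan where unknown).
    Rays with a depth estimate are sampled in [depth - band, depth + band];
    rays without one fall back to uniform sampling in [near, far]."""
    depth = np.asarray(depth, dtype=np.float64)
    u = np.random.rand(depth.shape[0], n_samples)
    lo = np.where(np.isnan(depth), near, np.clip(depth - band, near, far))
    hi = np.where(np.isnan(depth), far,  np.clip(depth + band, near, far))
    t = lo[:, None] + u * (hi - lo)[:, None]
    return np.sort(t, axis=1)   # NeRF compositing expects sorted distances along the ray

# Example: 3 rays, two with surface estimates and one unknown.
print(near_surface_samples(np.array([1.2, 0.8, np.nan]), n_samples=5))
```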

TiC: Exploring Vision Transformer in Convolution

  • paper_url: http://arxiv.org/abs/2310.04134
  • repo_url: https://github.com/zs670980918/msa-conv
  • paper_authors: Song Zhang, Qingzhong Wang, Jiang Bian, Haoyi Xiong
  • for: 提高transformer模型在不同尺度图像处理中的灵活性和计算效率,即使不需要重新训练或resize图像。
  • methods: 提出了Multi-Head Self-Attention Convolution(MSA-Conv),它将自我注意力 incorporated into generalized convolutions,包括标准、扩展和深度的卷积。
  • results: 提出了Vision Transformer in Convolution(TiC),并实现了两种可能性提高策略:Multi-Directional Cyclic Shifted Mechanism和Inter-Pooling Mechanism。通过实验证明了TiC的总效果,并通过精准权重分析证明了MSA-Conv和两种可能性提高策略的性能提升。
    Abstract While models derived from Vision Transformers (ViTs) have been phonemically surging, pre-trained models cannot seamlessly adapt to arbitrary resolution images without altering the architecture and configuration, such as sampling the positional encoding, limiting their flexibility for various vision tasks. For instance, the Segment Anything Model (SAM) based on ViT-Huge requires all input images to be resized to 1024$\times$1024. To overcome this limitation, we propose the Multi-Head Self-Attention Convolution (MSA-Conv) that incorporates Self-Attention within generalized convolutions, including standard, dilated, and depthwise ones. Enabling transformers to handle images of varying sizes without retraining or rescaling, the use of MSA-Conv further reduces computational costs compared to global attention in ViT, which grows costly as image size increases. Later, we present the Vision Transformer in Convolution (TiC) as a proof of concept for image classification with MSA-Conv, where two capacity enhancing strategies, namely Multi-Directional Cyclic Shifted Mechanism and Inter-Pooling Mechanism, have been proposed, through establishing long-distance connections between tokens and enlarging the effective receptive field. Extensive experiments have been carried out to validate the overall effectiveness of TiC. Additionally, ablation studies confirm the performance improvement made by MSA-Conv and the two capacity enhancing strategies separately. Note that our proposal aims at studying an alternative to the global attention used in ViT, while MSA-Conv meets our goal by making TiC comparable to state-of-the-art on ImageNet-1K. Code will be released at https://github.com/zs670980918/MSA-Conv.
    摘要 虽然基于 Vision Transformers (ViTs) 的模型发展迅猛,但预训练模型无法在不改变架构和配置(例如位置编码的采样)的情况下无缝适应任意分辨率的图像,限制了其在各种视觉任务中的灵活性。例如,基于 ViT-Huge 的 Segment Anything Model (SAM) 要求所有输入图像缩放到 1024×1024。为克服这一限制,我们提出 Multi-Head Self-Attention Convolution (MSA-Conv),将自注意力融入标准、扩张和深度等广义卷积之中。MSA-Conv 使 transformer 能够处理不同尺寸的图像而无需重新训练或缩放,并且相比 ViT 中随图像尺寸增大而代价高昂的全局注意力,进一步降低了计算成本。随后,我们提出 Vision Transformer in Convolution (TiC) 作为使用 MSA-Conv 进行图像分类的概念验证,并提出两种提升容量的策略:Multi-Directional Cyclic Shifted Mechanism 与 Inter-Pooling Mechanism,通过在 token 之间建立长距离连接并扩大有效感受野。大量实验验证了 TiC 的整体有效性,消融实验也分别证实了 MSA-Conv 与两种容量提升策略带来的性能改进。需要说明的是,我们的目标是研究 ViT 中全局注意力的替代方案,而 MSA-Conv 达成了这一目标,使 TiC 在 ImageNet-1K 上与最先进方法相当。代码将在 https://github.com/zs670980918/MSA-Conv 发布。

VI-Diff: Unpaired Visible-Infrared Translation Diffusion Model for Single Modality Labeled Visible-Infrared Person Re-identification

  • paper_url: http://arxiv.org/abs/2310.04122
  • repo_url: None
  • paper_authors: Han Huang, Yan Huang, Liang Wang
  • for: VI-ReID task with single-modality labeled data
  • methods: unpaired image-to-image translation techniques, diffusion model (VI-Diff)
  • results: outperforms existing diffusion and GAN models, a promising solution for the VI-ReID task with single-modality labeled data
    Abstract Visible-Infrared person re-identification (VI-ReID) in real-world scenarios poses a significant challenge due to the high cost of cross-modality data annotation. Different sensing cameras, such as RGB/IR cameras for good/poor lighting conditions, make it costly and error-prone to identify the same person across modalities. To overcome this, we explore the use of single-modality labeled data for the VI-ReID task, which is more cost-effective and practical. By labeling pedestrians in only one modality (e.g., visible images) and retrieving in another modality (e.g., infrared images), we aim to create a training set containing both originally labeled and modality-translated data using unpaired image-to-image translation techniques. In this paper, we propose VI-Diff, a diffusion model that effectively addresses the task of Visible-Infrared person image translation. Through comprehensive experiments, we demonstrate that VI-Diff outperforms existing diffusion and GAN models, making it a promising solution for VI-ReID with single-modality labeled data. Our approach can be a promising solution to the VI-ReID task with single-modality labeled data and serves as a good starting point for future study. Code will be available.
    摘要 可见光-红外行人重识别(VI-ReID)在实际场景中面临跨模态数据标注成本高昂的问题。不同的感知相机(例如分别适用于良好与较差光照条件的RGB/IR相机)使得跨模态识别同一个人既昂贵又容易出错。为了解决这一问题,我们研究仅使用单模态标注数据来完成VI-ReID任务,这更加经济实用。我们只在一个模态(例如可见光图像)中标注行人,并在另一个模态(例如红外图像)中进行检索,目标是利用非配对的图像到图像翻译技术,构建一个同时包含原始标注数据和模态翻译数据的训练集。在这篇论文中,我们提出了VI-Diff,一种能够有效完成可见光-红外行人图像翻译的扩散模型。大量实验表明,VI-Diff优于现有的扩散模型和GAN模型,是单模态标注数据下VI-ReID任务的一个有前景的解决方案,并可作为未来研究的良好起点。代码将会公开。

Aorta Segmentation from 3D CT in MICCAI SEG.A. 2023 Challenge

  • paper_url: http://arxiv.org/abs/2310.04114
  • repo_url: https://github.com/Project-MONAI/MONAI
  • paper_authors: Andriy Myronenko, Dong Yang, Yufan He, Daguang Xu
  • for: 这个研究是为了提出一种自动化的血管分割方法,以帮助早期发现和监测血管疾病。
  • methods: 这个研究使用了一种名为Auto3DSeg的自动化分割方法,可以在MONAI中使用。
  • results: 这个方法在3D CT图像中对血管进行分割,得到了平均的 dice分数0.920和95%的 Hausdorff 距离6.013,这与其他参赛者相比,得到了第一名和赢得了SEG.A. 2023挑战。
    Abstract Aorta provides the main blood supply of the body. Screening of aorta with imaging helps for early aortic disease detection and monitoring. In this work, we describe our solution to the Segmentation of the Aorta (SEG.A.231) from 3D CT challenge. We use automated segmentation method Auto3DSeg available in MONAI. Our solution achieves an average Dice score of 0.920 and 95th percentile of the Hausdorff Distance (HD95) of 6.013, which ranks first and wins the SEG.A. 2023 challenge.
    摘要 主动脉承担着全身主要的血液供应。利用影像对主动脉进行筛查有助于主动脉疾病的早期发现与监测。本文介绍了我们针对三维CT主动脉分割挑战(SEG.A. 2023)的解决方案:我们使用了MONAI中提供的自动分割方法Auto3DSeg。该方案取得了0.920的平均Dice分数和6.013的95百分位Hausdorff距离(HD95),在SEG.A. 2023挑战中排名第一并获得冠军。
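
    For reference, the Dice overlap reported above can be computed for binary masks as in the short snippet below; this is a plain Dice implementation for illustration, not the challenge's official evaluation code.

```python
import numpy as np

def dice(pred, target, eps=1e-8):
    """Binary Dice overlap between a predicted and a reference segmentation mask."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

# Toy 2-D example standing in for a 3-D aorta mask: two 4x4 squares overlapping in a 3x3 block.
p = np.zeros((8, 8), dtype=bool); p[2:6, 2:6] = True
t = np.zeros((8, 8), dtype=bool); t[3:7, 3:7] = True
print(dice(p, t))   # about 0.56
```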

Dense Random Texture Detection using Beta Distribution Statistics

  • paper_url: http://arxiv.org/abs/2310.04111
  • repo_url: None
  • paper_authors: Soeren Molander
  • for: 检测粗糙随机文本使用完全连接点 sampling on image edges
  • methods: 使用完全连接点 sampling on image edges,计算点对点的L2距离,并对每个点进行边检查,如果 intersects with image edge,则添加 unity value,否则添加 zero。从而计算出完全连接边图的edge excess index,该指标在[1.0..2.0]范围内,表示不存在边的情况。
  • results: 该方法应用于实时SLAM-based moving object detection中,点受限于跟踪框(ROIs)。
    Abstract This note describes a method for detecting dense random texture using fully connected points sampled on image edges. An edge image is randomly sampled with points, the standard L2 distance is calculated between all connected points in a neighbourhood. For each point, a check is made if the point intersects with an image edge. If this is the case, a unity value is added to the distance, otherwise zero. From this an edge excess index is calculated for the fully connected edge graph in the range [1.0..2.0], where 1.0 indicate no edges. The ratio can be interpreted as a sampled Bernoulli process with unknown probability. The Bayesian posterior estimate of the probability can be associated with its conjugate prior which is a Beta($\alpha$, $\beta$) distribution, with hyper parameters $\alpha$ and $\beta$ related to the number of edge crossings. Low values of $\beta$ indicate a texture rich area, higher values less rich. The method has been applied to real-time SLAM-based moving object detection, where points are confined to tracked boxes (rois).
    摘要 本短文介绍了一种利用在图像边缘上采样的全连接点来检测稠密随机纹理的方法。首先在边缘图像上随机采样若干点,并计算邻域内所有相连点对之间的标准L2距离;对每个点,检查其连线是否与图像边缘相交,若相交则在距离上加1,否则加0。由此为全连接边图计算出一个取值范围为[1.0..2.0]的边缘过剩指数(edge excess index),其中1.0表示没有边缘相交。该比率可以解释为一个概率未知的伯努利过程的采样,其概率的贝叶斯后验估计可与其共轭先验,即 Beta(α, β) 分布相联系,超参数α和β与边缘相交次数相关。较低的β值表示纹理丰富的区域,较高的值则表示纹理较少。该方法已应用于基于SLAM的实时运动目标检测,其中采样点被限制在跟踪框(ROI)内。
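
    The note's recipe can be sketched as follows; the segment-crossing test and the exact mapping from crossing counts to the index and to the Beta parameters are one possible reading of the description, not the author's implementation.

```python
import numpy as np

def edge_excess_index(edge_map, n_points=50, rng=None):
    """edge_map: 2-D boolean array of edge pixels (e.g. output of an edge detector).
    Returns the edge excess index in [1.0, 2.0] and the Beta posterior parameters."""
    rng = rng if rng is not None else np.random.default_rng(0)
    ys, xs = np.nonzero(edge_map)
    idx = rng.choice(len(ys), size=min(n_points, len(ys)), replace=False)
    pts = np.stack([ys[idx], xs[idx]], axis=1).astype(float)

    crossings, trials = 0, 0
    for i in range(len(pts)):
        for j in range(i + 1, len(pts)):               # fully connected point graph
            p, q = pts[i], pts[j]
            steps = int(max(np.abs(q - p).max(), 1.0))
            line = p + (q - p) * np.linspace(0.0, 1.0, steps + 1)[:, None]
            interior = np.round(line[1:-1]).astype(int)  # skip the two endpoint pixels
            hit = interior.size > 0 and bool(edge_map[interior[:, 0], interior[:, 1]].any())
            crossings += int(hit)
            trials += 1

    index = 1.0 + crossings / max(trials, 1)            # 1.0 means no crossings at all
    alpha, beta = 1 + crossings, 1 + trials - crossings  # Beta posterior of the crossing probability
    return index, (alpha, beta)

# A dense random texture yields an index near 2; sparse isolated edge pixels yield a lower value.
rng = np.random.default_rng(1)
dense = rng.random((64, 64)) < 0.25
sparse = np.zeros((64, 64), dtype=bool)
sparse[8::16, 8::16] = True
print(edge_excess_index(dense), edge_excess_index(sparse))
```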

Automated 3D Segmentation of Kidneys and Tumors in MICCAI KiTS 2023 Challenge

  • paper_url: http://arxiv.org/abs/2310.04110
  • repo_url: https://github.com/Project-MONAI/MONAI
  • paper_authors: Andriy Myronenko, Dong Yang, Yufan He, Daguang Xu
  • for: 本文参加2023年度的肾茵减减挑战(KiTS),用于比较多种解决方案的肾茵分割问题。
  • methods: 本文使用MONAI中的Auto3DSeg自动分割工具进行肾茵分割。
  • results: 本文的解决方案在KiTS 2023挑战中得到了平均 dice 值为0.835和表面 dice 值为0.723,并获得了肾茵减减挑战的冠军。
    Abstract Kidney and Kidney Tumor Segmentation Challenge (KiTS) 2023 offers a platform for researchers to compare their solutions to segmentation from 3D CT. In this work, we describe our submission to the challenge using automated segmentation of Auto3DSeg available in MONAI. Our solution achieves the average dice of 0.835 and surface dice of 0.723, which ranks first and wins the KiTS 2023 challenge.
    摘要 肾脏与肾肿瘤分割挑战(KiTS)2023 为研究者提供了一个比较各自三维CT分割方案的平台。在这项工作中,我们描述了基于MONAI中自动分割方法Auto3DSeg的参赛方案。该方案取得了0.835的平均Dice和0.723的表面Dice,排名第一,赢得了KiTS 2023挑战。

ClusVPR: Efficient Visual Place Recognition with Clustering-based Weighted Transformer

  • paper_url: http://arxiv.org/abs/2310.04099
  • repo_url: None
  • paper_authors: Yifan Xu, Pourya Shamsolmoali, Jie Yang
  • for: 这篇论文的目的是提出一个新的方法来解决视觉地点识别(VPR)中的缺失和重复信息问题。
  • methods: 这篇论文使用了一个新的架构,即汇集基于权重的transformer网络(CWTNet),并引入了一个新的优化后VLAD层(OptLAD)以降低模型的维度和提高效率。
  • results: 实验结果显示,这篇论文的模型在四个VPR数据集上表现较好,并且比较简单。
    Abstract Visual place recognition (VPR) is a highly challenging task that has a wide range of applications, including robot navigation and self-driving vehicles. VPR is particularly difficult due to the presence of duplicate regions and the lack of attention to small objects in complex scenes, resulting in recognition deviations. In this paper, we present ClusVPR, a novel approach that tackles the specific issues of redundant information in duplicate regions and representations of small objects. Different from existing methods that rely on Convolutional Neural Networks (CNNs) for feature map generation, ClusVPR introduces a unique paradigm called Clustering-based Weighted Transformer Network (CWTNet). CWTNet leverages the power of clustering-based weighted feature maps and integrates global dependencies to effectively address visual deviations encountered in large-scale VPR problems. We also introduce the optimized-VLAD (OptLAD) layer that significantly reduces the number of parameters and enhances model efficiency. This layer is specifically designed to aggregate the information obtained from scale-wise image patches. Additionally, our pyramid self-supervised strategy focuses on extracting representative and diverse information from scale-wise image patches instead of entire images, which is crucial for capturing representative and diverse information in VPR. Extensive experiments on four VPR datasets show our model's superior performance compared to existing models while being less complex.
    摘要 视觉地点识别(VPR)是一项极具挑战性的任务,在机器人导航和自动驾驶等领域有广泛应用。VPR 的难点在于复杂场景中存在重复区域,以及对小物体关注不足,从而导致识别偏差。在这篇论文中,我们提出了 ClusVPR,一种针对重复区域中冗余信息和小物体表征这两个具体问题的新方法。与依赖卷积神经网络(CNN)生成特征图的现有方法不同,ClusVPR 引入了一种新的范式,即基于聚类的加权 Transformer 网络(CWTNet)。CWTNet 利用基于聚类的加权特征图并整合全局依赖关系,以有效应对大规模 VPR 问题中遇到的视觉偏差。我们还提出了优化的 VLAD 层(OptLAD),显著减少参数数量并提升模型效率,该层专门用于聚合来自多尺度图像块的信息。此外,我们的金字塔自监督策略侧重于从多尺度图像块而非整幅图像中提取具有代表性且多样的信息,这对 VPR 至关重要。在四个 VPR 数据集上的大量实验表明,我们的模型在复杂度更低的同时性能优于现有模型。

End-to-End Chess Recognition

  • paper_url: http://arxiv.org/abs/2310.04086
  • repo_url: None
  • paper_authors: Athanasios Masouris, Jan van Gemert
  • for: 识别棋盘配置,即从棋盘图像中识别棋子的配置。
  • methods: 我们采用深度学习模型,并提出了两种新的方法来直接从整个图像中预测棋盘配置,从而避免了传统的顺序处理方法中的错误积累和中间注解的需求。
  • results: 我们使用新建的 Chess Recognition Dataset (ChessReD) 进行训练和测试,并证明了我们的方法在这个新的标准数据集上的表现,其中 board recognition accuracy 为 15.26%(相比现有的状态的艺术而言,这是大约7倍的提升)。
    Abstract Chess recognition refers to the task of identifying the chess pieces configuration from a chessboard image. Contrary to the predominant approach that aims to solve this task through the pipeline of chessboard detection, square localization, and piece classification, we rely on the power of deep learning models and introduce two novel methodologies to circumvent this pipeline and directly predict the chessboard configuration from the entire image. In doing so, we avoid the inherent error accumulation of the sequential approaches and the need for intermediate annotations. Furthermore, we introduce a new dataset, Chess Recognition Dataset (ChessReD), specifically designed for chess recognition that consists of 10,800 images and their corresponding annotations. In contrast to existing synthetic datasets with limited angles, this dataset comprises a diverse collection of real images of chess formations captured from various angles using smartphone cameras; a sensor choice made to ensure real-world applicability. We use this dataset to both train our model and evaluate and compare its performance to that of the current state-of-the-art. Our approach in chess recognition on this new benchmark dataset outperforms related approaches, achieving a board recognition accuracy of 15.26% ($\approx$7x better than the current state-of-the-art).
    摘要 国际象棋识别指的是从棋盘图像中识别棋子的布局。与主流的棋盘检测、方格定位、棋子分类的流水线方法不同,我们依靠深度学习模型,提出两种新方法,直接从整幅图像预测棋盘布局,从而避免了顺序方法固有的误差累积,也无需中间标注。此外,我们构建了一个专为国际象棋识别设计的新数据集 Chess Recognition Dataset(ChessReD),包含10,800张图像及其对应标注。与现有视角有限的合成数据集不同,该数据集由智能手机相机从多种角度拍摄的真实棋局图像组成,以确保其真实场景适用性。我们使用该数据集训练模型,并与当前最佳方法进行比较评估。我们的方法在这一新基准上优于相关方法,棋盘识别准确率达到15.26%(约为当前最佳方法的7倍)。

  • paper_url: http://arxiv.org/abs/2310.04043
  • repo_url: https://github.com/zhanghaiwei1234/single-eye-emotion-recognition
  • paper_authors: Haiwei Zhang, Jiqing Zhang, Bo Dong, Pieter Peers, Wenwei Wu, Xiaopeng Wei, Felix Heide, Xin Yang
  • for: 这种眼镜可以识别人们的情绪,特别是在照明条件变化时。
  • methods: 这种方法使用了生物体现的事件驱动摄像机和一种新型的轻量级神经网络SEEN。
  • results: 对于单眼事件驱动摄像机和神经网络SEEN,我们在压缩数据集上进行了广泛验证和证明,并证明了该方法的有效性。
    Abstract We introduce a wearable single-eye emotion recognition device and a real-time approach to recognizing emotions from partial observations of an emotion that is robust to changes in lighting conditions. At the heart of our method is a bio-inspired event-based camera setup and a newly designed lightweight Spiking Eye Emotion Network (SEEN). Compared to conventional cameras, event-based cameras offer a higher dynamic range (up to 140 dB vs. 80 dB) and a higher temporal resolution. Thus, the captured events can encode rich temporal cues under challenging lighting conditions. However, these events lack texture information, posing problems in decoding temporal information effectively. SEEN tackles this issue from two different perspectives. First, we adopt convolutional spiking layers to take advantage of the spiking neural network's ability to decode pertinent temporal information. Second, SEEN learns to extract essential spatial cues from corresponding intensity frames and leverages a novel weight-copy scheme to convey spatial attention to the convolutional spiking layers during training and inference. We extensively validate and demonstrate the effectiveness of our approach on a specially collected Single-eye Event-based Emotion (SEE) dataset. To the best of our knowledge, our method is the first eye-based emotion recognition method that leverages event-based cameras and spiking neural network.
    摘要 我们介绍了一种可穿戴的单眼情感识别设备,以及一种能在光照条件变化下依然鲁棒地从部分观测中实时识别情感的方法。该方法的核心是受生物启发的事件相机装置和新设计的轻量级脉冲眼部情感网络(SEEN)。与传统相机相比,事件相机具有更高的动态范围(约140 dB对比80 dB)和更高的时间分辨率,因此捕获的事件能够在光照困难的条件下编码丰富的时间线索。然而,这些事件缺乏纹理信息,难以有效解码时间信息。SEEN从两个角度解决这一问题:首先,我们采用卷积脉冲层,利用脉冲神经网络解码关键时间信息的能力;其次,SEEN学习从对应的强度帧中提取必要的空间线索,并借助一种新颖的权重复制机制,在训练和推理期间将空间注意力传递给卷积脉冲层。我们在专门采集的单眼事件情感(SEE)数据集上进行了大量验证,证明了该方法的有效性。据我们所知,这是首个利用事件相机和脉冲神经网络进行基于眼部的情感识别的方法。

Robust Multimodal Learning with Missing Modalities via Parameter-Efficient Adaptation

  • paper_url: http://arxiv.org/abs/2310.03986
  • repo_url: None
  • paper_authors: Md Kaykobad Reza, Ashley Prater-Bennette, M. Salman Asif
  • for: To improve the overall performance of downstream tasks in multimodal learning
  • methods: Using low-rank adaptation and modulation of intermediate features to compensate for missing modalities
  • results: Improving robustness in multimodal learning, outperforming independent networks trained for available modality combinations in some cases.
    Abstract Multimodal learning seeks to utilize data from multiple sources to improve the overall performance of downstream tasks. It is desirable for redundancies in the data to make multimodal systems robust to missing or corrupted observations in some correlated modalities. However, we observe that the performance of several existing multimodal networks significantly deteriorates if one or multiple modalities are absent at test time. To enable robustness to missing modalities, we propose simple and parameter-efficient adaptation procedures for pretrained multimodal networks. In particular, we exploit low-rank adaptation and modulation of intermediate features to compensate for the missing modalities. We demonstrate that such adaptation can partially bridge performance drop due to missing modalities and outperform independent, dedicated networks trained for the available modality combinations in some cases. The proposed adaptation requires extremely small number of parameters (e.g., fewer than 0.7% of the total parameters in most experiments). We conduct a series of experiments to highlight the robustness of our proposed method using diverse datasets for RGB-thermal and RGB-Depth semantic segmentation, multimodal material segmentation, and multimodal sentiment analysis tasks. Our proposed method demonstrates versatility across various tasks and datasets, and outperforms existing methods for robust multimodal learning with missing modalities.
    摘要 多模态学习旨在利用多种数据来提高下游任务的总性能。可以利用多模态数据的重复性来使多模态系统具有缺失或损坏观测的某些相关模态时的Robustness。然而,我们发现许多现有的多模态网络在测试时缺失一或多个模态时表现出现较差的性能。为实现缺失模态的Robustness,我们提议使用简单和参数有效的适应过程来修改预训练的多模态网络。具体来说,我们利用低级别适应和修改中间特征来补做缺失模态。我们示出,这种适应可以部分弥补由缺失模态导致的性能下降,并在一些情况下超过独立、专门为可用模态组合培 trained的独立网络。我们的提议适应需要非常少的参数(例如, fewer than 0.7% of the total parameters in most experiments)。我们通过多种实验表明了我们提议的方法的Robustness,使用RGB-热成像、RGB-深度semantic segmentation、多模态物体 segmentation和多模态情感分析任务。我们的提议方法具有多任务多数据集的多样性,并在不同任务和数据集上超越现有的robust多模态学习方法。
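
    The low-rank adaptation mentioned above follows the familiar adapter pattern; a generic sketch of that pattern applied to one linear layer is shown below. The rank, the scaling, and where the adapters are placed are illustrative choices, not the paper's exact adaptation scheme.

```python
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update W + (alpha/r) * B @ A.
    Only A and B are trained, so each per-missing-modality adaptation stays tiny."""
    def __init__(self, base: nn.Linear, rank=4, alpha=8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                       # keep the pretrained weights frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: starts as identity update
        self.scale = alpha / rank
    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

# Example: adapt one projection of a pretrained multimodal fusion block.
fusion_proj = nn.Linear(256, 256)             # stands in for a pretrained layer
adapted = LowRankAdapter(fusion_proj, rank=4)
feats = torch.randn(8, 256)                   # features when, say, one modality is missing
out = adapted(feats)
n_trainable = sum(p.numel() for p in adapted.parameters() if p.requires_grad)
print(n_trainable, "trainable parameters vs", 256 * 256 + 256, "in the base layer")
```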

Towards Increasing the Robustness of Predictive Steering-Control Autonomous Navigation Systems Against Dash Cam Image Angle Perturbations Due to Pothole Encounters

  • paper_url: http://arxiv.org/abs/2310.03959
  • repo_url: None
  • paper_authors: Shivam Aarya
  • for: 本研究旨在提高自动驾驶车辆的稳定性和安全性,通过对摄像头数据进行修正来减少摄像头角度变化引起的迹象预测错误。
  • methods: 本研究使用了一种新的修正模型,该模型可以根据摄像头数据进行修正,以减少由摄像头角度变化引起的迹象预测错误。
  • results: 在使用公共可用数据集进行评估时,本研究发现该修正模型可以将迹象预测错误率降低至2.3%,从而提高自动驾驶车辆的稳定性和安全性。
    Abstract Vehicle manufacturers are racing to create autonomous navigation and steering control algorithms for their vehicles. These software are made to handle various real-life scenarios such as obstacle avoidance and lane maneuvering. There is some ongoing research to incorporate pothole avoidance into these autonomous systems. However, there is very little research on the effect of hitting a pothole on the autonomous navigation software that uses cameras to make driving decisions. Perturbations in the camera angle when hitting a pothole can cause errors in the predicted steering angle. In this paper, we present a new model to compensate for such angle perturbations and reduce any errors in steering control prediction algorithms. We evaluate our model on perturbations of publicly available datasets and show our model can reduce the errors in the estimated steering angle from perturbed images to 2.3%, making autonomous steering control robust against the dash cam image angle perturbations induced when one wheel of a car goes over a pothole.
    摘要 自动驾驶车制造商正在奔腾地开发自动导航和推力控制算法,以适应不同的实际景景,如避免障碍物和车道弯道。然而,关于弹射坑的影响在自动驾驶系统中的研究很少。当车辆过坑时,摄像头角度的偏移会导致驾驶控制预测错误。在这篇论文中,我们提出了一种新的模型,以减少由摄像头角度偏移引起的驾驶控制预测错误。我们使用公共可用的数据集进行评估,并证明我们的模型可以将摄像头角度偏移引起的错误降低至2.3%,使自动驾驶控制更加Robust againstdash cam image angle perturbations induced by potholes.

Understanding prompt engineering may not require rethinking generalization

  • paper_url: http://arxiv.org/abs/2310.03957
  • repo_url: None
  • paper_authors: Victor Akinwande, Yiding Jiang, Dylan Sam, J. Zico Kolter
  • for: 这篇论文旨在解释逻辑推理模型在不需要训练的情况下,如何具有良好的泛化性。
  • methods: 该论文使用的方法是通过设计提示来建立分类器,而不需要显式的训练过程。
  • results: 该论文显示,使用提示的方法可以具有remarkably tight的泛化 bound,并且可以用来 justify the widespread practice of prompt engineering,即通过设计提示来实现良好的测试性能。
    Abstract Zero-shot learning in prompted vision-language models, the practice of crafting prompts to build classifiers without an explicit training process, has achieved impressive performance in many settings. This success presents a seemingly surprising observation: these methods suffer relatively little from overfitting, i.e., when a prompt is manually engineered to achieve low error on a given training set (thus rendering the method no longer actually zero-shot), the approach still performs well on held-out test data. In this paper, we show that we can explain such performance well via recourse to classical PAC-Bayes bounds. Specifically, we show that the discrete nature of prompts, combined with a PAC-Bayes prior given by a language model, results in generalization bounds that are remarkably tight by the standards of the literature: for instance, the generalization bound of an ImageNet classifier is often within a few percentage points of the true test error. We demonstrate empirically that this holds for existing handcrafted prompts and prompts generated through simple greedy search. Furthermore, the resulting bound is well-suited for model selection: the models with the best bound typically also have the best test performance. This work thus provides a possible justification for the widespread practice of prompt engineering, even if it seems that such methods could potentially overfit the training data.
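As a rough illustration of why discrete prompts with a language-model prior can give tight guarantees, the sketch below evaluates the classical Occam/PAC-Bayes-style bound for a point-mass posterior on a single prompt, with the prior mass taken to be the language model's probability of the prompt string. The exact bound and constants used in the paper may differ, and the numbers plugged in are invented for illustration.

```python
# Hedged sketch: textbook Occam/PAC-Bayes-style bound for a *discrete*
# hypothesis (a prompt string), where the prior mass of the prompt comes
# from a language model. Not the paper's exact bound.
import math

def occam_bound(train_error: float, log_prior_prob: float,
                n_samples: int, delta: float = 0.05) -> float:
    """Upper bound on test error holding with probability >= 1 - delta.

    log_prior_prob: log P(prompt) under the language-model prior (<= 0).
    """
    complexity = -log_prior_prob + math.log(1.0 / delta)
    return train_error + math.sqrt(complexity / (2.0 * n_samples))

# Toy numbers (assumed, not from the paper): a 20-token prompt whose tokens
# each have LM probability ~0.05, evaluated on 50,000 labeled images.
log_p_prompt = 20 * math.log(0.05)
print(occam_bound(train_error=0.28, log_prior_prob=log_p_prompt,
                  n_samples=50_000))  # bound is within a few points of 0.28
```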

Gradient Descent Provably Solves Nonlinear Tomographic Reconstruction

  • paper_url: http://arxiv.org/abs/2310.03956
  • repo_url: None
  • paper_authors: Sara Fridovich-Keil, Fabrizio Valdivia, Gordon Wetzstein, Benjamin Recht, Mahdi Soltanolkotabi
  • for: This paper aims to improve the accuracy of computed tomography (CT) reconstruction and reduce artifacts, particularly in the presence of high-density materials such as metal.
  • methods: The authors propose a direct nonlinear CT reconstruction technique that bypasses the conventional preprocessing step of inverting the measurement nonlinearity. Instead, they run gradient descent through the nonlinear forward model and reconstruct the underlying signal directly from the raw measurements.
  • results: They prove that gradient descent converges to the global optimum at a geometric rate and perfectly reconstructs the signal from a near-minimal number of random measurements, with analogous guarantees in the under-determined setting when prior structural information is enforced through constraints on the optimization variables. Cone-beam CT experiments on synthetic and real 3D volumes show reduced metal artifacts compared with a commercial reconstruction of a human skull with metal dental crowns.
    Abstract In computed tomography (CT), the forward model consists of a linear Radon transform followed by an exponential nonlinearity based on the attenuation of light according to the Beer-Lambert Law. Conventional reconstruction often involves inverting this nonlinearity as a preprocessing step and then solving a convex inverse problem. However, this nonlinear measurement preprocessing required to use the Radon transform is poorly conditioned in the vicinity of high-density materials, such as metal. This preprocessing makes CT reconstruction methods numerically sensitive and susceptible to artifacts near high-density regions. In this paper, we study a technique where the signal is directly reconstructed from raw measurements through the nonlinear forward model. Though this optimization is nonconvex, we show that gradient descent provably converges to the global optimum at a geometric rate, perfectly reconstructing the underlying signal with a near minimal number of random measurements. We also prove similar results in the under-determined setting where the number of measurements is significantly smaller than the dimension of the signal. This is achieved by enforcing prior structural information about the signal through constraints on the optimization variables. We illustrate the benefits of direct nonlinear CT reconstruction with cone-beam CT experiments on synthetic and real 3D volumes. We show that this approach reduces metal artifacts compared to a commercial reconstruction of a human skull with metal dental crowns.
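A minimal numerical sketch of the direct approach, under the assumptions that a dense random matrix stands in for the Radon/cone-beam projection operator and that no constraints or priors are enforced: gradient descent is run through the Beer-Lambert nonlinearity y = exp(-Ax) rather than first inverting the measurements.

```python
# Sketch only: toy instance of direct nonlinear reconstruction.
# A random matrix replaces the real projection operator; the paper's
# constrained, under-determined setting is not reproduced here.
import numpy as np

rng = np.random.default_rng(0)
n, m = 64, 256                              # signal dimension, measurement count
A = rng.normal(size=(m, n)) / np.sqrt(m)    # stand-in for the projection operator
x_true = rng.uniform(0.0, 0.1, size=n)      # nonnegative attenuation coefficients
y = np.exp(-A @ x_true)                     # raw intensity measurements (Beer-Lambert)

x = np.zeros(n)
step = 0.3
for _ in range(2000):
    f = np.exp(-A @ x)                      # nonlinear forward model at current iterate
    grad = -A.T @ (f * (f - y))             # gradient of 0.5 * ||exp(-Ax) - y||^2
    x -= step * grad

print("relative error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```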

ILSH: The Imperial Light-Stage Head Dataset for Human Head View Synthesis

  • paper_url: http://arxiv.org/abs/2310.03952
  • repo_url: None
  • paper_authors: Jiali Zheng, Youngkyoon Jang, Athanasios Papaioannou, Christos Kampouris, Rolandos Alexandros Potamias, Foivos Paraperas Papantoniou, Efstathios Galanakis, Ales Leonardis, Stefanos Zafeiriou
  • for: This paper introduces the Imperial Light-Stage Head (ILSH) dataset, a novel light-stage-captured human head dataset designed to support view-synthesis academic challenges for human heads.
  • methods: A purpose-built light stage is used to capture high-resolution (4K) human head images, and the paper describes how challenges in collecting high-quality data (preprocessing, ethical issues) were addressed, along with the train/validation/test split.
  • results: The dataset contains 1,248 close-up head images, border masks, and camera-pose pairs from 52 subjects, captured with 24 cameras and all 82 lighting sources turned on.
    Abstract This paper introduces the Imperial Light-Stage Head (ILSH) dataset, a novel light-stage-captured human head dataset designed to support view synthesis academic challenges for human heads. The ILSH dataset is intended to facilitate diverse approaches, such as scene-specific or generic neural rendering, multiple-view geometry, 3D vision, and computer graphics, to further advance the development of photo-realistic human avatars. This paper details the setup of a light-stage specifically designed to capture high-resolution (4K) human head images and describes the process of addressing challenges (preprocessing, ethical issues) in collecting high-quality data. In addition to the data collection, we address the split of the dataset into train, validation, and test sets. Our goal is to design and support a fair view synthesis challenge task for this novel dataset, such that a similar level of performance can be maintained and expected when using the test set, as when using the validation set. The ILSH dataset consists of 52 subjects captured using 24 cameras with all 82 lighting sources turned on, resulting in a total of 1,248 close-up head images, border masks, and camera pose pairs.
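A hypothetical sketch of how one record of the dataset might be represented in code, pairing each image with its border mask and camera pose; the field names, shapes, and layout below are assumptions for illustration, not the official release format.

```python
# Assumed record structure for an ILSH-style sample; consult the official
# release for the real file format and conventions.
from dataclasses import dataclass
import numpy as np

@dataclass
class HeadViewSample:
    subject_id: int          # 0..51 (52 subjects)
    camera_id: int           # 0..23 (24 cameras)
    image: np.ndarray        # (H, W, 3) uint8, 4K close-up head image
    border_mask: np.ndarray  # (H, W) bool, foreground/border mask
    cam_to_world: np.ndarray # (4, 4) float camera pose

# 52 subjects x 24 cameras = 1,248 image/mask/pose triples in total.
assert 52 * 24 == 1248
```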