methods: The paper introduces a new learning framework, WaH-NeRF, to address the "CONFUSION" problem in NeRF-S. The framework comprises a Deformable Sampling strategy, a Weight-based Mutual Information Loss, a semi-supervised learning paradigm, and a Pixel-Patch Correspondence Loss.
results: Experimental results show that WaH-NeRF performs strongly under the NeRF-S setting, outperforming previous methods.
Abstract
Neural Radiance Fields from Sparse Inputs (NeRF-S) have shown great potential in synthesizing novel views with a limited number of observed viewpoints. However, due to the inherent limitations of sparse inputs and the gap between non-adjacent views, rendering results often suffer from over-fitting and foggy surfaces, a phenomenon we refer to as "CONFUSION" during volume rendering. In this paper, we analyze the root cause of this confusion and attribute it to two fundamental questions: "WHERE" and "HOW". To this end, we present a novel learning framework, WaH-NeRF, which effectively mitigates confusion by tackling the following challenges: (i) "WHERE" to Sample? in NeRF-S -- we introduce a Deformable Sampling strategy and a Weight-based Mutual Information Loss to address sample-position confusion arising from the limited number of viewpoints; and (ii) "HOW" to Predict? in NeRF-S -- we propose a Semi-Supervised NeRF learning Paradigm based on pose perturbation and a Pixel-Patch Correspondence Loss to alleviate prediction confusion caused by the disparity between training and testing viewpoints. By integrating our proposed modules and loss functions, WaH-NeRF outperforms previous methods under the NeRF-S setting. Code is available at https://github.com/bbbbby-99/WaH-NeRF.
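To make the rendering-weight view of this "confusion" concrete, the sketch below computes the standard NeRF volume-rendering weights that a weight-based loss such as the paper's Weight-based Mutual Information Loss would operate on. The weight formula is standard NeRF; the paper's actual loss and sampling modules are not reproduced here.

```python
import torch

def rendering_weights(sigma: torch.Tensor, deltas: torch.Tensor) -> torch.Tensor:
    """Standard NeRF volume-rendering weights w_i = T_i * (1 - exp(-sigma_i * delta_i)).

    sigma:  (n_rays, n_samples) predicted densities along each ray
    deltas: (n_rays, n_samples) distances between adjacent samples
    """
    alpha = 1.0 - torch.exp(-sigma * deltas)              # opacity of each segment
    # Transmittance T_i = prod_{j<i} (1 - alpha_j), via an exclusive cumprod.
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)
    trans = torch.cat([torch.ones_like(trans[..., :1]), trans[..., :-1]], dim=-1)
    return trans * alpha                                  # (n_rays, n_samples)

# With sparse views, these weight distributions spread over many depths ("foggy
# surfaces"); a sharp, unimodal distribution indicates a confident surface.
```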
for: Improving Scene Text Editing (STE) by addressing the weaknesses of existing text-editing methods, including complex backgrounds, diverse font styles, and varying word lengths within the text.
methods: Proposes a novel font-agnostic Scene Text Editing framework, named FAST, that simultaneously generates text in arbitrary styles and locations while preserving a natural and realistic appearance through combined mask generation and style transfer. Unlike conventional methods that directly modify all image pixels, it introduces a filtering mechanism to remove background distractions, letting the network focus on the text regions that require editing, and a text-style transfer module to handle varying word lengths.
results: Compared with conventional methods, the proposed approach shows significant improvements both qualitatively and quantitatively, editing text in images more effectively.
Abstract
Scene Text Editing (STE) is a challenging research problem that aims to modify existing texts in an image while preserving the background and the font style of the original text. Due to its various real-life applications, researchers have explored several approaches toward STE in recent years. However, most existing STE methods show inferior editing performance because of (1) complex image backgrounds, (2) various font styles, and (3) varying word lengths within the text. To address these issues, in this paper, we propose a novel font-agnostic scene text editing framework, named FAST, for simultaneously generating text in arbitrary styles and locations while preserving a natural and realistic appearance through combined mask generation and style transfer. The proposed approach differs from existing methods, which directly modify all image pixels. Instead, the proposed method introduces a filtering mechanism to remove background distractions, allowing the network to focus solely on the text regions where editing is required. Additionally, a text-style transfer module has been designed to mitigate the challenges posed by varying word lengths. Extensive experiments and ablations have been conducted, and the results demonstrate that the proposed method outperforms the existing methods both qualitatively and quantitatively.
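As an illustration of the filtering idea, here is a minimal sketch of a reconstruction loss restricted to the text-region mask, so background pixels contribute nothing to training. The function name and normalization are illustrative assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def masked_text_loss(pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """L1 reconstruction loss restricted to the text-region mask.

    pred, target: (B, 3, H, W) edited and ground-truth images
    mask:         (B, 1, H, W) soft text-region mask in [0, 1]
    """
    diff = F.l1_loss(pred, target, reduction="none")
    # Normalize by mask area so images with small text regions are not under-weighted.
    return (diff * mask).sum() / (mask.sum() * pred.shape[1] + 1e-6)
```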
An Adaptive Model Ensemble Adversarial Attack for Boosting Adversarial Transferability
results: On multiple datasets, the proposed AdaEA achieves significant improvements over existing ensemble attacks and can further boost existing transfer-based attacks, demonstrating its efficacy and versatility.
Abstract
Because the transferability property of adversarial examples allows the adversary to perform black-box attacks (i.e., the attacker has no knowledge about the target model), transfer-based adversarial attacks have gained great attention. Previous works mostly study gradient variation or image transformations to amplify the distortion on critical parts of inputs. These methods can work on transferring across models with limited differences, i.e., from CNNs to CNNs, but always fail in transferring across models with wide differences, such as from CNNs to ViTs. Alternatively, model ensemble adversarial attacks are proposed to fuse outputs from surrogate models with diverse architectures to get an ensemble loss, making the generated adversarial example more likely to transfer to other models as it can fool multiple models concurrently. However, existing ensemble attacks simply fuse the outputs of the surrogate models evenly, and thus fail to capture and amplify the intrinsic transfer information of adversarial examples. In this paper, we propose an adaptive ensemble attack, dubbed AdaEA, to adaptively control the fusion of the outputs from each model, via monitoring the discrepancy ratio of their contributions towards the adversarial objective. Furthermore, an extra disparity-reduced filter is introduced to further synchronize the update direction. As a result, we achieve considerable improvement over the existing ensemble attacks on various datasets, and the proposed AdaEA can also boost existing transfer-based attacks, which further demonstrates its efficacy and versatility.
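A minimal sketch of the adaptive-fusion idea follows: surrogate losses are combined with data-dependent weights rather than a uniform average. The softmax weighting shown is an assumption standing in for AdaEA's discrepancy-ratio monitoring and disparity-reduced filter, which are not reproduced here.

```python
import torch

def adaptive_ensemble_loss(logits_list, y, tau=1.0):
    """Fuse surrogate losses with adaptive weights instead of a uniform average.

    logits_list: list of (B, C) logits, one per surrogate model
    y:           (B,) ground-truth labels; the attack maximizes this loss
    """
    ce = torch.nn.functional.cross_entropy
    losses = torch.stack([ce(logits, y) for logits in logits_list])  # (M,)
    # Models with low loss (not yet fooled) contribute more to the update.
    weights = torch.softmax(-losses.detach() / tau, dim=0)
    return (weights * losses).sum()
```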
Cross-modal & Cross-domain Learning for Unsupervised LiDAR Semantic Segmentation
methods: The paper proposes a new 3DLSS setting in which a 2D dataset (source) with semantic annotations and paired but unannotated 2D images and 3D LiDAR data (target) are available. To achieve 3DLSS in this scenario, it proposes Cross-Modal and Cross-Domain Learning (CoMoDaL), which aims to model 1) inter-modal cross-domain distillation between the unpaired source 2D images and target 3D LiDAR data, and 2) intra-domain cross-modal guidance between the target 2D image and 3D LiDAR data pair.
results: CoMoDaL achieves 3DLSS without any annotated LiDAR data. Experiments are conducted on several datasets, with ablations providing further analysis.
Abstract
In recent years, cross-modal domain adaptation has been studied on the paired 2D image and 3D LiDAR data to ease the labeling costs for 3D LiDAR semantic segmentation (3DLSS) in the target domain. However, in such a setting the paired 2D and 3D data in the source domain are still collected with additional effort. Since the 2D-3D projections can enable the 3D model to learn semantic information from the 2D counterpart, we ask whether we could further remove the need of source 3D data and only rely on the source 2D images. To answer it, this paper studies a new 3DLSS setting where a 2D dataset (source) with semantic annotations and a paired but unannotated 2D image and 3D LiDAR data (target) are available. To achieve 3DLSS in this scenario, we propose Cross-Modal and Cross-Domain Learning (CoMoDaL). Specifically, our CoMoDaL aims at modeling 1) inter-modal cross-domain distillation between the unpaired source 2D image and target 3D LiDAR data, and 2) the intra-domain cross-modal guidance between the target 2D image and 3D LiDAR data pair. In CoMoDaL, we propose to apply several constraints, such as point-to-pixel and prototype-to-pixel alignments, to associate the semantics in different modalities and domains by constructing mixed samples in two modalities. The experimental results on several datasets show that in the proposed setting, the developed CoMoDaL can achieve segmentation without the supervision of labeled LiDAR data. Ablations are also conducted to provide more analysis. Code will be available publicly.
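The point-to-pixel alignment relies on projecting LiDAR points into the paired image; a standard pinhole-projection sketch is shown below. It assumes points are already in the camera frame, and it is only the geometric prerequisite, not CoMoDaL's full constraint set.

```python
import numpy as np

def project_points_to_pixels(points_cam: np.ndarray, K: np.ndarray, h: int, w: int):
    """Pinhole projection of LiDAR points (already in camera frame) to pixels.

    points_cam: (N, 3) points in camera coordinates
    K:          (3, 3) camera intrinsics
    Returns integer pixel coordinates and a validity mask; per-point 3D predictions
    can then be aligned with 2D logits sampled at these pixels.
    """
    in_front = points_cam[:, 2] > 1e-6
    uvw = (K @ points_cam.T).T                          # (N, 3)
    uv = uvw[:, :2] / np.maximum(uvw[:, 2:3], 1e-6)     # perspective divide
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    valid = in_front & inside
    return np.floor(uv).astype(np.int64), valid
```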
Sketch and Text Guided Diffusion Model for Colored Point Cloud Generation
results: Experimental results show that the model performs strongly on the colored point cloud generation task, outperforming previous state-of-the-art models.
Abstract
Diffusion probabilistic models have achieved remarkable success in text guided image generation. However, generating 3D shapes is still challenging due to the lack of sufficient data containing 3D models along with their descriptions. Moreover, text based descriptions of 3D shapes are inherently ambiguous and lack details. In this paper, we propose a sketch and text guided probabilistic diffusion model for colored point cloud generation that conditions the denoising process jointly with a hand drawn sketch of the object and its textual description. We incrementally diffuse the point coordinates and color values in a joint diffusion process to reach a Gaussian distribution. Colored point cloud generation thus amounts to learning the reverse diffusion process, conditioned by the sketch and text, to iteratively recover the desired shape and color. Specifically, to learn effective sketch-text embedding, our model adaptively aggregates the joint embedding of text prompt and the sketch based on a capsule attention network. Our model uses staged diffusion to generate the shape and then assign colors to different parts conditioned on the appearance prompt while preserving precise shapes from the first stage. This gives our model the flexibility to extend to multiple tasks, such as appearance re-editing and part segmentation. Experimental results demonstrate that our model outperforms recent state-of-the-art in point cloud generation.
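The joint diffusion of coordinates and colors can be sketched with the standard DDPM forward process applied to xyz and rgb concatenated per point; the paper's exact noise schedule and staged variant are left out here.

```python
import torch

def diffuse_colored_points(x0: torch.Tensor, t: torch.Tensor, alphas_cumprod: torch.Tensor):
    """Forward diffusion q(x_t | x_0) applied jointly to coordinates and colors.

    x0:             (B, N, 6) clean point clouds: xyz concatenated with rgb
    t:              (B,) integer timesteps
    alphas_cumprod: (T,) cumulative products of the noise schedule
    """
    a_bar = alphas_cumprod[t].view(-1, 1, 1)                 # (B, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise   # standard DDPM forward step
    return x_t, noise                                        # noise is the denoising target
```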
NP-SemiSeg: When Neural Processes meet Semi-Supervised Semantic Segmentation
results: Experiments show that NP-SemiSeg is effective on the public benchmarks PASCAL VOC 2012 and Cityscapes, achieving high accuracy and reliability under different training settings.
Abstract
Semi-supervised semantic segmentation involves assigning pixel-wise labels to unlabeled images at training time. This is useful in a wide range of real-world applications where collecting pixel-wise labels is not feasible in time or cost. Current approaches to semi-supervised semantic segmentation work by predicting pseudo-labels for each pixel from a class-wise probability distribution output by a model. If the predicted probability distribution is incorrect, however, this leads to poor segmentation results, which can have knock-on consequences in safety critical systems, like medical images or self-driving cars. It is, therefore, important to understand what a model does not know, which is mainly achieved by uncertainty quantification. Recently, neural processes (NPs) have been explored in semi-supervised image classification, and they have been a computationally efficient and effective method for uncertainty quantification. In this work, we move one step forward by adapting NPs to semi-supervised semantic segmentation, resulting in a new model called NP-SemiSeg. We experimentally evaluated NP-SemiSeg on the public benchmarks PASCAL VOC 2012 and Cityscapes, with different training settings, and the results verify its effectiveness.
Improving Generalization of Image Captioning with Unsupervised Prompt Learning
results: By constraining the prompt vector with attribute and semantic consistency, the method learns domain-specific knowledge and improves the generalization of image captioning.
Abstract
Pretrained visual-language models have demonstrated impressive zero-shot abilities in image captioning when accompanied by hand-crafted prompts, which use human prior knowledge to guide the model. However, due to the diversity between different domains, such hand-crafted prompts that provide invariant prior knowledge may result in mode collapse for some domains. Some studies have attempted to incorporate expert knowledge and instruction datasets, but the results were costly and led to hallucinations. In this paper, we propose an unsupervised prompt learning method to improve Generalization of Image Captioning (GeneIC), which learns a domain-specific prompt vector for the target domain without requiring annotated data. GeneIC aligns visual and language modalities with a pre-trained Contrastive Language-Image Pre-Training (CLIP) model, thus optimizing the domain-specific prompt vector from two aspects: attribute and semantic consistency. Specifically, GeneIC first generates attribute-transferred images with differing attributes, while retaining semantic similarity with original images. Then, GeneIC uses CLIP to measure the similarity between the images and the generated sentences. By exploring the variable and invariant features in the original images and attribute-transferred images, attribute consistency constrains the attribute change direction of both images and sentences to learn domain-specific knowledge. The semantic consistency directly measures the similarity between the generated sentences and images to ensure the accuracy and comprehensiveness of the generated sentences. Consequently, GeneIC only optimizes the prompt vectors, which effectively retains the knowledge in the large model and introduces domain-specific knowledge.
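A minimal sketch of the semantic-consistency term: with the CLIP encoders frozen, only the prompt embeddings receive gradients from an image-text similarity loss. The `text_encoder` callable is a hypothetical stand-in for the captioner plus the CLIP text tower, and the attribute-consistency term is omitted.

```python
import torch

def semantic_consistency_loss(prompt, image_features, text_encoder):
    """Semantic-consistency term: maximize CLIP-space similarity between target-domain
    image features and sentence features produced under the prompt.

    prompt:         (L, D) learnable prompt embeddings (requires_grad=True)
    image_features: (B, D) frozen CLIP image features
    text_encoder:   callable mapping prompt -> (B, D) sentence features
                    (hypothetical stand-in for the frozen captioner + CLIP text encoder)
    """
    text_features = text_encoder(prompt)   # gradients flow only into the prompt
    sim = torch.nn.functional.cosine_similarity(image_features, text_features, dim=-1)
    return (1.0 - sim).mean()
```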
Generative Adversarial Networks for Stain Normalisation in Histopathology
results: Researchers are still seeking a method that effectively and efficiently normalises pathology images, to make AI models more robust and generalisable.
Abstract
The rapid growth of digital pathology in recent years has provided an ideal opportunity for the development of artificial intelligence-based tools to improve the accuracy and efficiency of clinical diagnoses. One of the significant roadblocks to current research is the high level of visual variability across digital pathology images, causing models to generalise poorly to unseen data. Stain normalisation aims to standardise the visual profile of digital pathology images without changing the structural content of the images. In this chapter, we explore different techniques which have been used for stain normalisation in digital pathology, with a focus on approaches which utilise generative adversarial networks (GANs). Typically, GAN-based methods outperform non-generative approaches but at the cost of much greater computational requirements. However, it is not clear which method is best for stain normalisation in general, with different GAN and non-GAN approaches outperforming each other in different scenarios and according to different performance metrics. This is an ongoing field of study as researchers aim to identify a method which efficiently and effectively normalises pathology images to make AI models more robust and generalisable.
Flashlight Search Medial Axis: A Pixel-Free Pore-Network Extraction Algorithm
results: The study shows that the FSMA algorithm performs well in both two- and three-dimensional porous media, regardless of the topological structure of the pore network or the positions of the pore and throat centers. The algorithm also handles both closed- and open-boundary cases. Most importantly, FSMA can search dead-end pores, which is of great significance in the study of multiphase flow in porous media.
Abstract
Pore-network models (PNMs) have become an important tool in the study of fluid flow in porous media over the last few decades, and the accuracy of their results highly depends on the extraction of pore networks. Traditional methods of pore-network extraction are based on pixels and require images with high quality. Here, a pixel-free method called the flashlight search medial axis (FSMA) algorithm is proposed for pore-network extraction in a continuous space. The search domain in a two-dimensional space is a line, whereas a surface domain is searched in a three-dimensional scenario. Thus, the FSMA algorithm follows the dimensionality reduction idea; the medial axis can be identified using only a few points instead of calculating every point in the void space. In this way, computational complexity of this method is greatly reduced compared to that of traditional pixel-based extraction methods, thus enabling large-scale pore-network extraction. Based on cases featuring two- and three-dimensional porous media, the FSMA algorithm performs well regardless of the topological structure of the pore network or the positions of the pore and throat centers. This algorithm can also be used to examine both closed- and open-boundary cases. Finally, the FSMA algorithm can search dead-end pores, which is of great significance in the study of multiphase flow in porous media.
Landmark Detection using Transformer Toward Robot-assisted Nasal Airway Intubation
results: Experimental results show that our solution achieves competitive detection accuracy and can help improve the accuracy and efficiency of robot-assisted nasal intubation.
Abstract
Robot-assisted airway intubation application needs high accuracy in locating targets and organs. Two vital landmarks, nostrils and glottis, can be detected during the intubation to accommodate the stages of nasal intubation. Automated landmark detection can provide accurate localization and quantitative evaluation. The Detection Transformer (DeTR) leads object detectors to a new paradigm with long-range dependence. However, current DeTR requires long iterations to converge, and does not perform well in detecting small objects. This paper proposes a transformer-based landmark detection solution with deformable DeTR and the semantic-aligned-matching module for detecting landmarks in robot-assisted intubation. The semantics aligner can effectively align the semantics of object queries and image features in the same embedding space using the most discriminative features. To evaluate the performance of our solution, we utilize a publicly accessible glottis dataset and automatically annotate a nostril detection dataset. The experimental results demonstrate our competitive performance in detection accuracy. Our code is publicly accessible.
Learning Unified Decompositional and Compositional NeRF for Editable Novel View Synthesis
for: This paper targets joint scene novel view synthesis and editing based on implicit neural scene representations, with a unified Neural Radiance Field (NeRF) framework that effectively performs scene decomposition and composition.
results: The method effectively performs scene decomposition and composition, and outperforms state-of-the-art methods on both novel-view synthesis and editing tasks.
Abstract
Implicit neural representations have shown powerful capacity in modeling real-world 3D scenes, offering superior performance in novel view synthesis. In this paper, we target a more challenging scenario, i.e., joint scene novel view synthesis and editing based on implicit neural scene representations. State-of-the-art methods in this direction typically consider building separate networks for these two tasks (i.e., view synthesis and editing). Thus, the modeling of interactions and correlations between these two tasks is very limited, which, however, is critical for learning high-quality scene representations. To tackle this problem, in this paper, we propose a unified Neural Radiance Field (NeRF) framework to effectively perform joint scene decomposition and composition for modeling real-world scenes. The decomposition aims at learning disentangled 3D representations of different objects and the background, allowing for scene editing, while scene composition models an entire scene representation for novel view synthesis. Specifically, with a two-stage NeRF framework, we learn a coarse stage for predicting a global radiance field as guidance for point sampling, and in the second fine-grained stage, we perform scene decomposition by a novel one-hot object radiance field regularization module and a pseudo supervision via inpainting to handle ambiguous background regions occluded by objects. The decomposed object-level radiance fields are further composed by using activations from the decomposition module. Extensive quantitative and qualitative results show the effectiveness of our method for scene decomposition and composition, outperforming state-of-the-art methods for both novel-view synthesis and editing tasks.
A Comprehensive Analysis of Real-World Image Captioning and Scene Identification
paper_authors: Sai Suprabhanu Nallapaneni, Subrahmanyam Konakanchi
for: This study aims to evaluate the performance of various image captioning models in real-world environments, where images are often of poor quality and contain numerous points of attention.
results: The study finds that using the IC3 approach to generate more fine-grained captions improves the accuracy and completeness of image descriptions.
Abstract
Image captioning is a computer vision task that involves generating natural language descriptions for images. This method has numerous applications in various domains, including image retrieval systems, medicine, and various industries. However, while there has been significant research in image captioning, most studies have focused on high quality images or controlled environments, without exploring the challenges of real-world image captioning. Real-world image captioning involves complex and dynamic environments with numerous points of attention, with images which are often very poor in quality, making it a challenging task, even for humans. This paper evaluates the performance of various models that are built on top of different encoding mechanisms, language decoders and training procedures using a newly created real-world dataset that consists of over 800+ images of over 65 different scene classes, built using MIT Indoor scenes dataset. This dataset is captioned using the IC3 approach that generates more descriptive captions by summarizing the details that are covered by standard image captioning models from unique view-points of the image.
SwinGar: Spectrum-Inspired Neural Dynamic Deformation for Free-Swinging Garments
results: Experimental results demonstrate that our method achieves remarkable performance on a variety of free-swinging garments and outperforms existing methods.
Abstract
Our work presents a novel spectrum-inspired learning-based approach for generating clothing deformations with dynamic effects and personalized details. Existing methods in the field of clothing animation are limited to either static behavior or specific network models for individual garments, which hinders their applicability in real-world scenarios where diverse animated garments are required. Our proposed method overcomes these limitations by providing a unified framework that predicts dynamic behavior for different garments with arbitrary topology and looseness, resulting in versatile and realistic deformations. First, we observe that the problem of bias towards low frequency always hampers supervised learning and leads to overly smooth deformations. To address this issue, we introduce a frequency-control strategy from a spectral perspective that enhances the generation of high-frequency details of the deformation. In addition, to make the network highly generalizable and able to learn various clothing deformations effectively, we propose a spectral descriptor to achieve a generalized description of the global shape information. Building on the above strategies, we develop a dynamic clothing deformation estimator that integrates frequency-controllable attention mechanisms with long short-term memory. The estimator takes as input expressive features from garments and human bodies, allowing it to automatically output continuous deformations for diverse clothing types, independent of mesh topology or vertex count. Finally, we present a neural collision handling method to further enhance the realism of garments. Our experimental results demonstrate the effectiveness of our approach on a variety of free-swinging garments and its superiority over state-of-the-art methods.
paper_authors: Linfeng Tan, Jiangtong Li, Li Niu, Liqing Zhang
for: This paper focuses on addressing the issue of image inconsistency in image composition, specifically in the context of foreground and background.
methods: The proposed method utilizes dual color spaces, specifically $RGB$ and $Lab$, to decorrelate the color and illumination features in the image. The method consists of a $RGB$ harmonization backbone, an $Lab$ encoding module, and an $Lab$ control module.
results: The proposed method alleviates the workload in the harmonization process and provides disentangled color and illumination statistics, leading to improved image harmonization results.
Abstract
Image harmonization is an essential step in image composition that adjusts the appearance of composite foreground to address the inconsistency between foreground and background. Existing methods primarily operate in correlated $RGB$ color space, leading to entangled features and limited representation ability. In contrast, decorrelated color space (e.g., $Lab$) has decorrelated channels that provide disentangled color and illumination statistics. In this paper, we explore image harmonization in dual color spaces, which supplements entangled $RGB$ features with disentangled $L$, $a$, $b$ features to alleviate the workload in harmonization process. The network comprises a $RGB$ harmonization backbone, an $Lab$ encoding module, and an $Lab$ control module. The backbone is a U-Net network translating composite image to harmonized image. Three encoders in $Lab$ encoding module extract three control codes independently from $L$, $a$, $b$ channels, which are used to manipulate the decoder features in harmonization backbone via $Lab$ control module. Our code and model are available at \href{https://github.com/bcmi/DucoNet-Image-Harmonization}{https://github.com/bcmi/DucoNet-Image-Harmonization}.
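To see why the decorrelated space helps, the sketch below splits an image into L, a, b channels and reads off per-channel statistics; illumination lives in L while chromaticity lives in a/b. DucoNet's control codes come from learned encoders, not these hand-crafted statistics, so this is purely illustrative.

```python
import numpy as np
from skimage.color import rgb2lab

def lab_channel_statistics(image_rgb: np.ndarray):
    """Split an RGB image into decorrelated L, a, b channels and compute per-channel
    mean/std. L carries illumination; a and b carry chromaticity.
    """
    lab = rgb2lab(image_rgb)          # skimage converts uint8/float RGB internally
    stats = {}
    for i, name in enumerate("Lab"):
        channel = lab[..., i]
        stats[name] = (channel.mean(), channel.std())
    return stats
```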
paper_authors: Praneeth Chakravarthula, Jipeng Sun, Xiao Li, Chenyang Lei, Gene Chou, Mario Bijelic, Johannes Froesch, Arka Majumdar, Felix Heide
for: The paper is written for exploring an alternative to traditional compound optics in commodity camera systems, using flat nanophotonic computational cameras that employ an array of skewed lenslets and a learned reconstruction approach.
methods: The paper proposes a differentiable optimization method that continuously samples over the visible spectrum and factorizes the optical modulation for different incident fields into individual lenses. The authors also use a generative diffusion model for probabilistic reconstruction.
results: The paper demonstrates the ability to recover images from diverse scenes in broadband with a single nanophotonic layer, using both simulation and an experimental prototype. The proposed flat camera design is capable of achieving high-quality image reconstruction with a flat and thin metasurface.
Abstract
Today's commodity camera systems rely on compound optics to map light originating from the scene to positions on the sensor where it gets recorded as an image. To record images without optical aberrations, i.e., deviations from Gauss' linear model of optics, typical lens systems introduce increasingly complex stacks of optical elements which are responsible for the height of existing commodity cameras. In this work, we investigate flat nanophotonic computational cameras as an alternative that employs an array of skewed lenslets and a learned reconstruction approach. The optical array is embedded on a metasurface that, at 700~nm height, is flat and sits on the sensor cover glass at 2.5~mm focal distance from the sensor. To tackle the highly chromatic response of a metasurface and design the array over the entire sensor, we propose a differentiable optimization method that continuously samples over the visible spectrum and factorizes the optical modulation for different incident fields into individual lenses. We reconstruct a megapixel image from our flat imager with a learned probabilistic reconstruction method that employs a generative diffusion model to sample an implicit prior. To tackle scene-dependent aberrations in broadband, we propose a method for acquiring paired captured training data in varying illumination conditions. We assess the proposed flat camera design in simulation and with an experimental prototype, validating that the method is capable of recovering images from diverse scenes in broadband with a single nanophotonic layer.
Unfolding Once is Enough: A Deployment-Friendly Transformer Unit for Super-Resolution
results: The model achieves efficient performance on both training and deployment platforms, with experiments demonstrating its feasibility and efficiency.
Abstract
Recent years have witnessed a few attempts of vision transformers for single image super-resolution (SISR). Since the high resolution of intermediate features in SISR models increases memory and computational requirements, efficient SISR transformers are more favored. Based on some popular transformer backbone, many methods have explored reasonable schemes to reduce the computational complexity of the self-attention module while achieving impressive performance. However, these methods only focus on the performance on the training platform (e.g., Pytorch/Tensorflow) without further optimization for the deployment platform (e.g., TensorRT). Therefore, they inevitably contain some redundant operators, posing challenges for subsequent deployment in real-world applications. In this paper, we propose a deployment-friendly transformer unit, namely UFONE (i.e., UnFolding ONce is Enough), to alleviate these problems. In each UFONE, we introduce an Inner-patch Transformer Layer (ITL) to efficiently reconstruct the local structural information from patches and a Spatial-Aware Layer (SAL) to exploit the long-range dependencies between patches. Based on UFONE, we propose a Deployment-friendly Inner-patch Transformer Network (DITN) for the SISR task, which can achieve favorable performance with low latency and memory usage on both training and deployment platforms. Furthermore, to further boost the deployment efficiency of the proposed DITN on TensorRT, we also provide an efficient substitution for layer normalization and propose a fusion optimization strategy for specific operators. Extensive experiments show that our models can achieve competitive results in terms of qualitative and quantitative performance with high deployment efficiency. Code is available at \url{https://github.com/yongliuy/DITN}.
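The paper's TensorRT substitution for layer normalization is not spelled out in the abstract; as a generic illustration, layer norm over the channel dimension can be written as explicit mean/variance ops that inference compilers fuse readily. DITN's actual substitution may differ.

```python
import torch
import torch.nn as nn

class ExplicitLayerNorm(nn.Module):
    """LayerNorm over the channel dim written as explicit elementwise/reduce ops.

    Expressing LN this way (instead of nn.LayerNorm on permuted tensors) keeps the
    graph in simple ops that deployment engines such as TensorRT fuse well.
    """
    def __init__(self, channels: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.bias = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, C, H, W)
        mu = x.mean(dim=1, keepdim=True)
        var = (x - mu).pow(2).mean(dim=1, keepdim=True)
        return (x - mu) / torch.sqrt(var + self.eps) * self.weight + self.bias
```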
Few-shot Class-Incremental Semantic Segmentation via Pseudo-Labeling and Knowledge Distillation
paper_authors: Chengjia Jiang, Tao Wang, Sien Li, Jinyang Wang, Shirui Wang, Antonios Antoniou
for: The paper addresses the problem of learning new classes for semantic segmentation models from few examples, which is challenging because novel data is limited and catastrophic forgetting must be avoided.
methods: The proposed approach uses a pseudo-labeling strategy to augment the few-shot training annotations, transferring knowledge from labeled images to unlabeled images with a coarse-to-fine approach. This includes matching labeled images to their nearest neighbors in the unlabeled image set at the scene level, and obtaining pseudo-labels within this neighborhood using classifiers learned on the few-shot annotations. Knowledge distillation is also used on both labeled and unlabeled data to retain knowledge on existing classes.
results: The proposed approach is validated on the Cityscapes and KITTI datasets in the self-driving domain, with extensive experiments showing its efficacy in learning new classes from few examples while retaining knowledge on existing classes.
Abstract
We address the problem of learning new classes for semantic segmentation models from few examples, which is challenging because of the following two reasons. Firstly, it is difficult to learn from limited novel data to capture the underlying class distribution. Secondly, it is challenging to retain knowledge for existing classes and to avoid catastrophic forgetting. For learning from limited data, we propose a pseudo-labeling strategy to augment the few-shot training annotations in order to learn novel classes more effectively. Given only one or a few images labeled with the novel classes and a much larger set of unlabeled images, we transfer the knowledge from labeled images to unlabeled images with a coarse-to-fine pseudo-labeling approach in two steps. Specifically, we first match each labeled image to its nearest neighbors in the unlabeled image set at the scene level, in order to obtain images with a similar scene layout. This is followed by obtaining pseudo-labels within this neighborhood by applying classifiers learned on the few-shot annotations. In addition, we use knowledge distillation on both labeled and unlabeled data to retain knowledge on existing classes. We integrate the above steps into a single convolutional neural network with a unified learning objective. Extensive experiments on the Cityscapes and KITTI datasets validate the efficacy of the proposed approach in the self-driving domain. Code is available from https://github.com/ChasonJiang/FSCILSS.
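A minimal sketch of the coarse-to-fine pseudo-labeling pipeline, assuming scene-level features are already extracted: match labeled images to unlabeled neighbors by cosine similarity, then threshold classifier confidences into pixel pseudo-labels. The ignore index of 255 is an assumption borrowed from common segmentation practice.

```python
import torch

def scene_level_neighbors(labeled_feats, unlabeled_feats, k=5):
    """Match each labeled image to its k nearest unlabeled images at the scene level.

    labeled_feats:   (L, D) global features of the few-shot labeled images
    unlabeled_feats: (U, D) global features of the unlabeled pool
    Returns (L, k) indices into the unlabeled pool; pseudo-labels are then produced
    inside each neighborhood by classifiers learned on the few-shot annotations.
    """
    labeled = torch.nn.functional.normalize(labeled_feats, dim=-1)
    unlabeled = torch.nn.functional.normalize(unlabeled_feats, dim=-1)
    sim = labeled @ unlabeled.t()                   # cosine similarity (L, U)
    return sim.topk(k, dim=-1).indices

def pseudo_labels(logits: torch.Tensor, threshold: float = 0.8):
    """Confidence-thresholded pixel pseudo-labels from classifier logits (B, C, H, W)."""
    probs = logits.softmax(dim=1)
    conf, labels = probs.max(dim=1)
    labels[conf < threshold] = 255                  # 255 = ignore index (assumption)
    return labels
```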
A Voting-Stacking Ensemble of Inception Networks for Cervical Cytology Classification
paper_authors: Linyi Qian, Qian Huang, Yulin Chen, Junzhou Chen
for: This study aims to improve the accuracy of early detection and diagnosis of cervical cancer, reducing its threat to women's health.
methods: The study uses three Inception networks as base learners and integrates their outputs through a voting ensemble. A meta-learner is then trained on the samples misclassified by the voting ensemble, and a multi-level Stacking ensemble framework is designed to further improve performance.
results: Evaluated on the SIPakMed, Herlev, and Mendeley datasets, the method achieves accuracies of 100%, 100%, and 100%, respectively, outperforming current state-of-the-art methods and showing strong potential for reducing screening workload and helping pathologists detect cervical cancer early.
Abstract
Cervical cancer is one of the most severe diseases threatening women's health. Early detection and diagnosis can significantly reduce cancer risk, in which cervical cytology classification is indispensable. Researchers have recently designed many networks for automated cervical cancer diagnosis, but the limited accuracy and bulky size of these individual models cannot meet practical application needs. To address this issue, we propose a Voting-Stacking ensemble strategy, which employs three Inception networks as base learners and integrates their outputs through a voting ensemble. The samples misclassified by the ensemble model generate a new training set on which a linear classification model is trained as the meta-learner and performs the final predictions. In addition, a multi-level Stacking ensemble framework is designed to improve performance further. The method is evaluated on the SIPakMed, Herlev, and Mendeley datasets, achieving accuracies of 100%, 100%, and 100%, respectively. The experimental results outperform the current state-of-the-art (SOTA) methods, demonstrating its potential for reducing screening workload and helping pathologists detect cervical cancer.
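The voting-then-meta-learner recipe can be sketched with scikit-learn; the MLP base learners and synthetic data below are stand-ins for the three Inception networks and cytology images, and the routing of final predictions between the ensemble and the meta-learner is simplified.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Stand-ins for the three Inception base learners and cytology data (hypothetical).
X, y = make_classification(n_samples=600, n_features=32, n_classes=3,
                           n_informative=16, random_state=0)
bases = [(f"incep{i}", MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                                     random_state=i)) for i in range(3)]
vote = VotingClassifier(bases, voting="soft").fit(X, y)

# Train the meta-learner only on samples the voting ensemble misclassifies.
wrong = vote.predict(X) != y
if wrong.any():
    meta = LogisticRegression(max_iter=1000).fit(X[wrong], y[wrong])
```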
Dual Degradation-Inspired Deep Unfolding Network for Low-Light Image Enhancement
paper_authors: Huake Wang, Xingsong Hou, Xiaoyang Yan
for: This paper proposes a novel deep learning network, DASUNet, for low-light image enhancement, which explicitly simulates the deterioration mechanism of low-light images and learns two distinct image priors by considering degradation specificity between luminance and chrominance spaces.
methods: The proposed DASUNet uses a dual degradation model (DDM) to simulate the deterioration mechanism of low-light images, and an alternating optimization solution to solve the proposed DDM. Additionally, the network uses a prior modeling module (PMM) to enhance the representation capability of the dual degradation priors, and a space aggregation module (SAM) to boost the interaction of the two degradation models.
results: Extensive experiments on multiple popular low-light image datasets validate the effectiveness of DASUNet compared to canonical state-of-the-art low-light image enhancement methods.
Abstract
Although low-light image enhancement has made great strides with deep enhancement models, most of them mainly stress enhancement performance via elaborate black-box networks and rarely explore the physical significance of enhancement models. Towards this issue, we propose a Dual degrAdation-inSpired deep Unfolding network, termed DASUNet, for low-light image enhancement. Specifically, we construct a dual degradation model (DDM) to explicitly simulate the deterioration mechanism of low-light images. It learns two distinct image priors via considering degradation specificity between luminance and chrominance spaces. To make the proposed scheme tractable, we design an alternating optimization solution to solve the proposed DDM. Further, the designed solution is unfolded into a specified deep network, imitating the iteration updating rules, to form DASUNet. Local and long-range information are obtained by a prior modeling module (PMM), inheriting the advantages of convolution and Transformer, to enhance the representation capability of dual degradation priors. Additionally, a space aggregation module (SAM) is presented to boost the interaction of two degradation models. Extensive experiments on multiple popular low-light image datasets validate the effectiveness of DASUNet compared to canonical state-of-the-art low-light image enhancement methods. Our source code and pretrained model will be publicly available.
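A generic deep-unfolding skeleton in the spirit described: each stage takes a gradient step on a data term and then applies a learned prior module. The identity degradation and placeholder CNN prior are assumptions; DASUNet's dual-degradation stages in luminance/chrominance spaces are more elaborate.

```python
import torch
import torch.nn as nn

class UnfoldingStage(nn.Module):
    """One unfolded iteration: a gradient step on a data term, then a learned prior."""
    def __init__(self, channels=3):
        super().__init__()
        self.step = nn.Parameter(torch.tensor(0.5))          # learnable step size
        self.prior = nn.Sequential(                          # placeholder prior module
            nn.Conv2d(channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels, 3, padding=1))

    def forward(self, x, y):
        # Gradient step on 0.5 * ||x - y||^2 for a simplified degradation x ≈ y.
        x = x - self.step * (x - y)
        return x + self.prior(x)                             # residual prior refinement

class UnfoldedNet(nn.Module):
    def __init__(self, stages=4):
        super().__init__()
        self.stages = nn.ModuleList(UnfoldingStage() for _ in range(stages))

    def forward(self, y):
        x = y
        for stage in self.stages:                            # imitates iterative updates
            x = stage(x, y)
        return x
```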
One-stage Low-resolution Text Recognition with High-resolution Knowledge Transfer
results: Extensive experiments on various low-resolution text datasets show that the proposed method effectively improves the accuracy and efficiency of low-resolution text recognition, and is more efficient and robust than the traditional two-stage framework.
Abstract
Recognizing characters from low-resolution (LR) text images poses a significant challenge due to the information deficiency as well as the noise and blur in low-quality images. Current solutions for low-resolution text recognition (LTR) typically rely on a two-stage pipeline that involves super-resolution as the first stage followed by the second-stage recognition. Although this pipeline is straightforward and intuitive, it has to use an additional super-resolution network, which causes inefficiencies during training and testing. Moreover, the recognition accuracy of the second stage heavily depends on the reconstruction quality of the first stage, causing ineffectiveness. In this work, we attempt to address these challenges from a novel perspective: adapting the recognizer to low-resolution inputs by transferring the knowledge from the high-resolution. Guided by this idea, we propose an efficient and effective knowledge distillation framework to achieve multi-level knowledge transfer. Specifically, the visual focus loss is proposed to extract the character position knowledge with resolution gap reduction and character region focus, the semantic contrastive loss is employed to exploit the contextual semantic knowledge with contrastive learning, and the soft logits loss facilitates both local word-level and global sequence-level learning from the soft teacher label. Extensive experiments show that the proposed one-stage pipeline significantly outperforms super-resolution based two-stage frameworks in terms of effectiveness and efficiency, accompanied by favorable robustness. Code is available at https://github.com/csguoh/KD-LTR.
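The soft logits loss is standard temperature-scaled distillation; a sketch for sequence logits follows, with the visual focus and semantic contrastive losses omitted.

```python
import torch
import torch.nn.functional as F

def soft_logits_loss(student_logits, teacher_logits, T=4.0):
    """Temperature-scaled KL distillation on per-character logits.

    student_logits, teacher_logits: (B, L, C) sequence logits
                                    (LR student, HR teacher)
    """
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits.detach() / T, dim=-1)
    # T^2 keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T
```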
DeDrift: Robust Similarity Search under Content Drift
methods: The study uses nearest neighbor search in embedding space to investigate the impact of content drift on large-scale similarity search tools, and introduces a method called DeDrift that continuously adapts the indexing structures on-the-fly.
results: DeDrift almost eliminates the accuracy degradation caused by query and database content drift while being up to 100x faster than a full index reconstruction.
Abstract
The statistical distribution of content uploaded and searched on media sharing sites changes over time due to seasonal, sociological and technical factors. We investigate the impact of this "content drift" for large-scale similarity search tools, based on nearest neighbor search in embedding space. Unless a costly index reconstruction is performed frequently, content drift degrades the search accuracy and efficiency. The degradation is especially severe since, in general, both the query and database distributions change. We introduce and analyze real-world image and video datasets for which temporal information is available over a long time period. Based on the learnings, we devise DeDrift, a method that updates embedding quantizers to continuously adapt large-scale indexing structures on-the-fly. DeDrift almost eliminates the accuracy degradation due to the query and database content drift while being up to 100x faster than a full index reconstruction.
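As an illustration of on-the-fly index adaptation, the sketch below nudges a coarse quantizer's centroids toward recently observed embeddings with a moving-average update. This is in the spirit of DeDrift but is not its actual update rule.

```python
import numpy as np

def refresh_centroids(centroids: np.ndarray, recent: np.ndarray, lr: float = 0.1):
    """Drift a coarse quantizer's centroids toward recent data without retraining.

    centroids: (K, D) current IVF coarse-quantizer centroids
    recent:    (N, D) embeddings observed in a recent time window
    """
    # Assign each recent vector to its nearest centroid.
    d2 = ((recent[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)  # (N, K)
    assign = d2.argmin(axis=1)
    for k in np.unique(assign):
        # Exponential moving average toward the recent cluster mean.
        centroids[k] = (1 - lr) * centroids[k] + lr * recent[assign == k].mean(axis=0)
    return centroids
```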
Discrimination of Radiologists Utilizing Eye-Tracking Technology and Machine Learning: A Case Study
paper_authors: Stanford Martinez, Carolina Ramirez-Tamayo, Syed Hasib Akhter Faruqui, Kal L. Clark, Adel Alaeddini, Nicholas Czarnek, Aarushi Aggarwal, Sahra Emamzadeh, Jeffrey R. Mock, Edward J. Golob
for: Discriminating radiologists' level of experience.
methods: Uses personalized, high-dimensional visual search patterns and a discretized spatiotemporal encoding of eye-fixation data.
results: Outperforms existing methods and can help identify radiologists' level of expertise and those who would benefit from additional training.
Abstract
Perception-related errors comprise most diagnostic mistakes in radiology. To mitigate this problem, radiologists employ personalized and high-dimensional visual search strategies, otherwise known as search patterns. Qualitative descriptions of these search patterns, which involve the physician verbalizing or annotating the order he/she analyzes the image, can be unreliable due to discrepancies in what is reported versus the actual visual patterns. This discrepancy can interfere with quality improvement interventions and negatively impact patient care. This study presents a novel discretized feature encoding based on spatiotemporal binning of fixation data for efficient geometric alignment and temporal ordering of eye movement when reading chest X-rays. The encoded features of the eye-fixation data are employed by machine learning classifiers to discriminate between faculty and trainee radiologists. We include a clinical trial case study utilizing the Area Under the Curve (AUC), Accuracy, F1, Sensitivity, and Specificity metrics for class separability to evaluate the discriminability between the two subjects in regard to their level of experience. We then compare the classification performance to state-of-the-art methodologies. A repeatability experiment using a separate dataset, experimental protocol, and eye tracker was also performed using eight subjects to evaluate the robustness of the proposed approach. The numerical results from both experiments demonstrate that classifiers employing the proposed feature encoding methods outperform the current state-of-the-art in differentiating between radiologists in terms of experience level. This signifies the potential impact of the proposed method for identifying radiologists' level of expertise and those who would benefit from additional training.
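The discretized encoding can be sketched as a spatiotemporal occupancy histogram over fixations; the grid and time-bin sizes below are assumptions, and the paper's encoding also preserves temporal-ordering information.

```python
import numpy as np

def encode_fixations(fix_xy: np.ndarray, fix_t: np.ndarray,
                     grid=(8, 8), time_bins=4, image_size=(1.0, 1.0)):
    """Discretize fixations into a spatiotemporal occupancy histogram.

    fix_xy: (N, 2) fixation coordinates, normalized to image_size
    fix_t:  (N,) fixation onset times
    Returns a flat (time_bins * grid_h * grid_w,) feature vector for a classifier.
    """
    t_edges = np.linspace(fix_t.min(), fix_t.max() + 1e-9, time_bins + 1)
    hist, _ = np.histogramdd(
        np.column_stack([fix_t, fix_xy[:, 1], fix_xy[:, 0]]),   # (t, y, x) order
        bins=(t_edges,
              np.linspace(0, image_size[1], grid[0] + 1),
              np.linspace(0, image_size[0], grid[1] + 1)))
    return hist.ravel()
```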
Exploring Part-Informed Visual-Language Learning for Person Re-Identification
paper_authors: Yin Lin, Cong Liu, Yehansen Chen, Jinshui Hu, Bing Yin, Baocai Yin, Zengfu Wang
for: This study aims to enhance fine-grained features for visual-language learning-based person re-identification (ReID).
methods: The method proposes part-informed language supervision to enhance fine-grained visual features, combining a human parsing-guided prompt tuning strategy with a hierarchical fusion-based visual-language alignment paradigm to ensure within-part feature semantic consistency.
results: The method achieves substantial improvements on four commonly used ReID benchmarks, notably reporting 90.3% Rank-1 and 76.5% mAP on the MSMT17 dataset.
Abstract
Recently, visual-language learning has shown great potential in enhancing visual-based person re-identification (ReID). Existing visual-language learning-based ReID methods often focus on whole-body scale image-text feature alignment, while neglecting supervision on fine-grained part features. This choice simplifies the learning process but cannot guarantee within-part feature semantic consistency, thus hindering the final performance. Therefore, we propose to enhance fine-grained visual features with part-informed language supervision for ReID tasks. The proposed method, named Part-Informed Visual-language Learning ($\pi$-VL), suggests that (i) a human parsing-guided prompt tuning strategy and (ii) a hierarchical fusion-based visual-language alignment paradigm play essential roles in ensuring within-part feature semantic consistency. Specifically, we combine both identity labels and parsing maps to constitute pixel-level text prompts and fuse multi-stage visual features with a light-weight auxiliary head to perform fine-grained image-text alignment. As a plug-and-play and inference-free solution, our $\pi$-VL achieves substantial improvements over the previous state of the art on four commonly used ReID benchmarks, especially reporting 90.3% Rank-1 and 76.5% mAP for the most challenging MSMT17 database without bells and whistles.
EndoDepthL: Lightweight Endoscopic Monocular Depth Estimation with CNN-Transformer
results: A comprehensive evaluation of monocular depth estimation in endoscopic imaging, compared against existing baseline solutions, shows that EndoDepthL ensures depth estimation accuracy with a lightweight structure.
Abstract
In this study, we address the key challenges concerning the accuracy and effectiveness of depth estimation for endoscopic imaging, with a particular emphasis on real-time inference and the impact of light reflections. We propose a novel lightweight solution named EndoDepthL that integrates Convolutional Neural Networks (CNN) and Transformers to predict multi-scale depth maps. Our approach includes optimizing the network architecture, incorporating multi-scale dilated convolution, and a multi-channel attention mechanism. We also introduce a statistical confidence boundary mask to minimize the impact of reflective areas. To better evaluate the performance of monocular depth estimation in endoscopic imaging, we propose a novel complexity evaluation metric that considers network parameter size, floating-point operations, and inference frames per second. We comprehensively evaluate our proposed method and compare it with existing baseline solutions. The results demonstrate that EndoDepthL ensures depth estimation accuracy with a lightweight structure.
Exploring the Effect of Sparse Recovery on the Quality of Image Superresolution
results: Experiments show that the choice of sparse recovery algorithm affects the quality of image reconstruction.
Abstract
Dictionary learning can be used for image superresolution by learning a pair of coupled dictionaries of image patches from high-resolution and low-resolution image pairs such that the corresponding pairs share the same sparse vector when represented by the coupled dictionaries. These dictionaries can then be used to reconstruct the corresponding high-resolution patches from low-resolution input images based on sparse recovery. The idea is to recover the shared sparse vector using the low-resolution dictionary and then multiply it by the high-resolution dictionary to recover the corresponding high-resolution image patch. In this work, we study the effect of the sparse recovery algorithm that we use on the quality of the reconstructed images. We offer empirical experiments to search for the best sparse recovery algorithm that can be used for this purpose.
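A minimal sketch of the coupled-dictionary reconstruction described above, using orthogonal matching pursuit (one candidate sparse recovery algorithm) from scikit-learn; dictionary learning itself is assumed done offline, and the toy dictionaries here are random placeholders for illustration only.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def upscale_patch(lr_patch, D_low, D_high, sparsity=5):
    """Reconstruct a high-resolution patch from a low-resolution one.

    D_low, D_high: coupled dictionaries (atoms as columns) trained so
    that corresponding patch pairs share one sparse code. We recover
    the sparse code against D_low, then synthesize with D_high.
    """
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=sparsity, fit_intercept=False)
    omp.fit(D_low, lr_patch)      # solve lr_patch ~= D_low @ alpha, alpha sparse
    alpha = omp.coef_
    return D_high @ alpha         # high-res patch from the shared sparse code

# Toy usage with random coupled dictionaries (illustration only)
rng = np.random.default_rng(0)
D_low, D_high = rng.normal(size=(64, 256)), rng.normal(size=(256, 256))
hr = upscale_patch(rng.normal(size=64), D_low, D_high)
```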
EDI: ESKF-based Disjoint Initialization for Visual-Inertial SLAM Systems
results: On the evaluation data, the proposed EDI method achieves accurate visual-inertial initialization within seconds, even in challenging environments and with artificial noise corruption, outperforming other state-of-the-art visual-inertial initialization methods.
Abstract
Visual-inertial initialization can be classified into joint and disjoint approaches. Joint approaches tackle both the visual and the inertial parameters together by aligning observations from feature-bearing points based on IMU integration then use a closed-form solution with visual and acceleration observations to find initial velocity and gravity. In contrast, disjoint approaches independently solve the Structure from Motion (SFM) problem and determine inertial parameters from up-to-scale camera poses obtained from pure monocular SLAM. However, previous disjoint methods have limitations, like assuming negligible acceleration bias impact or accurate rotation estimation by pure monocular SLAM. To address these issues, we propose EDI, a novel approach for fast, accurate, and robust visual-inertial initialization. Our method incorporates an Error-state Kalman Filter (ESKF) to estimate gyroscope bias and correct rotation estimates from monocular SLAM, overcoming dependence on pure monocular SLAM for rotation estimation. To estimate the scale factor without prior information, we offer a closed-form solution for initial velocity, scale, gravity, and acceleration bias estimation. To address gravity and acceleration bias coupling, we introduce weights in the linear least-squares equations, ensuring acceleration bias observability and handling outliers. Extensive evaluation on the EuRoC dataset shows that our method achieves an average scale error of 5.8% in less than 3 seconds, outperforming other state-of-the-art disjoint visual-inertial initialization approaches, even in challenging environments and with artificial noise corruption.
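The closed-form initialization above reduces to a weighted linear least-squares problem over initial velocity, scale, gravity, and accelerometer bias. Below is a generic weighted least-squares sketch; the construction of A and b from IMU preintegration terms and up-to-scale poses follows the paper and is not reproduced here, and the weighting scheme shown is illustrative.

```python
import numpy as np

def weighted_least_squares(A, b, w):
    """Solve min_x || W^(1/2) (A x - b) ||^2 with per-row weights w.

    In an EDI-style initialization, x would stack [v0, s, g, b_a]; the
    weights down-weight rows dominated by the gravity/accelerometer-bias
    coupling and suppress outliers, keeping the bias observable.
    """
    sw = np.sqrt(w)[:, None]
    x, *_ = np.linalg.lstsq(sw * A, np.sqrt(w) * b, rcond=None)
    return x

# Toy example: 100 constraint rows over a 10-dimensional state
rng = np.random.default_rng(1)
A = rng.normal(size=(100, 10))
x_true = rng.normal(size=10)
b = A @ x_true + 0.01 * rng.normal(size=100)
w = np.ones(100); w[::10] = 0.1   # e.g., down-weight suspected outlier rows
print(np.allclose(weighted_least_squares(A, b, w), x_true, atol=0.05))
```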
ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation
results: Experiments show that ReCLIP reduces CLIP's average error rate from 30.17% to 25.06% on 22 image classification benchmarks.
Abstract
Large-scale pre-trained vision-language models such as CLIP have demonstrated outstanding performance in zero-shot classification, e.g. achieving 76.3% top-1 accuracy on ImageNet without seeing any example, which leads to potential benefits for many tasks that have no labeled data. However, while applying CLIP to a downstream target domain, the presence of visual and text domain gaps and cross-modality misalignment can greatly impact the model performance. To address such challenges, we propose ReCLIP, the first source-free domain adaptation method for vision-language models, which does not require any source data or target labeled data. ReCLIP first learns a projection space to mitigate the misaligned visual-text embeddings and learns pseudo labels, and then deploys cross-modality self-training with the pseudo labels to update visual and text encoders, refine labels, and reduce domain gaps and misalignments iteratively. With extensive experiments, we demonstrate that ReCLIP reduces the average error rate of CLIP from 30.17% to 25.06% on 22 image classification benchmarks.
Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP
methods: The model uses a shared frozen convolutional CLIP backbone and bridges image and text features in a shared embedding space to address the challenge of open-vocabulary segmentation.
results: Tested in a zero-shot manner, FC-CLIP achieves 26.8 PQ, 16.8 AP, and 34.1 mIoU on ADE20K, 18.2 PQ and 27.9 mIoU on Mapillary Vistas, and 44.0 PQ, 26.8 AP, and 56.2 mIoU on Cityscapes, outperforming the prior art by +4.2 PQ, +2.4 AP, and +4.2 mIoU on ADE20K, +4.0 PQ on Mapillary Vistas, and +20.1 PQ on Cityscapes. FC-CLIP also trains 7.5x and tests 6.6x faster than the prior art while using 5.9x fewer parameters.
Abstract
Open-vocabulary segmentation is a challenging task requiring segmenting and recognizing objects from an open set of categories. One way to address this challenge is to leverage multi-modal models, such as CLIP, to provide image and text features in a shared embedding space, which bridges the gap between closed-vocabulary and open-vocabulary recognition. Hence, existing methods often adopt a two-stage framework to tackle the problem, where the inputs first go through a mask generator and then through the CLIP model along with the predicted masks. This process involves extracting features from images multiple times, which can be ineffective and inefficient. By contrast, we propose to build everything into a single-stage framework using a shared Frozen Convolutional CLIP backbone, which not only significantly simplifies the current two-stage pipeline, but also remarkably yields a better accuracy-cost trade-off. The proposed FC-CLIP benefits from the following observations: the frozen CLIP backbone maintains the ability of open-vocabulary classification and can also serve as a strong mask generator, and the convolutional CLIP generalizes well to a larger input resolution than the one used during contrastive image-text pretraining. When training on COCO panoptic data only and testing in a zero-shot manner, FC-CLIP achieves 26.8 PQ, 16.8 AP, and 34.1 mIoU on ADE20K, 18.2 PQ, 27.9 mIoU on Mapillary Vistas, 44.0 PQ, 26.8 AP, 56.2 mIoU on Cityscapes, outperforming the prior art by +4.2 PQ, +2.4 AP, +4.2 mIoU on ADE20K, +4.0 PQ on Mapillary Vistas and +20.1 PQ on Cityscapes, respectively. Additionally, the training and testing time of FC-CLIP is 7.5x and 6.6x significantly faster than the same prior art, while using 5.9x fewer parameters. FC-CLIP also sets a new state-of-the-art performance across various open-vocabulary semantic segmentation datasets. Code at https://github.com/bytedance/fc-clip
Uncertainty Estimation and Propagation in Accelerated MRI Reconstruction
results: PHiRec produces high-quality reconstructions while providing substantially better-calibrated uncertainty estimates for the MR reconstruction process, and these uncertainties can be propagated to downstream segmentation tasks.
Abstract
MRI reconstruction techniques based on deep learning have led to unprecedented reconstruction quality, especially in highly accelerated settings. However, deep learning techniques are also known to fail unexpectedly and hallucinate structures. This is particularly problematic if reconstructions are directly used for downstream tasks such as real-time treatment guidance or automated extraction of clinical parameters (e.g. via segmentation). Well-calibrated uncertainty quantification will be a key ingredient for safe use of this technology in clinical practice. In this paper we propose a novel probabilistic reconstruction technique (PHiRec) building on the idea of conditional hierarchical variational autoencoders. We demonstrate that our proposed method produces high-quality reconstructions as well as uncertainty quantification that is substantially better calibrated than several strong baselines. We furthermore demonstrate how uncertainties arising in the MR reconstruction can be propagated to a downstream segmentation task, and show that PHiRec also allows well-calibrated estimation of segmentation uncertainties that originated in the MR reconstruction process.
results: Experimental analysis shows that the introduced reservoir model reaches the theoretical maximum short-term memory capacity. Compared to a standard ESN, ES$^2$N also offers a favorable trade-off between memory and nonlinearity, as well as a significant improvement in autoregressive nonlinear modeling.
Abstract
In this paper, we propose a new Reservoir Computing (RC) architecture, called the Edge of Stability Echo State Network (ES$^2$N). The introduced ES$^2$N model is based on defining the reservoir layer as a convex combination of a nonlinear reservoir (as in the standard ESN), and a linear reservoir that implements an orthogonal transformation. We provide a thorough mathematical analysis of the introduced model, proving that the whole eigenspectrum of the Jacobian of the ES$^2$N map can be contained in an annular neighbourhood of a complex circle of controllable radius, and exploit this property to demonstrate that the ES$^2$N's forward dynamics evolves close to the edge-of-chaos regime by design. Remarkably, our experimental analysis shows that the newly introduced reservoir model is able to reach the theoretical maximum short-term memory capacity. At the same time, in comparison to standard ESN, ES$^2$N is shown to offer a favorable trade-off between memory and nonlinearity, as well as a significant improvement of performance in autoregressive nonlinear modeling.
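The reservoir update described above admits a compact sketch. Assuming state x, input u, convex-combination weight beta, a standard nonlinear reservoir (W, W_in), and an orthogonal matrix O (here obtained from a QR decomposition); the sizes and scalings are illustrative, not the paper's experimental settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, beta = 200, 1, 0.3                 # reservoir size, input dim, convex weight

W = 0.9 * rng.normal(size=(n, n)) / np.sqrt(n)   # spectral radius ~0.9 (circular law)
W_in = rng.normal(size=(n, m))                   # input weights
O, _ = np.linalg.qr(rng.normal(size=(n, n)))     # orthogonal linear reservoir

def es2n_step(x, u):
    """ES^2N update: a convex combination of an orthogonal linear map and
    a standard ESN update, which places the Jacobian's eigenspectrum in
    an annulus around the unit circle (edge of stability)."""
    return (1.0 - beta) * (O @ x) + beta * np.tanh(W @ x + W_in @ u)

# Drive the reservoir with a scalar input sequence
x = np.zeros(n)
for t in range(100):
    x = es2n_step(x, np.array([np.sin(0.1 * t)]))
```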
Textual Data Mining for Financial Fraud Detection: A Deep Learning Approach
results: The results show that these neural network models can accurately detect financial fraud. The findings have significant implications for fraud detection, offering valuable insights to industry practitioners, regulators, and researchers developing more robust detection methods.
Abstract
In this report, I present a deep learning approach to conduct a natural language processing (hereafter NLP) binary classification task for analyzing financial-fraud texts. First, I searched for regulatory announcements and enforcement bulletins from HKEX news to define fraudulent companies and to extract their MD&A reports, before organizing the sentences from the reports with labels and reporting time. My methodology involved different kinds of neural network models, including Multilayer Perceptrons with Embedding layers, vanilla Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), and Gated Recurrent Units (GRU) for the text classification task. By utilizing this diverse set of models, I aim to perform a comprehensive comparison of their accuracy in detecting financial fraud. My results have significant implications for financial fraud detection, as this work contributes to the growing body of research at the intersection of deep learning, NLP, and finance, providing valuable insights for industry practitioners, regulators, and researchers in the pursuit of more robust and effective fraud detection methodologies.
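As a minimal illustration of one of the candidate architectures, the PyTorch sketch below defines an LSTM binary classifier over tokenized MD&A sentences; the vocabulary size, dimensions, and training details are illustrative assumptions, not the report's settings.

```python
import torch
import torch.nn as nn

class LSTMFraudClassifier(nn.Module):
    """Embedding -> LSTM -> linear head for binary fraud classification."""
    def __init__(self, vocab_size=20_000, emb_dim=128, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, token_ids):                  # (batch, seq_len)
        h, _ = self.lstm(self.emb(token_ids))      # (batch, seq_len, hidden)
        return self.head(h[:, -1, :]).squeeze(-1)  # logit from last timestep

model = LSTMFraudClassifier()
loss_fn = nn.BCEWithLogitsLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One illustrative training step on a dummy batch
tokens = torch.randint(1, 20_000, (8, 50))   # 8 sentences, 50 tokens each
labels = torch.randint(0, 2, (8,)).float()   # 1 = sentence from a fraudulent company
opt.zero_grad()
loss = loss_fn(model(tokens), labels)
loss.backward()
opt.step()
```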
paper_authors: Alireza Rafiei, Ronald Moore, Sina Jahromi, Farshid Hajati, Rishikesan Kamaleswaran
for: Surveying applications of meta-learning, which leverages prior knowledge and experience, in the healthcare domain to address its critical challenges.
methods: Categorizes the meta-learning approaches employed in healthcare into multi/single-task learning and many/few-shot learning and surveys the corresponding studies.
results: Summarizes recent meta-learning research in healthcare, including applications and challenges, and discusses open problems and future perspectives.
Abstract
As a subset of machine learning, meta-learning, or learning to learn, aims at improving the model's capabilities by employing prior knowledge and experience. A meta-learning paradigm can appropriately tackle the conventional challenges of traditional learning approaches, such as insufficient number of samples, domain shifts, and generalization. These unique characteristics position meta-learning as a suitable choice for developing influential solutions in various healthcare contexts, where the available data is often insufficient, and the data collection methodologies are different. This survey discusses meta-learning broad applications in the healthcare domain to provide insight into how and where it can address critical healthcare challenges. We first describe the theoretical foundations and pivotal methods of meta-learning. We then divide the employed meta-learning approaches in the healthcare domain into two main categories of multi/single-task learning and many/few-shot learning and survey the studies. Finally, we highlight the current challenges in meta-learning research, discuss the potential solutions and provide future perspectives on meta-learning in healthcare.
Semi-supervised Learning for Segmentation of Bleeding Regions in Video Capsule Endoscopy
paper_authors: Hechen Li, Yanan Wu, Long Bai, An Wang, Tong Chen, Hongliang Ren
for: Diagnosing various gastrointestinal (GI) conditions, including obscure bleeding.
methods: Adopts a semi-supervised learning approach based on the Mean Teacher method, constructing a student U-Net and a teacher model of the same architecture whose parameters are alternately updated during training.
results: Experiments show that the SSL approach reduces the amount of annotation required for model training without compromising accuracy.
Abstract
In the realm of modern diagnostic technology, video capsule endoscopy (VCE) is a standout for its high efficacy and non-invasive nature in diagnosing various gastrointestinal (GI) conditions, including obscure bleeding. Importantly, for the successful diagnosis and treatment of these conditions, accurate recognition of bleeding regions in VCE images is crucial. While deep learning-based methods have emerged as powerful tools for the automated analysis of VCE images, they often demand large training datasets with comprehensive annotations. Acquiring these labeled datasets tends to be time-consuming, costly, and requires significant domain expertise. To mitigate this issue, we have embraced a semi-supervised learning (SSL) approach for bleeding region segmentation within VCE. By adopting the 'Mean Teacher' method, we construct a student U-Net equipped with an scSE attention block, alongside a teacher model of the same architecture. These models' parameters are alternately updated throughout the training process. We use the Kvasir-Capsule dataset for our experiments, which encompasses various GI bleeding conditions; notably, we developed the segmentation annotations for this dataset ourselves. The findings from our experiments endorse the efficacy of the SSL-based segmentation strategy, demonstrating its capacity to reduce reliance on large volumes of annotations for model training without compromising the accuracy of identification.
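The core of the Mean Teacher setup is that the teacher is an exponential moving average (EMA) of the student, and the student learns from a supervised loss on labeled images plus a consistency loss against teacher predictions on unlabeled ones. Below is a minimal sketch of the parameter update; the loss composition in the comments and the decay value are generic Mean Teacher conventions, not the paper's exact hyperparameters.

```python
import torch

@torch.no_grad()
def update_teacher(teacher, student, alpha=0.99):
    """EMA update: teacher <- alpha * teacher + (1 - alpha) * student."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)

# Inside the training loop (sketch):
#   sup_loss  = seg_loss(student(x_labeled), y)                       # labeled data
#   cons_loss = mse(student(aug1(x_unlabeled)), teacher(aug2(x_unlabeled)))
#   (sup_loss + lambda_u * cons_loss).backward(); optimizer.step()
#   update_teacher(teacher, student)   # teacher tracks the student, no gradients
```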
Replace Scoring with Arrangement: A Contextual Set-to-Arrangement Framework for Learning-to-Rank
results: Experiments show that STARank outperforms 9 state-of-the-art methods on 2 learning-to-rank benchmark datasets and 3 top-N real-world recommendation datasets. STARank also performs better under metrics that account for contextual dependence among candidate items.
Abstract
Learning-to-rank is a core technique in the top-N recommendation task, where an ideal ranker would be a mapping from an item set to an arrangement (a.k.a. permutation). Most existing solutions fall in the paradigm of probabilistic ranking principle (PRP), i.e., first score each item in the candidate set and then perform a sort operation to generate the top ranking list. However, these approaches neglect the contextual dependence among candidate items during individual scoring, and the sort operation is non-differentiable. To bypass the above issues, we propose Set-To-Arrangement Ranking (STARank), a new framework that directly generates the permutations of the candidate items without the need for individual scoring and sorting operations, and is end-to-end differentiable. As a result, STARank can operate when only the ground-truth permutations are accessible, without requiring access to the ground-truth relevance scores for items. For this purpose, STARank first reads the candidate items in the context of the user browsing history, whose representations are fed into a Plackett-Luce module to arrange the given items into a list. To effectively utilize the given ground-truth permutations for supervising STARank, we leverage the internal consistency property of Plackett-Luce models to derive a computationally efficient list-wise loss. Experimental comparisons against 9 state-of-the-art methods on 2 learning-to-rank benchmark datasets and 3 top-N real-world recommendation datasets demonstrate the superiority of STARank in terms of conventional ranking metrics. Since these ranking metrics do not consider the effects of the contextual dependence among the items in the list, we design a new family of simulation-based ranking metrics, where existing metrics can be regarded as special cases. STARank consistently achieves better performance in terms of PBM and UBM simulation-based metrics.
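The list-wise training signal can be sketched as the negative log-likelihood of the ground-truth permutation under a Plackett-Luce model over the scores produced by the arrangement module. This is the generic PL likelihood; the paper's internal-consistency-based efficiency derivation is not reproduced here.

```python
import torch

def plackett_luce_nll(scores, perm):
    """Negative log-likelihood of a permutation under Plackett-Luce.

    scores: (n,) real-valued item scores; perm: (n,) ground-truth order
    (perm[0] is the top-ranked item). At each position i the chosen
    item competes against all items not yet placed:
        P(perm) = prod_i exp(s_{perm[i]}) / sum_{j >= i} exp(s_{perm[j]})
    """
    s = scores[perm]                      # scores rearranged into ranked order
    # logcumsumexp over the reversed sequence gives log of each suffix sum
    log_denom = torch.flip(torch.logcumsumexp(torch.flip(s, [0]), 0), [0])
    return (log_denom - s).sum()

scores = torch.tensor([2.0, 0.5, 1.0], requires_grad=True)
perm = torch.tensor([0, 2, 1])            # ground-truth arrangement
loss = plackett_luce_nll(scores, perm)
loss.backward()                           # the whole pipeline stays differentiable
```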
results: The paper presents feather's motivations and intended value proposition, along with comprehensive technical and implementation details. For example, the SDK can support multi-step models and can be extended to run automatic evaluation against held-out datasets. Note that feather is presently a dormant project; the code has been open-sourced for research purposes.
Abstract
At its core, feather was a tool that allowed model developers to build shareable user interfaces for their models in under 20 lines of code. Using the Python SDK, developers specified visual components that users would interact with (e.g. a FileUpload component to allow users to upload a file). Our service then provided 1) a URL that allowed others to access and use the model visually via a user interface; 2) an API endpoint to allow programmatic requests to a model. In this paper, we discuss feather's motivations and the value we intended to offer AI researchers and developers. For example, the SDK can support multi-step models and can be extended to run automatic evaluation against held-out datasets. We additionally provide comprehensive technical and implementation details. N.B. feather is presently a dormant project. We have open sourced our code for research purposes: https://github.com/feather-ai/
Physics-Based Task Generation through Causal Sequence of Physical Interactions
results: The researchers demonstrate the task generation methodology with the physics-based puzzle game Angry Birds and evaluate the generated tasks using a range of metrics, including physical stability, solvability using intended physical interactions, and accidental solvability using unintended solutions.
Abstract
Performing tasks in a physical environment is a crucial yet challenging problem for AI systems operating in the real world. Physics simulation-based tasks are often employed to facilitate research that addresses this challenge. In this paper, first, we present a systematic approach for defining a physical scenario using a causal sequence of physical interactions between objects. Then, we propose a methodology for generating tasks in a physics-simulating environment using these defined scenarios as inputs. Our approach enables a better understanding of the granular mechanics required for solving physics-based tasks, thereby facilitating accurate evaluation of AI systems' physical reasoning capabilities. We demonstrate our proposed task generation methodology using the physics-based puzzle game Angry Birds and evaluate the generated tasks using a range of metrics, including physical stability, solvability using intended physical interactions, and accidental solvability using unintended solutions. We believe that the tasks generated using our proposed methodology can facilitate a nuanced evaluation of physical reasoning agents, thus paving the way for the development of agents for more sophisticated real-world applications.
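A minimal sketch of one way to represent a causal sequence of physical interactions as a task-generation input; the object and interaction vocabularies below are illustrative placeholders, not the paper's exact schema.

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    """One physical interaction: `agent` acts on `patient` via `kind`."""
    agent: str     # e.g. "bird"
    patient: str   # e.g. "wood_block"
    kind: str      # e.g. "collides_with", "falls_onto", "rolls_into"

# A scenario is an ordered causal chain: each interaction is intended to
# be enabled by the previous one when the task is simulated.
scenario = [
    Interaction("bird", "wood_block", "collides_with"),
    Interaction("wood_block", "stone_block", "falls_onto"),
    Interaction("stone_block", "pig", "rolls_into"),
]

def describe(seq):
    return " -> ".join(f"{i.agent} {i.kind} {i.patient}" for i in seq)

print(describe(scenario))
```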
Multi-Agent Verification and Control with Probabilistic Model Checking
results: The paper summarizes advances in probabilistic model checking for multi-agent settings, highlights applications for which they have already been used, and outlines the key challenges for further progress in this field.
Abstract
Probabilistic model checking is a technique for formal automated reasoning about software or hardware systems that operate in the context of uncertainty or stochasticity. It builds upon ideas and techniques from a diverse range of fields, from logic, automata and graph theory, to optimisation, numerical methods and control. In recent years, probabilistic model checking has also been extended to integrate ideas from game theory, notably using models such as stochastic games and solution concepts such as equilibria, to formally verify the interaction of multiple rational agents with distinct objectives. This provides a means to reason flexibly about agents acting in either an adversarial or a collaborative fashion, and opens up opportunities to tackle new problems within, for example, artificial intelligence, robotics and autonomous systems. In this paper, we summarise some of the advances in this area, and highlight applications for which they have already been used. We discuss how the strengths of probabilistic model checking apply, or have the potential to apply, to the multi-agent setting and outline some of the key challenges required to make further progress in this field.
A Symbolic Character-Aware Model for Solving Geometry Problems
results: Achieves a new state-of-the-art on the GeoQA and Geometry3K benchmarks: on GeoQA, problem-solving accuracy improves from 60.0% to 64.1%, and on Geometry3K, the average number of solving steps drops from 6.9 to 6.0.
Abstract
AI has made significant progress in solving math problems, but geometry problems remain challenging due to their reliance on both text and diagrams. In the text description, symbolic characters such as "$\triangle$ABC" often serve as a bridge to connect the corresponding diagram. However, by simply tokenizing symbolic characters into individual letters (e.g., 'A', 'B' and 'C'), existing works fail to study them explicitly and thus lose the semantic relationship with the diagram. In this paper, we develop a symbolic character-aware model to fully explore the role of these characters in both text and diagram understanding and optimize the model under a multi-modal reasoning framework. In the text encoder, we propose merging individual symbolic characters to form one semantic unit along with geometric information from the corresponding diagram. For the diagram encoder, we pre-train it under a multi-label classification framework with the symbolic characters as labels. In addition, we enhance the geometry diagram understanding ability via a self-supervised learning method under the masked image modeling auxiliary task. By integrating the proposed model into a general encoder-decoder pipeline for solving geometry problems, we demonstrate its superiority on two benchmark datasets, including GeoQA and Geometry3K, with extensive experiments. Specifically, on GeoQA, the question-solving accuracy is increased from 60.0\% to 64.1\%, achieving a new state-of-the-art accuracy; on Geometry3K, we reduce the question average solving steps from 6.9 down to 6.0 with marginally higher solving accuracy.
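A minimal sketch of the symbolic-character merging idea: instead of letting a tokenizer split "△ABC" into four tokens, group a geometric symbol with its trailing point labels into one semantic unit. The merge rules and symbol set below are simplified assumptions, not the paper's tokenizer.

```python
import re

SYMBOLS = "△∠⊙"  # triangle, angle, and circle markers

def merge_symbolic_units(text):
    """Tokenize, keeping symbol + point labels (e.g. '△ABC') as one unit."""
    pattern = rf"[{SYMBOLS}][A-Z]+|[A-Za-z]+|\d+(?:\.\d+)?|\S"
    return re.findall(pattern, text)

print(merge_symbolic_units("In △ABC, ∠ABC = 60 and AB = 5."))
# ['In', '△ABC', ',', '∠ABC', '=', '60', 'and', 'AB', '=', '5', '.']
```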
The changing rule of human bone density with aging based on a novel definition and mensuration of bone density with computed tomography
results: The study finds that bone density declines linearly with age between 39 and 80, at a rate approximately 1.6 times faster in women than in men. The linearity of these age-related changes offers a new perspective for bone health research and clinical diagnosis.
Abstract
Osteoporosis and fragility fractures have emerged as major public health concerns in an aging population. However, measuring age-related changes in bone density using dual-energy X-ray absorptiometry offers limited personalized risk assessment due to its susceptibility to interference from various factors. In this study, we propose an innovative statistical model of bone pixel distribution in fine-segmented computed tomography (CT) images, along with a novel approach to measuring bone density based on CT values of bone pixels. Our findings indicate that bone density exhibits a linear decline with age during adulthood between the ages of 39 and 80, with the rate of decline being approximately 1.6 times faster in women than in men. This contradicts the widely accepted notion that bone density starts declining in women at menopause and in men at around 50 years of age. The linearity of age-related changes provides further insights into the dynamics of the aging human body. Consequently, our findings suggest that the definition of osteoporosis by the World Health Organization should be revised to the standard deviation of age-based bone density. Furthermore, these results open up new avenues for research in bone health care and clinical investigation of osteoporosis.
Artificial Intelligence for Molecular Communication
paper_authors: Max Bartunik, Jens Kirchner, Oliver Keszocze
for: Investigating molecular communication, in particular its use for data transmission in medical devices.
methods: Uses artificial neural networks to demodulate noisy received signals.
results: The study finds that artificial neural networks can reliably classify noisy received signals.
Abstract
Molecular communication is a novel approach for data transmission between miniaturized devices, especially in contexts where electrical signals are to be avoided. The communication is based on sending molecules (or other particles) at nano scale through a channel instead of sending electrons over a wire. Molecular communication devices have a large potential in medical applications as they offer an alternative to antenna-based transmission systems that may not be applicable due to size, temperature, or radiation constraints. The communication is achieved by transforming a digital signal into concentrations of molecules. These molecules are then detected at the other end of the communication channel and transformed back into a digital signal. Accurately modeling the transmission channel is often not possible, which may be due to a lack of data or time-varying parameters of the channel (e. g., the movements of a person wearing a medical device). This makes demodulation of the signal very difficult. Many approaches for demodulation have been discussed, with one particular approach having tremendous success: artificial neural networks. These networks imitate the decision process in the human brain and are capable of reliably classifying noisy input data. Training such a network relies on a large set of training data. As molecular communication as a technology is still in its early development phase, this data is not always readily available. We discuss neural network-based demodulation approaches relying on synthetic data based on theoretical channel models, as well as works using actual measurements produced by a prototype test bed. In this work, we give a general overview of the field of molecular communication, discuss the challenges in the demodulation process of transmitted signals, and present approaches to these challenges that are based on artificial neural networks.
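As a minimal illustration of the neural demodulation idea, the sketch below trains an MLP to map a sampled concentration pulse back to the transmitted bit, using a crude synthetic channel. The channel model, noise level, and network size are assumptions for illustration, not the test-bed's characteristics.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
T = 32  # samples per symbol interval

def synth_symbol(bit):
    """Crude channel: bit 1 emits a decaying concentration pulse plus noise."""
    t = np.arange(T)
    pulse = bit * t * np.exp(-t / 6.0)       # toy molecule-arrival profile
    return pulse + 0.3 * rng.normal(size=T)  # counting/background noise

bits = rng.integers(0, 2, size=2000)
X = np.stack([synth_symbol(b) for b in bits])

# Train on 1500 symbols, evaluate demodulation accuracy on the rest
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300)
clf.fit(X[:1500], bits[:1500])
print("held-out accuracy:", clf.score(X[1500:], bits[1500:]))
```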
A generative model for surrogates of spatial-temporal wildfire nowcasting
for: This paper aims to provide a generative model for real-time wildfire nowcasting, using a three-dimensional Vector-Quantized Variational Autoencoder to generate spatial-temporal sequences of unseen wildfire burned areas in a given ecoregion.
methods: The proposed method uses a generative model based on a three-dimensional Vector-Quantized Variational Autoencoder to generate coherent and structured fire scenarios, taking into account the impact of geophysical variables such as vegetation and slope.
results: The generated data are used to train a surrogate model for predicting wildfire dissemination, which has been tested on both simulation data and the real Chimney fire event, showing promising results.
Abstract
The recent increase in wildfires worldwide has led to the need for real-time fire nowcasting. Physics-driven models, such as cellular automata and computational fluid dynamics, can provide high-fidelity fire spread simulations but they are computationally expensive and time-consuming. Much effort has been put into developing machine learning models for fire prediction. However, these models are often region-specific and require a substantial quantity of simulation data for training purposes. This results in a significant amount of computational effort for different ecoregions. In this work, a generative model is proposed using a three-dimensional Vector-Quantized Variational Autoencoder to generate spatial-temporal sequences of unseen wildfire burned areas in a given ecoregion. The model is tested in the ecoregion of a recent massive wildfire event in California, known as the Chimney fire. Numerical results show that the model succeeds in generating coherent and structured fire scenarios, taking into account the impact from geophysical variables, such as vegetation and slope. Generated data are also used to train a surrogate model for predicting wildfire dissemination, which has been tested on both simulation data and the real Chimney fire event.
MiAMix: Enhancing Image Classification through a Multi-stage Augmented Mixed Sample Data Augmentation Method
paper_authors: Wen Liang, Youzhi Liang, Jianguo Jia
for: Proposes MiAMix, a multi-stage augmented mixup method, to improve the performance and generalization of deep learning models.
methods: Integrates image augmentation into the mixup framework, applies multiple diversified mixing methods concurrently, and randomly selects mixing mask augmentation methods to improve the mixing process.
results: Evaluations on four image benchmarks show that MiAMix improves model performance without heavy computational overhead, outperforming existing mixed sample data augmentation methods.
Abstract
Despite substantial progress in the field of deep learning, overfitting persists as a critical challenge, and data augmentation has emerged as a particularly promising approach due to its capacity to enhance model generalization in various computer vision tasks. While various strategies have been proposed, Mixed Sample Data Augmentation (MSDA) has shown great potential for enhancing model performance and generalization. We introduce a novel mixup method called MiAMix, which stands for Multi-stage Augmented Mixup. MiAMix integrates image augmentation into the mixup framework, utilizes multiple diversified mixing methods concurrently, and improves the mixing method by randomly selecting mixing mask augmentation methods. Recent methods utilize saliency information, and MiAMix is designed for computational efficiency as well, reducing additional overhead and offering easy integration into existing training pipelines. We comprehensively evaluate MiAMix using four image benchmarks and pit it against current state-of-the-art mixed sample data augmentation techniques to demonstrate that MiAMix improves performance without heavy computational overhead.
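For context, the sketch below shows a plain mixup step plus one mask-based mixing variant chosen at random, gesturing at the multi-method sampling MiAMix performs; the actual MiAMix stages, mask augmentations, and hyperparameters differ from this simplified version.

```python
import numpy as np

def mixed_sample(x1, y1, x2, y2, alpha=1.0, rng=np.random.default_rng()):
    """Mix two (image, one-hot label) pairs with a randomly chosen method."""
    lam = rng.beta(alpha, alpha)
    if rng.random() < 0.5:
        # Global mixup: convex combination of whole images
        x = lam * x1 + (1.0 - lam) * x2
    else:
        # Mask-based mixing: paste a random rectangle of x2 into x1
        h, w = x1.shape[:2]
        rh, rw = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
        r0, c0 = rng.integers(0, h - rh + 1), rng.integers(0, w - rw + 1)
        x = x1.copy()
        x[r0:r0 + rh, c0:c0 + rw] = x2[r0:r0 + rh, c0:c0 + rw]
        lam = 1.0 - (rh * rw) / (h * w)   # label weight follows the mask area
    return x, lam * y1 + (1.0 - lam) * y2

x, y = mixed_sample(np.zeros((32, 32, 3)), np.array([1.0, 0.0]),
                    np.ones((32, 32, 3)), np.array([0.0, 1.0]))
```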
Crowdsourcing Fraud Detection over Heterogeneous Temporal MMMA Graph
results: Deployed on an industry-size HTG of a representative MMMA, CMT significantly outperforms other methods. CMT also shows promising results for fraud detection on a large-scale public financial HTG, indicating that it can be applied to other graph anomaly detection tasks.
Abstract
The rise of the click farm business using Multi-purpose Messaging Mobile Apps (MMMAs) tempts cybercriminals to perpetrate crowdsourcing frauds that cause financial losses to click farm workers. In this paper, we propose a novel contrastive multi-view learning method named CMT for crowdsourcing fraud detection over the heterogeneous temporal graph (HTG) of MMMA. CMT captures both heterogeneity and dynamics of HTG and generates high-quality representations for crowdsourcing fraud detection in a self-supervised manner. We deploy CMT to detect crowdsourcing frauds on an industry-size HTG of a representative MMMA WeChat and it significantly outperforms other methods. CMT also shows promising results for fraud detection on a large-scale public financial HTG, indicating that it can be applied in other graph anomaly detection tasks.
Solving Logistic-Oriented Bin Packing Problems Through a Hybrid Quantum-Classical Approach
results: The paper tests features including heterogeneous bins; one-, two-, and three-dimensional instances; item-bin association requirements; and delivery priorities, as well as Q4RealBPP's ability to solve real-world oriented instances.
Abstract
The Bin Packing Problem is a classic problem with wide industrial applicability. In fact, the efficient packing of items into bins is one of the toughest challenges in many logistic corporations and is a critical issue for reducing storage costs or improving vehicle space allocation. In this work, we resort to our previously published quantum-classical framework known as Q4RealBPP, and elaborate on the solving of real-world oriented instances of the Bin Packing Problem. With this purpose, this paper focuses on the following characteristics: i) the existence of heterogeneous bins, ii) the extension of the framework to solve not only three-dimensional, but also one- and two-dimensional instances of the problem, iii) requirements for item-bin associations, and iv) delivery priorities. All these features have been tested in this paper, as well as the ability of Q4RealBPP to solve real-world oriented instances.
Semi-supervised Contrastive Regression for Estimation of Eye Gaze
results: The study finds that the contrastive learning framework can obtain a generalized solution from a small labeled gaze dataset and performs better than several existing contrastive learning techniques.
Abstract
With the escalating demand for human-machine interfaces in intelligent systems, the development of gaze-controlled systems has become a necessity. Gaze, being a non-intrusive form of human interaction, is one of the best-suited approaches. Appearance-based deep learning models are the most widely used for gaze estimation, but the performance of these models is entirely influenced by the size of the labeled gaze dataset, which in turn affects generalization performance. This paper aims to develop a semi-supervised contrastive learning framework for estimation of gaze direction. With a small labeled gaze dataset, the framework is able to find a generalized solution even for unseen face images. In this paper, we propose a new contrastive loss paradigm that maximizes the similarity agreement between similar images while reducing the redundancy in embedding representations. Our contrastive regression framework shows good performance in comparison to several state-of-the-art contrastive learning techniques used for gaze estimation.
Surrogate Empowered Sim2Real Transfer of Deep Reinforcement Learning for ORC Superheat Control
results: Experimental results show that the proposed Sim2Real transfer learning control method greatly improves the training speed of DRL on the ORC control problem and resolves the agent's generalization issues under multiple operating conditions.
Abstract
The Organic Rankine Cycle (ORC) is widely used in industrial waste heat recovery due to its simple structure and easy maintenance. However, in the context of smart manufacturing in the process industry, traditional model-based optimization control methods are unable to adapt to the varying operating conditions of the ORC system or sudden changes in operating modes. Deep reinforcement learning (DRL) has significant advantages in situations with uncertainty as it directly achieves control objectives by interacting with the environment without requiring an explicit model of the controlled plant. Nevertheless, direct application of DRL to physical ORC systems presents unacceptable safety risks, and its generalization performance under model-plant mismatch is insufficient to support ORC control requirements. Therefore, this paper proposes a Sim2Real transfer learning-based DRL control method for ORC superheat control, which aims to provide a new simple, feasible, and user-friendly solution for energy system optimization control. Experimental results show that the proposed method greatly improves the training speed of DRL in ORC control problems and solves the generalization performance issue of the agent under multiple operating conditions through Sim2Real transfer.
for: The paper is written for those interested in 3D representation and view synthesis, particularly in the context of Neural Radiance Fields (NeRFs) and their applications in computer graphics and vision.
methods: The paper uses a review of the NeRF representation and its applications, as well as a historical perspective on the development of 3D representation for view synthesis and related problems.
results: The paper provides insights into the current state of NeRF representations and their applications, as well as new developments and future directions in 3D representation.
Abstract
Neural Radiance Fields or NeRFs have become the representation of choice for problems in view synthesis or image-based rendering, as well as in many other applications across computer graphics and vision, and beyond. At their core, NeRFs describe a new representation of 3D scenes or 3D geometry. Instead of meshes, disparity maps, multiplane images or even voxel grids, they represent the scene as a continuous volume, with volumetric parameters like view-dependent radiance and volume density obtained by querying a neural network. The NeRF representation has now been widely used, with thousands of papers extending or building on it every year, multiple authors and websites providing overviews and surveys, and numerous industrial applications and startup companies. In this article, we briefly review the NeRF representation, and describe the three decades-long quest to find the best 3D representation for view synthesis and related problems, culminating in the NeRF papers. We then describe new developments in terms of NeRF representations and make some observations and insights regarding the future of 3D representations.
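For reference, the volume rendering step at the core of the NeRF representation: a ray's color is the alpha-composite of the densities sigma_i and view-dependent radiances c_i the network predicts at samples along the ray, with sample spacings delta_i. This is the standard NeRF quadrature rather than anything specific to this review.

```latex
\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) \mathbf{c}_i,
\qquad
T_i = \exp\!\Big(-\sum_{j=1}^{i-1} \sigma_j \delta_j\Big)
```

Here $T_i$ is the accumulated transmittance, i.e. the probability that the ray travels from its origin to sample $i$ without being absorbed.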
Nonlinear Controller Design for a Quadrotor with Inverted Pendulum
results: Achieves trajectory tracking and stabilizing control for the quadrotor alone, as well as trajectory tracking for the combined quadrotor-pendulum system.
Abstract
The quadrotor is a $6$ degrees-of-freedom (DoF) system with underactuation. Adding a spherical pendulum on top of a quadrotor further complicates the task of achieving any output tracking while stabilizing the rest. In this report, we present different types of controllers for the nonlinear dynamical system of quadrotor and pendulum combination, utilizing feedback-linearization and control Lyapunov function with quadratic programming (CLF-QP) approaches. We demonstrated trajectory tracking for quadrotor-only case as well as quadrotor-pendulum-combined case.
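A generic CLF-QP sketch, not the report's specific quadrotor-pendulum dynamics: given control-affine dynamics xdot = f(x) + g(x)u and a control Lyapunov function V, the controller picks the minimum-norm input enforcing Vdot <= -gamma*V, solved as a small QP at every control step. The actuator bound and gain below are placeholders.

```python
import cvxpy as cp
import numpy as np

def clf_qp_control(LfV, LgV, V, gamma=1.0, u_max=10.0):
    """Min-norm control satisfying the CLF decrease condition.

    LfV, LgV: Lie derivatives of V along f and g at the current state,
    so that Vdot = LfV + LgV @ u. Returns the QP-optimal input.
    """
    m = LgV.shape[0]
    u = cp.Variable(m)
    constraints = [LfV + LgV @ u <= -gamma * V,   # CLF decrease condition
                   cp.norm(u, "inf") <= u_max]    # actuator limits
    prob = cp.Problem(cp.Minimize(cp.sum_squares(u)), constraints)
    prob.solve()
    return u.value

# Toy example: scalar system xdot = x + u with V = x^2/2,
# so LfV = x^2 and LgV = [x]
x = 2.0
u = clf_qp_control(LfV=x * x, LgV=np.array([x]), V=0.5 * x * x)
```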
Assessing the impact of emergency department short stay units using length-of-stay prediction and discrete event simulation
results: Our results show that the recommendation performance of the proposed approaches is generally acceptable and does not benefit from feature selection. The results also indicate that hospital length-of-stay can be predicted with reasonable accuracy (e.g., an AUC of 0.69 for classifying short- and long-stay patients) using patient admission demographics, laboratory test results, diagnostic imaging, vital signs, and clinical documentation.
Abstract
Accurately predicting hospital length-of-stay at the time a patient is admitted to hospital may help guide clinical decision making and resource allocation. In this study we aim to build a decision support system that predicts hospital length-of-stay for patients admitted to general internal medicine from the emergency department. We conduct an exploratory data analysis and employ feature selection methods to identify the attributes that result in the best predictive performance. We also develop a discrete-event simulation model to assess the performances of the prediction models in a practical setting. Our results show that the recommendation performances of the proposed approaches are generally acceptable and do not benefit from the feature selection. Further, the results indicate that hospital length-of-stay could be predicted with reasonable accuracy (e.g., AUC value for classifying short and long stay patients is 0.69) using patient admission demographics, laboratory test results, diagnostic imaging, vital signs and clinical documentation.
Towards Improving Harmonic Sensitivity and Prediction Stability for Singing Melody Extraction
results: The proposed modifications are applied to several models, including MSNet, FTANet, and a newly introduced model, PianoNet, and experiments confirm that they are empirically effective for singing melody extraction.
Abstract
In deep learning research, many melody extraction models rely on redesigning neural network architectures to improve performance. In this paper, we propose an input feature modification and a training objective modification based on two assumptions. First, harmonics in the spectrograms of audio data decay rapidly along the frequency axis. To enhance the model's sensitivity on the trailing harmonics, we modify the Combined Frequency and Periodicity (CFP) representation using discrete z-transform. Second, the vocal and non-vocal segments with extremely short duration are uncommon. To ensure a more stable melody contour, we design a differentiable loss function that prevents the model from predicting such segments. We apply these modifications to several models, including MSNet, FTANet, and a newly introduced model, PianoNet, modified from a piano transcription network. Our experimental results demonstrate that the proposed modifications are empirically effective for singing melody extraction.
Solving Witness-type Triangle Puzzles Faster with an Automatically Learned Human-Explainable Predicate
results: Using the learned predicate to prune successor states accelerates search by an average of six times while maintaining completeness; under a fixed search time budget per puzzle, the predicate-accelerated search solves more and larger puzzle instances than the baseline.
Abstract
Automatically solving puzzle instances in the game The Witness can guide players toward solutions and help puzzle designers generate better puzzles. In the latter case, such an artificial intelligence puzzle solver can inform a human puzzle designer and procedural puzzle generator to produce better instances. The puzzles, however, are combinatorially difficult, and search-based solvers can require large amounts of time and memory. We accelerate such search by automatically learning a human-explainable predicate that predicts whether a partial path to a Witness-type puzzle is not completable to a solution path. We prove a key property of the learned predicate which allows us to use it for pruning successor states in search, thereby accelerating search by an average of six times while maintaining completeness of the underlying search. Conversely, given a fixed search time budget per puzzle, our predicate-accelerated search can solve more puzzle instances of larger sizes than the baseline search.
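In sketch form, here is how the learned predicate plugs into search: a depth-first search over partial paths skips any successor the predicate flags as not completable. Soundness of the predicate (the proved property) is what preserves completeness. The puzzle interface names below are placeholders, not the authors' code.

```python
def solve(puzzle, start, dead_end):
    """DFS over partial paths; `dead_end(path)` is the learned predicate.

    Because the predicate returns True only for partial paths provably
    not completable to a solution, pruning on it keeps the underlying
    search complete while shrinking the explored state space.
    """
    stack = [[start]]
    while stack:
        path = stack.pop()
        if puzzle.is_solution(path):
            return path
        for nxt in puzzle.successors(path):
            new_path = path + [nxt]
            if not dead_end(new_path):   # predicate-based pruning
                stack.append(new_path)
    return None
```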
Let’s Give a Voice to Conversational Agents in Virtual Reality
results: The authors developed two conversational prototypes in Unity, one for non-immersive displays and one for VR headsets, and demonstrated them in the digital health domain.
Abstract
The dialogue experience with conversational agents can be greatly enhanced with multimodal and immersive interactions in virtual reality. In this work, we present an open-source architecture with the goal of simplifying the development of conversational agents operating in virtual environments. The architecture offers the possibility of plugging in conversational agents of different domains and adding custom or cloud-based Speech-To-Text and Text-To-Speech models to make the interaction voice-based. Using this architecture, we present two conversational prototypes operating in the digital health domain developed in Unity for both non-immersive displays and VR headsets.
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
paper_authors: Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, Lijuan Wang
For: The paper proposes an evaluation benchmark called MM-Vet to evaluate large multimodal models (LMMs) on complicated multimodal tasks.* Methods: The paper uses six core vision-language (VL) capabilities to define the tasks and evaluates the 16 integrations of interest derived from the capability combination. The paper also proposes an LLM-based evaluator for open-ended outputs to evaluate the models across different question types and answer styles.* Results: The paper evaluates representative LMMs on MM-Vet and provides insights into the capabilities of different LMM system paradigms and models.Abstract
We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks. Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes. Rapid model advancements pose challenges to evaluation benchmark development. Problems include: (1) How to systematically structure and evaluate the complicated multimodal tasks; (2) How to design evaluation metrics that work well across question and answer types; and (3) How to give model insights beyond a simple performance ranking. To this end, we present MM-Vet, designed based on the insight that the intriguing ability to solve complicated tasks is often achieved by a generalist model being able to integrate different core vision-language (VL) capabilities. MM-Vet defines 6 core VL capabilities and examines the 16 integrations of interest derived from the capability combination. For evaluation metrics, we propose an LLM-based evaluator for open-ended outputs. The evaluator enables the evaluation across different question types and answer styles, resulting in a unified scoring metric. We evaluate representative LMMs on MM-Vet, providing insights into the capabilities of different LMM system paradigms and models. Code and data are available at https://github.com/yuweihao/MM-Vet.
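To illustrate the idea of an LLM-based evaluator for open-ended outputs, here is a minimal sketch in which any prompt-to-completion callable grades a prediction against a reference and returns a unified score; the prompt wording and the `llm` interface are assumptions, not MM-Vet's actual evaluator.

```python
GRADER_PROMPT = """Compare the model answer with the ground truth.
Question: {question}
Ground truth: {gold}
Model answer: {pred}
Give a correctness score between 0.0 and 1.0. Reply with the number only."""

def llm_score(llm, question: str, gold: str, pred: str) -> float:
    """Score one open-ended answer with an LLM grader.
    `llm` is any callable mapping a prompt string to a completion string."""
    reply = llm(GRADER_PROMPT.format(question=question, gold=gold, pred=pred))
    try:
        return min(max(float(reply.strip()), 0.0), 1.0)
    except ValueError:
        return 0.0  # unparsable grader output counts as incorrect
```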
Generation of Realistic Synthetic Raw Radar Data for Automated Driving Applications using Generative Adversarial Networks
results: The method enables the generation of synthetic raw radar data, allows comparison against real measurements, and increases the potential of algorithms that process radar data.Abstract
The main approaches for simulating FMCW radar are based on ray tracing, which is usually computationally intensive and does not account for background noise. This work proposes a faster method for FMCW radar simulation capable of generating synthetic raw radar data using generative adversarial networks (GANs). The code and pre-trained weights are open-source and available on GitHub. This method generates 16 simultaneous chirps, which allows the generated data to be used for the further development of algorithms for processing radar data (filtering and clustering). This can increase the potential for data augmentation, e.g., by generating data for non-existent or safety-critical scenarios that are not reproducible in real life. In this work, the GAN was trained with radar measurements of a motorcycle and used to generate synthetic raw radar data of a motorcycle traveling in a straight line. For generating this data, the distance of the motorcycle and Gaussian noise are used as input to the neural network. The synthetic generated radar chirps were evaluated using the Frechet Inception Distance (FID). Then, the Range-Azimuth (RA) map is calculated twice: first, based on synthetic data using this GAN and, second, based on real data. Based on these RA maps, an algorithm with adaptive threshold and edge detection is used for object detection. The results have shown that the data is realistic in terms of coherent radar reflections of the motorcycle and background noise, based on comparison of the chirps, the RA maps and the object detection results. Thus, the proposed method has been shown to minimize the simulation-to-reality gap for the generation of radar data.
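A minimal sketch of the kind of conditional generator the abstract describes, mapping the motorcycle's distance plus Gaussian noise to a batch of 16 chirps; the layer sizes, samples per chirp, and real-valued output are illustrative assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class ChirpGenerator(nn.Module):
    """Conditional generator: (distance, noise) -> 16 simultaneous chirps."""
    def __init__(self, noise_dim: int = 128, samples_per_chirp: int = 512,
                 n_chirps: int = 16):
        super().__init__()
        self.out_shape = (n_chirps, samples_per_chirp)
        self.net = nn.Sequential(
            nn.Linear(noise_dim + 1, 512), nn.ReLU(),
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, n_chirps * samples_per_chirp), nn.Tanh(),
        )

    def forward(self, distance: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # distance: (B, 1) conditioning input, z: (B, noise_dim) ~ N(0, I)
        out = self.net(torch.cat([distance, z], dim=1))
        return out.view(-1, *self.out_shape)
```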
Nonprehensile Planar Manipulation through Reinforcement Learning with Multimodal Categorical Exploration
results: The paper shows that multimodal exploration lets RL controllers manipulate objects flexibly across different starting and target object positions and orientations, is robust to external disturbances and observation noise, and achieves improved accuracy and smooth trajectories compared to previous RL literature. Furthermore, the learned policies were shown to be transferable to physical robot hardware.Abstract
Developing robot controllers capable of achieving dexterous nonprehensile manipulation, such as pushing an object on a table, is challenging. The underactuated and hybrid-dynamics nature of the problem, further complicated by the uncertainty resulting from the frictional interactions, requires sophisticated control behaviors. Reinforcement Learning (RL) is a powerful framework for developing such robot controllers. However, previous RL literature addressing the nonprehensile pushing task achieves low accuracy, non-smooth trajectories, and only simple motions, i.e. without rotation of the manipulated object. We conjecture that previously used unimodal exploration strategies fail to capture the inherent hybrid-dynamics of the task, arising from the different possible contact interaction modes between the robot and the object, such as sticking, sliding, and separation. In this work, we propose a multimodal exploration approach through categorical distributions, which enables us to train planar pushing RL policies for arbitrary starting and target object poses, i.e. positions and orientations, and with improved accuracy. We show that the learned policies are robust to external disturbances and observation noise, and scale to tasks with multiple pushers. Furthermore, we validate the transferability of the learned policies, trained entirely in simulation, to a physical robot hardware using the KUKA iiwa robot arm. See our supplemental video: https://youtu.be/vTdva1mgrk4.
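To make the categorical-exploration idea concrete, here is a minimal sketch of a policy whose action head is one categorical distribution per discretised planar action dimension, so probability mass can split across several bins and capture distinct contact modes; the network sizes and bin counts are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class CategoricalPushPolicy(nn.Module):
    """Planar pushing policy with multimodal categorical action heads."""
    def __init__(self, obs_dim: int, n_action_dims: int = 2, n_bins: int = 11):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(obs_dim, 256), nn.Tanh(),
                                      nn.Linear(256, 256), nn.Tanh())
        self.heads = nn.ModuleList(
            nn.Linear(256, n_bins) for _ in range(n_action_dims))

    def forward(self, obs: torch.Tensor):
        h = self.backbone(obs)
        # Unlike a unimodal Gaussian, each categorical can place mass on
        # several bins at once (e.g., sticking vs. sliding contact modes).
        return [Categorical(logits=head(h)) for head in self.heads]
```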
A Survey on Temporal Knowledge Graph Completion: Taxonomy, Progress, and Prospects
results: The paper provides a detailed review and taxonomy of TKGC methods and outlines future research directions.Abstract
Temporal characteristics are prominently evident in a substantial volume of knowledge, which underscores the pivotal role of Temporal Knowledge Graphs (TKGs) in both academia and industry. However, TKGs often suffer from incompleteness for three main reasons: the continuous emergence of new knowledge, the weakness of algorithms for extracting structured information from unstructured data, and the lack of information in the source dataset. Thus, the task of Temporal Knowledge Graph Completion (TKGC) has attracted increasing attention, aiming to predict missing items based on the available information. In this paper, we provide a comprehensive review of TKGC methods and their details. Specifically, this paper mainly consists of three components, namely: 1) Background, which covers the preliminaries of TKGC methods, the loss functions required for training, and the dataset and evaluation protocol; 2) Interpolation, which estimates and predicts missing elements or sets of elements from the relevant available information, and further categorizes related TKGC methods based on how they process temporal information; and 3) Extrapolation, which typically focuses on continuous TKGs and predicts future events, and classifies all extrapolation methods based on the algorithms they utilize. We further pinpoint the challenges and discuss future research directions of TKGC.
From Military to Healthcare: Adopting and Expanding Ethical Principles for Generative Artificial Intelligence
paper_authors: David Oniani, Jordan Hilsman, Yifan Peng, COL Ronald K. Poropatich, COL Jeremy C. Pamplin, LTC Gary L. Legault, Yanshan Wang
For: The paper is written to propose ethical principles for the use of generative AI in healthcare, with the goal of addressing ethical dilemmas and challenges posed by the technology.* Methods: The paper uses a framework called GREAT PLEA to outline the ethical principles for generative AI in healthcare, which includes governance, reliability, equity, accountability, traceability, privacy, lawfulness, empathy, and autonomy.* Results: The paper aims to provide a proactive approach to addressing the ethical challenges of generative AI in healthcare, with the goal of ensuring the technology is used in a responsible and ethical manner.Abstract
In 2020, the U.S. Department of Defense officially disclosed a set of ethical principles to guide the use of Artificial Intelligence (AI) technologies on future battlefields. Despite stark differences, there are core similarities between the military and medical service. Warriors on battlefields often face life-altering circumstances that require quick decision-making. Medical providers experience similar challenges in a rapidly changing healthcare environment, such as in the emergency department or during surgery treating a life-threatening condition. Generative AI, an emerging technology designed to efficiently generate valuable information, holds great promise. As computing power becomes more accessible and the abundance of health data, such as electronic health records, electrocardiograms, and medical images, increases, it is inevitable that healthcare will be revolutionized by this technology. Recently, generative AI has captivated the research community, leading to debates about its application in healthcare, mainly due to concerns about transparency and related issues. Meanwhile, concerns about the potential exacerbation of health disparities due to modeling biases have raised notable ethical concerns regarding the use of this technology in healthcare. However, the ethical principles for generative AI in healthcare have been understudied, and decision-makers often fail to consider the significance of generative AI. In this paper, we propose GREAT PLEA ethical principles, encompassing governance, reliability, equity, accountability, traceability, privacy, lawfulness, empathy, and autonomy, for generative AI in healthcare. We aim to proactively address the ethical dilemmas and challenges posed by the integration of generative AI in healthcare.
results: Achieves state-of-the-art results on public datasets, covering both zero-shot and few-shot adaptation.Abstract
Cross-lingual adaptation has proven effective in spoken language understanding (SLU) systems with limited resources. Existing methods are frequently unsatisfactory for intent detection and slot filling, particularly for distant languages that differ significantly from the source language in scripts, morphology, and syntax. A Latent Dialogue Action (LaDA) layer is proposed to optimize the decoding strategy in order to address the aforementioned issues. The model consists of an additional layer of latent dialogue action. It enables our model to improve a system's capability of handling conversations with complex multilingual intent and slot values of distant languages. To the best of our knowledge, this is the first exhaustive investigation of the use of latent variables for optimizing cross-lingual SLU policy during the decode stage. LaDA obtains state-of-the-art results on public datasets for both zero-shot and few-shot adaptation.
ApproBiVT: Lead ASR Models to Generalize Better Using Approximated Bias-Variance Tradeoff Guided Early Stopping and Checkpoint Averaging
paper_authors: Fangyuan Wang, Ming Hao, Yuhai Shi, Bo Xu
for: The paper aims to improve the performance of ASR models.
methods: The authors rethink and update the early stopping and checkpoint averaging methods based on the bias-variance tradeoff, using training loss and validation loss as proxies for bias and variance.
results: The proposed method provides a 2.5%-3.7% and 3.1%-4.6% CER reduction on the AISHELL-1 and AISHELL-2 datasets, respectively, when evaluated with advanced ASR models.Abstract
The conventional recipe for Automatic Speech Recognition (ASR) models is to 1) train multiple checkpoints on a training set while relying on a validation set to prevent overfitting using early stopping and 2) average several last checkpoints or that of the lowest validation losses to obtain the final model. In this paper, we rethink and update the early stopping and checkpoint averaging from the perspective of the bias-variance tradeoff. Theoretically, the bias and variance represent the fitness and variability of a model and the tradeoff of them determines the overall generalization error. But, it's impractical to evaluate them precisely. As an alternative, we take the training loss and validation loss as proxies of bias and variance and guide the early stopping and checkpoint averaging using their tradeoff, namely an Approximated Bias-Variance Tradeoff (ApproBiVT). When evaluating with advanced ASR models, our recipe provides 2.5%-3.7% and 3.1%-4.6% CER reduction on the AISHELL-1 and AISHELL-2, respectively.
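A minimal sketch of the underlying idea, selecting and averaging checkpoints by a proxy bias-variance score built from training and validation losses; the simple additive score and top-k selection rule are assumptions for illustration, not the paper's exact ApproBiVT criterion.

```python
def tradeoff_score(train_loss: float, val_loss: float) -> float:
    """Proxy generalization score: training loss stands in for bias,
    validation loss for variance (a deliberate simplification)."""
    return train_loss + val_loss

def select_and_average(checkpoints, k: int = 5):
    """checkpoints: list of (state_dict, train_loss, val_loss) tuples.
    Averages the parameters of the k best checkpoints by the proxy score."""
    best = sorted(checkpoints, key=lambda c: tradeoff_score(c[1], c[2]))[:k]
    averaged = {}
    for name in best[0][0]:
        averaged[name] = sum(sd[name] for sd, _, _ in best) / len(best)
    return averaged
```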
EduChat: A Large-Scale Language Model-based Chatbot System for Intelligent Education
results: EduChat is already available online as an open-source project, with its code, data, and model parameters released on platforms such as GitHub and Hugging Face, together with a demonstration of its capabilities (https://vimeo.com/851004454), aiming to promote research on and applications of LLMs for intelligent education.Abstract
EduChat (https://www.educhat.top/) is a large-scale language model (LLM)-based chatbot system in the education domain. Its goal is to support personalized, fair, and compassionate intelligent education, serving teachers, students, and parents. Guided by theories from psychology and education, it further strengthens educational functions such as open question answering, essay assessment, Socratic teaching, and emotional support based on the existing basic LLMs. Particularly, we learn domain-specific knowledge by pre-training on the educational corpus and stimulate various skills with tool use by fine-tuning on designed system prompts and instructions. Currently, EduChat is available online as an open-source project, with its code, data, and model parameters available on platforms (e.g., GitHub https://github.com/icalk-nlp/EduChat, Hugging Face https://huggingface.co/ecnu-icalk ). We also prepare a demonstration of its capabilities online (https://vimeo.com/851004454). This initiative aims to promote research and applications of LLMs for intelligent education.
Meta-Tsallis-Entropy Minimization: A New Self-Training Approach for Domain Adaptation on Text Classification
results: Experimental results show that MTEM improves the adaptation performance of BERT by an average of 4 percent on benchmark datasets.Abstract
Text classification is a fundamental task for natural language processing, and adapting text classification models across domains has broad applications. Self-training generates pseudo-examples from the model's predictions and iteratively trains on the pseudo-examples, i.e., minimizes the loss on the source domain and the Gibbs entropy on the target domain. However, Gibbs entropy is sensitive to prediction errors, and thus, self-training tends to fail when the domain shift is large. In this paper, we propose Meta-Tsallis Entropy minimization (MTEM), which applies a meta-learning algorithm to optimize the instance adaptive Tsallis entropy on the target domain. To reduce the computation cost of MTEM, we propose an approximation technique to approximate the Second-order derivation involved in the meta-learning. To efficiently generate pseudo labels, we propose an annealing sampling mechanism for exploring the model's prediction probability. Theoretically, we prove the convergence of the meta-learning algorithm in MTEM and analyze the effectiveness of MTEM in achieving domain adaptation. Experimentally, MTEM improves the adaptation performance of BERT with an average of 4 percent on the benchmark dataset.
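For reference, the standard Tsallis entropy that the instance-adaptive objective builds on is (textbook definition, not the paper's instance-adaptive variant):

$$ S_q(p) = \frac{1}{q-1}\Big(1 - \sum_i p_i^{\,q}\Big), \qquad \lim_{q \to 1} S_q(p) = -\sum_i p_i \log p_i, $$

so as $q \to 1$ it recovers the Gibbs (Shannon) entropy minimised by plain self-training, while other values of $q$ temper the entropy's sensitivity to prediction errors, the weakness of Gibbs entropy noted above.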
results: The study finds that traditional models generalize better to out-of-distribution data, while deep learning models may perform better on specific tasks; the best model can depend on the task at hand.Abstract
Automatic fake news detection with machine learning can prevent the dissemination of false statements before they gain many views. Several datasets labeling statements as legitimate or false have been created since the 2016 United States presidential election for the prospect of training machine learning models. We evaluate the robustness of both traditional and deep state-of-the-art models to gauge how well they may perform in the real world. We find that traditional models tend to generalize better to data outside the distribution it was trained on compared to more recently-developed large language models, though the best model to use may depend on the specific task at hand.
Forget Demonstrations, Focus on Learning from Textual Instructions
for: Zero-shot cross-task generalization in demonstration-free learning from textual instructions.
methods: automatically find critical sentences in the definition, and a ranking objective to force the model to generate gold outputs with higher probabilities when those critical parts are highlighted.
results: state-of-the-art performance on a challenging benchmark.Abstract
This work studies a challenging yet more realistic setting for zero-shot cross-task generalization: demonstration-free learning from textual instructions, presuming the existence of a paragraph-style task definition while no demonstrations exist. To better learn the task supervision from the definition, we propose two strategies: first, to automatically find out the critical sentences in the definition; second, a ranking objective to force the model to generate the gold outputs with higher probabilities when those critical parts are highlighted in the definition. The joint efforts of the two strategies yield state-of-the-art performance on the challenging benchmark. Our code will be released in the final version of the paper.
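A minimal sketch of a margin-based ranking objective in the spirit of the second strategy, comparing the gold output's sequence log-likelihood with and without the critical sentences highlighted; the margin value and function signature are assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def ranking_loss(logp_highlighted: torch.Tensor,
                 logp_plain: torch.Tensor,
                 margin: float = 0.1) -> torch.Tensor:
    """Encourage the model to assign higher probability to the gold
    output when critical definition sentences are highlighted.
    Inputs: per-example log-probabilities of the gold output under the
    highlighted and plain definitions."""
    return F.relu(margin - (logp_highlighted - logp_plain)).mean()
```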
Adapting the NICT-JLE Corpus for Disfluency Detection Models
results: The study provides standardised train, held-out, and test sets for future research on disfluency detection.Abstract
The detection of disfluencies such as hesitations, repetitions and false starts commonly found in speech is a widely studied area of research. With a standardised process for evaluation using the Switchboard Corpus, model performance can be easily compared across approaches. This is not the case for disfluency detection research on learner speech, however, where such datasets have restricted access policies, making comparison and subsequent development of improved models more challenging. To address this issue, this paper describes the adaptation of the NICT-JLE corpus, containing approximately 300 hours of English learners' oral proficiency tests, to a format that is suitable for disfluency detection model training and evaluation. Points of difference between the NICT-JLE and Switchboard corpora are explored, followed by a detailed overview of adaptations to the tag set and meta-features of the NICT-JLE corpus. The result of this work provides a standardised train, heldout and test set for use in future research on disfluency detection for learner speech.
results: Our experimental results show that RadFM performs well across diverse medical tasks, outperforming existing multimodal foundation models. The code, data, and model checkpoints will all be released publicly to support further research and development in this field.Abstract
In this study, we aim to initiate the development of a Radiology Foundation Model, termed RadFM. We consider the construction of foundational models thoroughly from the perspectives of data, model design, and evaluation. Our contributions can be summarized as follows: (i) we construct a large-scale Medical Multi-modal Dataset, MedMD, consisting of 16M 2D and 3D medical scans. To the best of our knowledge, this is the first multi-modal dataset containing 3D medical scans. (ii) We propose an architecture that enables visually conditioned generative pre-training, allowing text input interleaved with 2D or 3D medical scans to generate responses for diverse radiologic tasks. The model was initially pre-trained on MedMD and subsequently fine-tuned on the domain-specific RadMD, a cleaned radiologic version of MedMD containing 3M radiologic visual-language pairs. (iii) We propose a new evaluation benchmark comprising five tasks, aiming to comprehensively assess the capability of foundation models in handling practical clinical problems. Our experimental results confirm that RadFM significantly outperforms existing multi-modal foundation models. The codes, data, and model checkpoint will all be made publicly available to promote further research and development in the field.
Legal Summarisation through LLMs: The PRODIGIT Project
results: According to an evaluation by expert tax judges and lawyers, the results obtained were judged satisfactory. On this basis, a prototype application is being built that will be made publicly available.Abstract
We present some initial results of a large-scale Italian project called PRODIGIT which aims to support tax judges and lawyers through digital technology, focusing on AI. We have focused on generation of summaries of judicial decisions and on the extraction of related information, such as the identification of legal issues and decision-making criteria, and the specification of keywords. To this end, we have deployed and evaluated different tools and approaches to extractive and abstractive summarisation. We have applied LLMs, and particularly on GPT4, which has enabled us to obtain results that proved satisfactory, according to an evaluation by expert tax judges and lawyers. On this basis, a prototype application is being built which will be made publicly available.
results: Our experimental analysis shows that the introduced reservoir model can reach the theoretical maximum short-term memory capacity. At the same time, compared to the standard ESN, ES$^2$N offers a better trade-off between memory and nonlinearity, as well as a significant improvement in autoregressive nonlinear modeling.Abstract
In this paper, we propose a new Reservoir Computing (RC) architecture, called the Edge of Stability Echo State Network (ES$^2$N). The introduced ES$^2$N model is based on defining the reservoir layer as a convex combination of a nonlinear reservoir (as in the standard ESN), and a linear reservoir that implements an orthogonal transformation. We provide a thorough mathematical analysis of the introduced model, proving that the whole eigenspectrum of the Jacobian of the ES2N map can be contained in an annular neighbourhood of a complex circle of controllable radius, and exploit this property to demonstrate that the ES$^2$N's forward dynamics evolves close to the edge-of-chaos regime by design. Remarkably, our experimental analysis shows that the newly introduced reservoir model is able to reach the theoretical maximum short-term memory capacity. At the same time, in comparison to standard ESN, ES$^2$N is shown to offer a favorable trade-off between memory and nonlinearity, as well as a significant improvement of performance in autoregressive nonlinear modeling.
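A minimal sketch of the state update suggested by the abstract, a convex combination of a standard nonlinear reservoir and an orthogonal linear map; the variable names and mixing parameter are one reading of the description, not necessarily the paper's exact formulation.

```python
import numpy as np

def es2n_step(x, u, W, W_in, O, beta=0.5):
    """One reservoir update:
        x' = (1 - beta) * O @ x + beta * tanh(W @ x + W_in @ u)
    with O orthogonal; beta in [0, 1] interpolates between the linear
    (orthogonal) and nonlinear (ESN-style) parts."""
    return (1.0 - beta) * O @ x + beta * np.tanh(W @ x + W_in @ u)

# An orthogonal O for an n-unit reservoir, e.g. via QR decomposition:
n = 100
O, _ = np.linalg.qr(np.random.randn(n, n))
```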
Textual Data Mining for Financial Fraud Detection: A Deep Learning Approach
results: The results show that these neural network models can detect financial fraud accurately, offering valuable insights to financial regulators, businesses, and researchers seeking more effective and robust fraud detection strategies.Abstract
In this report, I present a deep learning approach to conduct a natural language processing (hereafter NLP) binary classification task for analyzing financial-fraud texts. First, I searched for regulatory announcements and enforcement bulletins from HKEX news to define fraudulent companies and to extract their MD&A reports before I organized the sentences from the reports with labels and reporting time. My methodology involved different kinds of neural network models, including Multilayer Perceptrons with Embedding layers, vanilla Recurrent Neural Network (RNN), Long-Short Term Memory (LSTM), and Gated Recurrent Unit (GRU) for the text classification task. By utilizing this diverse set of models, I aim to perform a comprehensive comparison of their accuracy in detecting financial fraud. My results bring significant implications for financial fraud detection as this work contributes to the growing body of research at the intersection of deep learning, NLP, and finance, providing valuable insights for industry practitioners, regulators, and researchers in the pursuit of more robust and effective fraud detection methodologies.
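As an illustration of one of the compared architectures, here is a minimal LSTM sentence classifier of the kind the report evaluates; the embedding and hidden sizes are assumptions, not the report's settings.

```python
import torch
import torch.nn as nn

class FraudTextClassifier(nn.Module):
    """LSTM binary classifier over tokenised MD&A sentences."""
    def __init__(self, vocab_size: int, embed_dim: int = 128, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) integer word indices
        emb = self.embed(token_ids)
        _, (h_n, _) = self.lstm(emb)
        return self.head(h_n[-1]).squeeze(-1)  # fraud logit per sentence
```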
Elucidate Gender Fairness in Singing Voice Transcription
methods: The study uses different models and datasets and demonstrates the female superiority of SVT systems. Furthermore, the authors propose using an attribute predictor to predict gender labels and adversarially training the SVT system to enforce gender-invariance of acoustic representations.
results: Experimental results show that this approach reduces gender bias (by up to more than 50%) without affecting the overall performance of the transcription system.Abstract
It is widely known that males and females typically possess different sound characteristics when singing, such as timbre and pitch, but it has never been explored whether these gender-based characteristics lead to a performance disparity in singing voice transcription (SVT), whose target includes pitch. Such a disparity could cause fairness issues and severely affect the user experience of downstream SVT applications. Motivated by this, we first demonstrate the female superiority of SVT systems, which is observed across different models and datasets. We find that different pitch distributions, rather than gender data imbalance, contribute to this disparity. To address this issue, we propose using an attribute predictor to predict gender labels and adversarially training the SVT system to enforce the gender-invariance of acoustic representations. Leveraging the prior knowledge that pitch distributions may contribute to the gender bias, we propose conditionally aligning acoustic representations between demographic groups by feeding note events to the attribute predictor. Empirical experiments on multiple benchmark SVT datasets show that our method significantly reduces gender bias (up to more than 50%) with negligible degradation of overall SVT performance, on both in-domain and out-of-domain singing data, thus offering a better fairness-utility trade-off.
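One common way to realise the adversarial invariance objective is a gradient reversal layer in front of the attribute predictor; the sketch below shows that pattern as a plausible reading of the abstract, not necessarily the authors' exact training code.

```python
import torch
import torch.nn.functional as F
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass; reversed, scaled gradient backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def adversarial_step(features, gender_labels, predictor, lam=1.0):
    # The predictor learns to classify gender from acoustic features;
    # the reversed gradient pushes the upstream encoder toward
    # gender-invariant representations.
    logits = predictor(GradReverse.apply(features, lam))
    return F.cross_entropy(logits, gender_labels)
```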
Physics-informed Gaussian process model for Euler-Bernoulli beam elements
methods: The model takes a multi-output Gaussian process approach, is formulated using the Euler-Bernoulli beam equation, and learns the model parameters in a data-driven way.
results: The researchers update the model parameters in a data-driven way and use the Mahalanobis distance to reason about the possible location and extent of damage in the structural system. The influence of measurement noise on prediction quality is also investigated.Abstract
A physics-informed machine learning model, in the form of a multi-output Gaussian process, is formulated using the Euler-Bernoulli beam equation. Given appropriate datasets, the model can be used to regress the analytical value of the structure's bending stiffness, interpolate responses, and make probabilistic inferences on latent physical quantities. The developed model is applied on a numerically simulated cantilever beam, where the regressed bending stiffness is evaluated and the influence measurement noise on the prediction quality is investigated. Further, the regressed probabilistic stiffness distribution is used in a structural health monitoring context, where the Mahalanobis distance is employed to reason about the possible location and extent of damage in the structural system. To validate the developed framework, an experiment is conducted and measured heterogeneous datasets are used to update the assumed analytical structural model.
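For reference, the Euler-Bernoulli beam equation around which the multi-output GP is formulated (standard static form, with $w$ the deflection, $EI$ the bending stiffness, and $q$ the distributed load):

$$ EI\,\frac{\mathrm{d}^4 w(x)}{\mathrm{d}x^4} = q(x). $$

Because differentiation is a linear operator, a GP prior on $w$ induces a joint GP over $w$ and $q$, which is what allows the bending stiffness to be regressed and latent physical quantities to be inferred probabilistically.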
results: The results show that the method enables secure communication of private information while lowering the eavesdroppers' attack success rate. The approach is validated on the CIFAR-10 dataset and over different communication channels.Abstract
In this paper, a generalization of the deep learning-aided joint source channel coding (Deep-JSCC) approach to secure communications is studied. We propose an end-to-end (E2E) learning-based approach for secure communication against multiple eavesdroppers over complex-valued fading channels. Both scenarios of colluding and non-colluding eavesdroppers are studied. Under the colluding strategy, eavesdroppers share their logits to collaboratively infer private attributes using an ensemble learning method, while in the non-colluding setup they act alone. The goal is to prevent eavesdroppers from inferring private (sensitive) information about the transmitted images, while delivering the images to a legitimate receiver with minimum distortion. By generalizing the ideas of the privacy funnel and wiretap channel coding, the trade-off between image recovery at the legitimate node and information leakage to the eavesdroppers is characterized. To solve this secrecy funnel framework, we implement deep neural networks (DNNs) to realize a data-driven secure communication scheme, without relying on a specific data distribution. Simulations over the CIFAR-10 dataset verify the secrecy-utility trade-off. The adversarial accuracy of eavesdroppers is also studied over Rayleigh fading, Nakagami-m, and AWGN channels to verify the generalization of the proposed scheme. Our experiments show that employing the proposed secure neural encoding can decrease the adversarial accuracy by 28%.
Private Federated Learning with Dynamic Power Control via Non-Coherent Over-the-Air Computation
methods: Edge devices (EDs) transmit the signs of their local stochastic gradients by activating two adjacent orthogonal frequency division multiplexing (OFDM) subcarriers, and the edge server (ES) then obtains majority votes (MVs) by exploiting the energy accumulated on the subcarriers.
results: A dynamic power control algorithm is proposed to offset the bias in the aggregated MV values, and the whole scheme is shown to mitigate the impact of time synchronization errors, channel fading, and noise.Abstract
To further preserve model weight privacy and improve model performance in Federated Learning (FL), an FL via Over-the-Air Computation (AirComp) scheme based on dynamic power control is proposed. The edge devices (EDs) transmit the signs of local stochastic gradients by activating two adjacent orthogonal frequency division multiplexing (OFDM) subcarriers, and majority votes (MVs) at the edge server (ES) are obtained by exploiting the energy accumulation on the subcarriers. Then, we propose a dynamic power control algorithm to further offset the bias in the aggregated MV values. We show that the whole scheme can mitigate the impact of the time synchronization error, channel fading and noise. The theoretical convergence proof of the scheme is re-derived.
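Abstracting away the OFDM signalling and power control, the aggregation rule amounts to an element-wise majority vote over sign gradients; a minimal sketch (variable names and learning rate are assumptions):

```python
import numpy as np

def majority_vote_update(sign_grads: np.ndarray, lr: float = 1e-3) -> np.ndarray:
    """sign_grads: (n_devices, n_params) array of gradient signs in {-1, +1}.
    The server applies the element-wise majority as the global update."""
    votes = sign_grads.sum(axis=0)      # positive => majority voted +1
    return -lr * np.sign(votes)

# Example: 5 devices, 4 parameters
g = np.random.randn(5, 4)
delta = majority_vote_update(np.sign(g))
```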
results: The study finds that applying meta-learning in healthcare can improve model capability and address various healthcare challenges, such as insufficient data and domain shifts. However, several challenges remain in meta-learning research, such as the need for more diverse and high-quality datasets and a better understanding of the underlying mechanisms of meta-learning.Abstract
As a subset of machine learning, meta-learning, or learning to learn, aims at improving the model's capabilities by employing prior knowledge and experience. A meta-learning paradigm can appropriately tackle the conventional challenges of traditional learning approaches, such as insufficient number of samples, domain shifts, and generalization. These unique characteristics position meta-learning as a suitable choice for developing influential solutions in various healthcare contexts, where the available data is often insufficient, and the data collection methodologies are different. This survey discusses meta-learning broad applications in the healthcare domain to provide insight into how and where it can address critical healthcare challenges. We first describe the theoretical foundations and pivotal methods of meta-learning. We then divide the employed meta-learning approaches in the healthcare domain into two main categories of multi/single-task learning and many/few-shot learning and survey the studies. Finally, we highlight the current challenges in meta-learning research, discuss the potential solutions and provide future perspectives on meta-learning in healthcare.
Data-Based Design of Multi-Model Inferential Sensors
results: The researchers design a multi-model inferential (soft) sensor, evaluate it comparatively, and show improved performance over existing single-model inferential sensors and the current (referential) inferential sensor used in the refinery.Abstract
This paper deals with the problem of inferential (soft) sensor design. The nonlinear character of industrial processes is usually the main limitation to designing simple linear inferential sensors with sufficient accuracy. In order to increase the inferential sensor predictive performance and yet to maintain its linear structure, multi-model inferential sensors represent a straightforward option. In this contribution, we propose two novel approaches for the design of multi-model inferential sensors aiming to mitigate some drawbacks of the state-of-the-art approaches. For a demonstration of the developed techniques, we design inferential sensors for a Vacuum Gasoil Hydrogenation unit, which is a real-world petrochemical refinery unit. The performance of the multi-model inferential sensor is compared against various single-model inferential sensors and the current (referential) inferential sensor used in the refinery. The results show substantial improvements over the state-of-the-art design techniques for single-/multi-model inferential sensors.
NP-SemiSeg: When Neural Processes meet Semi-Supervised Semantic Segmentation
results: Experiments on the public benchmarks PASCAL VOC 2012 and Cityscapes, under different training settings, demonstrate the effectiveness of the NP-SemiSeg model.Abstract
Semi-supervised semantic segmentation involves assigning pixel-wise labels to unlabeled images at training time. This is useful in a wide range of real-world applications where collecting pixel-wise labels is not feasible in time or cost. Current approaches to semi-supervised semantic segmentation work by predicting pseudo-labels for each pixel from a class-wise probability distribution output by a model. If the predicted probability distribution is incorrect, however, this leads to poor segmentation results, which can have knock-on consequences in safety critical systems, like medical images or self-driving cars. It is, therefore, important to understand what a model does not know, which is mainly achieved by uncertainty quantification. Recently, neural processes (NPs) have been explored in semi-supervised image classification, and they have been a computationally efficient and effective method for uncertainty quantification. In this work, we move one step forward by adapting NPs to semi-supervised semantic segmentation, resulting in a new model called NP-SemiSeg. We experimentally evaluated NP-SemiSeg on the public benchmarks PASCAL VOC 2012 and Cityscapes, with different training settings, and the results verify its effectiveness.
Generative Adversarial Networks for Stain Normalisation in Histopathology
results: However, different GAN and non-GAN methods can outperform each other in different scenarios and according to different performance metrics, making this an active and growing area of research.Abstract
The rapid growth of digital pathology in recent years has provided an ideal opportunity for the development of artificial intelligence-based tools to improve the accuracy and efficiency of clinical diagnoses. One of the significant roadblocks to current research is the high level of visual variability across digital pathology images, causing models to generalise poorly to unseen data. Stain normalisation aims to standardise the visual profile of digital pathology images without changing the structural content of the images. In this chapter, we explore different techniques which have been used for stain normalisation in digital pathology, with a focus on approaches which utilise generative adversarial networks (GANs). Typically, GAN-based methods outperform non-generative approaches but at the cost of much greater computational requirements. However, it is not clear which method is best for stain normalisation in general, with different GAN and non-GAN approaches outperforming each other in different scenarios and according to different performance metrics. This is an ongoing field of study as researchers aim to identify a method which efficiently and effectively normalises pathology images to make AI models more robust and generalisable.
Approximating Positive Homogeneous Functions with Scale Invariant Neural Networks
paper_authors: Stefan Bamberger, Reinhard Heckel, Felix Krahmer
for: investigate the possibility of solving linear inverse problems with $ReLu$ networks
methods: use positive homogeneous functions and neural networks to recover sparse vectors and low-rank matrices
results: show that $ReLu$ networks with two hidden layers can approximately recover sparse vectors with arbitrary precision and low-rank matrices in a stable way, and establish new results on the approximation of general positive homogeneous functions with neural networks.Abstract
We investigate to what extent it is possible to solve linear inverse problems with $ReLu$ networks. Due to the scaling invariance arising from the linearity, an optimal reconstruction function $f$ for such a problem is positive homogeneous, i.e., satisfies $f(\lambda x) = \lambda f(x)$ for all non-negative $\lambda$. In a $ReLu$ network, this condition translates to considering networks without bias terms. We first consider recovery of sparse vectors from few linear measurements. We prove that $ReLu$- networks with only one hidden layer cannot even recover $1$-sparse vectors, not even approximately, and regardless of the width of the network. However, with two hidden layers, approximate recovery with arbitrary precision and arbitrary sparsity level $s$ is possible in a stable way. We then extend our results to a wider class of recovery problems including low-rank matrix recovery and phase retrieval. Furthermore, we also consider the approximation of general positive homogeneous functions with neural networks. Extending previous work, we establish new results explaining under which conditions such functions can be approximated with neural networks. Our results also shed some light on the seeming contradiction between previous works showing that neural networks for inverse problems typically have very large Lipschitz constants, but still perform very well also for adversarial noise. Namely, the error bounds in our expressivity results include a combination of a small constant term and a term that is linear in the noise level, indicating that robustness issues may occur only for very small noise levels.
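The positive-homogeneity property of bias-free ReLU networks follows directly from the activation itself: for $\lambda \ge 0$,

$$ \sigma(\lambda x) = \max(0, \lambda x) = \lambda \max(0, x) = \lambda\,\sigma(x), $$

so a bias-free network $f(x) = W_L\,\sigma\big(W_{L-1}\cdots\sigma(W_1 x)\big)$ inherits $f(\lambda x) = \lambda f(x)$ layer by layer, which is why dropping bias terms is exactly the right restriction for these inverse problems.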
Reinforcement Learning for Financial Index Tracking
results: Empirical results show that the proposed method outperforms a benchmark method in tracking accuracy and can earn extra profit through a cash withdrawal strategy.Abstract
We propose the first discrete-time infinite-horizon dynamic formulation of the financial index tracking problem under both return-based and value-based tracking error. The formulation overcomes the limitations of existing models by incorporating the intertemporal dynamics of market information variables not limited to prices, allowing exact calculation of transaction costs, accounting for the trade-off between overall tracking error and transaction costs, and allowing effective use of data over a long time period. The formulation also allows novel decision variables of cash injection or withdrawal. We propose to solve the portfolio rebalancing equation using a Banach fixed-point iteration, which allows accurate calculation of the transaction costs specified in practice as nonlinear functions of trading volumes. We propose an extension of the deep reinforcement learning (RL) method to solve the dynamic formulation. Our RL method resolves the issue of data limitation, resulting from the availability of only a single sample path of financial data, via a novel training scheme. A comprehensive empirical study based on a 17-year-long testing set demonstrates that the proposed method outperforms a benchmark method in terms of tracking accuracy and has the potential for earning extra profit through a cash withdrawal strategy.
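A generic Banach fixed-point iteration of the kind used to solve the rebalancing equation is sketched below; the concrete rebalancing map with nonlinear transaction costs is the paper's, so a toy scalar contraction stands in for it here.

```python
def fixed_point(T, x0, tol=1e-10, max_iter=1000):
    """Iterate x_{k+1} = T(x_k) until convergence; converges to the
    unique fixed point whenever T is a contraction (Banach's theorem)."""
    x = x0
    for _ in range(max_iter):
        x_next = T(x)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    return x

# Toy contraction T(x) = 0.5 * x + 1 with fixed point x* = 2:
print(fixed_point(lambda x: 0.5 * x + 1, x0=0.0))  # ~2.0
```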
A generative model for surrogates of spatial-temporal wildfire nowcasting
results: Experiments show that the model generates coherent and structured wildfire scenarios that account for geophysical variables such as vegetation and slope. The generated data are also used to train a wildfire spread prediction model, tested on both simulation data and the real Chimney fire event.Abstract
Recent increase in wildfires worldwide has led to the need for real-time fire nowcasting. Physics-driven models, such as cellular automata and computational fluid dynamics can provide high-fidelity fire spread simulations but they are computationally expensive and time-consuming. Much effort has been put into developing machine learning models for fire prediction. However, these models are often region-specific and require a substantial quantity of simulation data for training purpose. This results in a significant amount of computational effort for different ecoregions. In this work, a generative model is proposed using a three-dimensional Vector-Quantized Variational Autoencoders to generate spatial-temporal sequences of unseen wildfire burned areas in a given ecoregion. The model is tested in the ecoregion of a recent massive wildfire event in California, known as the Chimney fire. Numerical results show that the model succeed in generating coherent and structured fire scenarios, taking into account the impact from geophysical variables, such as vegetation and slope. Generated data are also used to train a surrogate model for predicting wildfire dissemination, which has been tested on both simulation data and the real Chimney fire event.
MiAMix: Enhancing Image Classification through a Multi-stage Augmented Mixed Sample Data Augmentation Method
methods: The paper introduces a new multi-stage mixed sample data augmentation method (MiAMix) that integrates image augmentation into the mixup framework and applies multiple diversified mixing methods concurrently.
results: A comprehensive evaluation on four image benchmarks shows that MiAMix improves performance without heavy computational overhead.Abstract
Despite substantial progress in the field of deep learning, overfitting persists as a critical challenge, and data augmentation has emerged as a particularly promising approach due to its capacity to enhance model generalization in various computer vision tasks. While various strategies have been proposed, Mixed Sample Data Augmentation (MSDA) has shown great potential for enhancing model performance and generalization. We introduce a novel mixup method called MiAMix, which stands for Multi-stage Augmented Mixup. MiAMix integrates image augmentation into the mixup framework, utilizes multiple diversified mixing methods concurrently, and improves the mixing method by randomly selecting mixing mask augmentation methods. Recent methods utilize saliency information and the MiAMix is designed for computational efficiency as well, reducing additional overhead and offering easy integration into existing training pipelines. We comprehensively evaluate MiaMix using four image benchmarks and pitting it against current state-of-the-art mixed sample data augmentation techniques to demonstrate that MIAMix improves performance without heavy computational overhead.
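For context, here is plain mixup, the baseline that MiAMix extends with multi-stage augmentation and randomly selected mixing-mask methods; this is the standard formulation, not MiAMix itself.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha: float = 1.0):
    """Blend two training examples and their one-hot labels with a
    Beta-distributed mixing coefficient."""
    lam = np.random.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2   # blended image
    y = lam * y1 + (1.0 - lam) * y2   # blended soft label
    return x, y
```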
OBESEYE: Interpretable Diet Recommender for Obesity Management using Machine Learning and Explainable AI
For: The paper aims to develop a novel machine learning-based system to predict the amount of nutrients an individual requires to stay healthy, with a focus on patients with comorbidities.
Methods: The authors applied different machine learning algorithms, including linear regression, support vector machine (SVM), decision tree, random forest, XGBoost, and LightGBM, to predict fluid, carbohydrate, protein, and fat consumption.
Results: The authors achieved high accuracy with low root mean square error (RMSE) using linear regression for fluid prediction, random forest for carbohydrate prediction, and LightGBM for protein and fat prediction. They also developed a diet recommender system, OBESEYE, which considers comorbidities and physical conditions and encourages users to overcome obesity.
Abstract
Obesity, the leading cause of many non-communicable diseases, occurs mainly from eating more than our bodies require and from a lack of proper activity. Staying healthy therefore requires a healthy diet plan, especially for patients with comorbidities. But it is difficult to figure out the exact quantity of each nutrient, because nutrient requirements vary with physical and disease conditions. In our study we propose a novel machine learning-based system to predict the amount of nutrients an individual requires to be healthy. We applied different machine learning algorithms (linear regression, support vector machine (SVM), decision tree, random forest, XGBoost, and LightGBM) to predict the consumption of fluid and the three major macronutrients: carbohydrate, protein, and fat. We achieved high accuracy with low root mean square error (RMSE) using linear regression for fluid prediction, random forest for carbohydrate prediction, and LightGBM for protein and fat prediction. We believe our diet recommender system, OBESEYE, is the only one of its kind that recommends diets with consideration of comorbidities and physical conditions, and that encourages users to overcome obesity.
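As a loose illustration of the modeling recipe described above, the sketch below compares two of the named regressors by RMSE on synthetic data; the features and dataset are stand-ins, not the study's patient data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the patient dataset: real features would be
# age, weight, activity level, comorbidity indicators, etc.
X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for name, model in [("linear", LinearRegression()),
                    ("random_forest", RandomForestRegressor(random_state=0))]:
    model.fit(X_tr, y_tr)
    rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    print(f"{name}: RMSE={rmse:.2f}")
```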
OrcoDCS: An IoT-Edge Orchestrated Online Deep Compressed Sensing Framework
results: Analytical and experimental results show that OrcoDCS outperforms the state-of-the-art DCDA in training time, flexibility, and adaptability, and achieves higher performance for follow-up applications.
Abstract
Compressed data aggregation (CDA) over wireless sensor networks (WSNs) is task-specific and subject to environmental changes. However, existing CDA frameworks (e.g., compressed sensing-based data aggregation and deep learning (DL)-based data aggregation) do not possess the flexibility and adaptivity required to handle distinct sensing tasks and environmental changes. Additionally, they do not consider the performance of follow-up IoT data-driven DL-based applications. To address these shortcomings, we propose OrcoDCS, an IoT-Edge orchestrated online deep compressed sensing framework that offers high flexibility and adaptability to distinct IoT device groups and their sensing tasks, as well as high performance for follow-up applications. The novelty of our work is the design and deployment of an IoT-Edge orchestrated online training framework over WSNs that leverages a specially-designed asymmetric autoencoder, which can largely reduce the encoding overhead and improve reconstruction performance and robustness. We show analytically and empirically that OrcoDCS outperforms the state-of-the-art DCDA on training time, significantly improves flexibility and adaptability when distinct reconstruction tasks are given, and achieves higher performance for follow-up applications.
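The abstract's key design point is the asymmetric autoencoder: a cheap encoder on the sensing device and a heavier decoder at the edge. A minimal PyTorch sketch of that split follows; the layer sizes and the 256-to-32 compression ratio are illustrative assumptions.

```python
import torch.nn as nn

class AsymmetricAE(nn.Module):
    """Lightweight encoder (runs on the sensor) paired with a much larger
    decoder (runs on the edge server). Layer sizes are illustrative only."""
    def __init__(self, n=256, m=32):
        super().__init__()
        # a single linear layer keeps the encoding cost low on IoT devices
        self.encoder = nn.Linear(n, m)
        # a deeper decoder shifts the reconstruction cost to the edge
        self.decoder = nn.Sequential(
            nn.Linear(m, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, n),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
```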
Semi-supervised Contrastive Regression for Estimation of Eye Gaze
paper_authors: Somsukla Maiti, Akshansh Gupta
for: This work develops a semi-supervised contrastive learning framework for gaze direction estimation.
methods: The framework uses a deep learning model and proposes a new contrastive loss that strengthens the agreement between similar samples.
results: With a small labeled gaze dataset, the framework learns a generalized solution even for unseen face images and performs well compared with several existing contrastive learning techniques.
Abstract
With the escalating demand for human-machine interfaces in intelligent systems, the development of gaze-controlled systems has become a necessity. Gaze, being a non-intrusive form of human interaction, is one of the best-suited approaches. Appearance-based deep learning models are the most widely used for gaze estimation, but their performance is heavily influenced by the size of the labeled gaze dataset, which in turn limits generalization. This paper aims to develop a semi-supervised contrastive learning framework for estimation of gaze direction. With a small labeled gaze dataset, the framework is able to find a generalized solution even for unseen face images. We propose a new contrastive loss paradigm that maximizes the similarity agreement between similar images while reducing the redundancy in embedding representations. Our contrastive regression framework performs well in comparison to several state-of-the-art contrastive learning techniques used for gaze estimation.
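For reference, a standard contrastive objective of the kind the paper builds on is the NT-Xent loss from SimCLR (Chen et al., 2020); the sketch below shows this similarity-agreement term only, not the paper's added redundancy-reduction component.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent loss over two augmented views z1, z2 of the same batch.

    z1, z2: (N, D) embeddings; the positive for sample i in view 1 is
    sample i in view 2, and vice versa.
    """
    z = F.normalize(torch.cat([z1, z2]), dim=1)   # 2N unit-norm embeddings
    sim = z @ z.t() / tau                          # scaled cosine similarities
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool)
    sim.masked_fill_(mask, float("-inf"))          # exclude self-similarity
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)
```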
Dataopsy: Scalable and Fluid Visual Exploration using Aggregate Query Sculpting
results: Two case studies and three application examples demonstrate the scalability and flexibility of AQS.
Abstract
We present aggregate query sculpting (AQS), a faceted visual query technique for large-scale multidimensional data. As a "born scalable" query technique, AQS starts visualization with a single visual mark representing an aggregation of the entire dataset. The user can then progressively explore the dataset through a sequence of operations abbreviated as P6: pivot (facet an aggregate based on an attribute), partition (lay out a facet in space), peek (see inside a subset using an aggregate visual representation), pile (merge two or more subsets), project (extracting a subset into a new substrate), and prune (discard an aggregate not currently of interest). We validate AQS with Dataopsy, a prototype implementation of AQS that has been designed for fluid interaction on desktop and touch-based mobile devices. We demonstrate AQS and Dataopsy using two case studies and three application examples.
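As a loose analogy for the pivot operation, the pandas sketch below starts from a single aggregate of a toy table and then facets it on one attribute; Dataopsy's actual substrate and interaction model are of course richer than this.

```python
import pandas as pd

# Toy stand-in for a large multidimensional table.
df = pd.DataFrame({
    "year":   [2020, 2020, 2021, 2021, 2021],
    "region": ["EU", "US", "EU", "US", "US"],
    "value":  [1.0, 2.0, 3.0, 4.0, 5.0],
})

# AQS starts from a single visual mark: one aggregate over everything ...
root = df["value"].agg(["count", "mean"])

# ... and "pivot" facets that aggregate on an attribute, e.g. region.
facets = df.groupby("region")["value"].agg(["count", "mean"])
print(root, facets, sep="\n\n")
```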
Neural Collapse in the Intermediate Hidden Layers of Classification Neural Networks
paper_authors: Liam Parker, Emre Onal, Anton Stengel, Jake Intrater
for: This paper studies the Neural Collapse (NC) phenomenon in the intermediate hidden layers of classification neural networks.
methods: The authors examine a variety of network architectures, activation functions, and datasets to study how NC emerges across layers.
results: Some degree of NC emerges in most intermediate hidden layers, and the degree of collapse in a given layer is positively correlated with its depth. Moreover, most of the reduction in intra-class variance occurs in the shallower layers, the angular separation between class means increases with depth, and simple datasets require only the shallower layers to be fully learned, whereas harder ones require the entire network. These results provide granular insight into feature propagation through classification neural networks.
Neural Collapse (NC) gives a precise description of the representations of classes in the final hidden layer of classification neural networks. This description provides insights into how these networks learn features and generalize well when trained past zero training error. However, to date, NC has only been studied in the final layer of these networks. In the present paper, we provide the first comprehensive empirical analysis of the emergence of NC in the intermediate hidden layers of these classifiers. We examine a variety of network architectures, activations, and datasets, and demonstrate that some degree of NC emerges in most of the intermediate hidden layers of the network, where the degree of collapse in any given layer is typically positively correlated with the depth of that layer in the neural network. Moreover, we remark that: (1) almost all of the reduction in intra-class variance in the samples occurs in the shallower layers of the networks, (2) the angular separation between class means increases consistently with hidden layer depth, and (3) simple datasets require only the shallower layers of the networks to fully learn them, whereas more difficult ones require the entire network. Ultimately, these results provide granular insights into the structural propagation of features through classification neural networks.
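The layer-wise quantities tracked above can be summarized with two simple diagnostics: within-class variation relative to between-class variation, and the mean pairwise cosine between centred class means. A minimal PyTorch sketch, assuming a (num_samples, feature_dim) activation matrix per layer:

```python
import torch
import torch.nn.functional as F

def neural_collapse_stats(feats, labels):
    """Return (within/between variation ratio, mean off-diagonal cosine
    between centred class means) for one layer's activations."""
    classes = labels.unique()
    mu_g = feats.mean(0)                                  # global mean
    means = torch.stack([feats[labels == c].mean(0) for c in classes])
    within = torch.stack([feats[labels == c].var(dim=0, unbiased=False).sum()
                          for c in classes]).mean()       # avg within-class trace
    between = ((means - mu_g) ** 2).sum(1).mean()         # between-class spread
    centred = F.normalize(means - mu_g, dim=1)
    cos = centred @ centred.t()
    n = len(classes)
    off_diag = cos[~torch.eye(n, dtype=torch.bool)]
    return (within / between).item(), off_diag.mean().item()
```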
WeldMon: A Cost-effective Ultrasonic Welding Machine Condition Monitoring System
results: Experiments show that WeldMon enables high-performance, reliable tool condition monitoring that is more efficient and dependable than existing monitoring approaches; a comparison against a commercial system deployed on an actual welding machine reveals performance differences on some tasks.
Abstract
Ultrasonic welding machines play a critical role in the lithium battery industry, facilitating the bonding of batteries with conductors. Ensuring high-quality welding is vital, making tool condition monitoring systems essential for early-stage quality control. However, existing monitoring methods face challenges in cost, downtime, and adaptability. In this paper, we present WeldMon, an affordable ultrasonic welding machine condition monitoring system that utilizes a custom data acquisition system and a data analysis pipeline designed for real-time analysis. Our classification algorithm combines auto-generated features and hand-crafted features, achieving superior cross-validation accuracy (95.8% on average over all testing tasks) compared to the state-of-the-art method (92.5%) in condition classification tasks. Our data augmentation approach alleviates the concept drift problem, enhancing tool condition classification accuracy by 8.3%. All algorithms run locally, requiring only 385 milliseconds to process data for each welding cycle. We deploy WeldMon and a commercial system on an actual ultrasonic welding machine, performing a comprehensive comparison. Our findings highlight the potential for developing cost-effective, high-performance, and reliable tool condition monitoring systems.
DaMSTF: Domain Adversarial Learning Enhanced Meta Self-Training for Domain Adaptation
paper_authors: Menglong Lu, Zhen Huang, Yunxiang Zhao, Zhiliang Tian, Yang Liu, Dongsheng Li
For: This work studies self-training for domain adaptation, in particular using the model's predictions as pseudo labels for target-domain data.
Methods: A new self-training framework, the Domain adversarial learning enhanced Self-Training Framework (DaMSTF), which uses meta-learning to estimate the importance of each pseudo instance so as to simultaneously reduce label noise and preserve hard examples.
Results: Theory and experiments demonstrate the effectiveness of DaMSTF; on the cross-domain sentiment classification task, it improves the performance of BERT by an average of nearly 4%.
Abstract
Self-training emerges as an important research line on domain adaptation. By taking the model's prediction as the pseudo labels of the unlabeled data, self-training bootstraps the model with pseudo instances in the target domain. However, the prediction errors of pseudo labels (label noise) challenge the performance of self-training. To address this problem, previous approaches only use reliable pseudo instances, i.e., pseudo instances with high prediction confidence, to retrain the model. Although these strategies effectively reduce the label noise, they are prone to miss the hard examples. In this paper, we propose a new self-training framework for domain adaptation, namely Domain adversarial learning enhanced Self-Training Framework (DaMSTF). Firstly, DaMSTF involves meta-learning to estimate the importance of each pseudo instance, so as to simultaneously reduce the label noise and preserve hard examples. Secondly, we design a meta constructor for constructing the meta-validation set, which guarantees the effectiveness of the meta-learning module by improving the quality of the meta-validation set. Thirdly, we find that the meta-learning module suffers from the training guidance vanishment and tends to converge to an inferior optimal. To this end, we employ domain adversarial learning as a heuristic neural network initialization method, which can help the meta-learning module converge to a better optimal. Theoretically and experimentally, we demonstrate the effectiveness of the proposed DaMSTF. On the cross-domain sentiment classification task, DaMSTF improves the performance of BERT with an average of nearly 4%.
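For contrast with DaMSTF's meta-learned instance weights, the sketch below shows the confidence-thresholded self-training baseline the paper improves on; the threshold value and training-loop details are illustrative.

```python
import torch
import torch.nn.functional as F

def pseudo_label_step(model, optimizer, unlabeled_x, threshold=0.9):
    """One confidence-filtered self-training step: only pseudo instances
    with high prediction confidence are used for retraining (DaMSTF
    replaces this hard filter with meta-learned per-instance weights)."""
    model.eval()
    with torch.no_grad():
        probs = F.softmax(model(unlabeled_x), dim=1)
        conf, pseudo_y = probs.max(dim=1)
    keep = conf > threshold                   # reliable pseudo instances only
    if keep.sum() == 0:
        return 0.0
    model.train()
    loss = F.cross_entropy(model(unlabeled_x[keep]), pseudo_y[keep])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```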
For: The paper is written for those interested in 3D representation and view synthesis, particularly in the context of Neural Radiance Fields (NeRFs) and their applications in computer graphics and vision.
Methods: The paper takes a historical perspective to review the development of NeRFs, and describes the NeRF representation as a continuous volume with view-dependent radiance and volume density obtained by querying a neural network.
Results: The paper describes the widespread adoption of NeRFs in the field, with thousands of papers extending or building on the original work, and numerous industrial applications and startup companies. It also provides some observations and insights regarding the future of 3D representations.
Abstract
Neural Radiance Fields or NeRFs have become the representation of choice for problems in view synthesis or image-based rendering, as well as in many other applications across computer graphics and vision, and beyond. At their core, NeRFs describe a new representation of 3D scenes or 3D geometry. Instead of meshes, disparity maps, multiplane images or even voxel grids, they represent the scene as a continuous volume, with volumetric parameters like view-dependent radiance and volume density obtained by querying a neural network. The NeRF representation has now been widely used, with thousands of papers extending or building on it every year, multiple authors and websites providing overviews and surveys, and numerous industrial applications and startup companies. In this article, we briefly review the NeRF representation, and describe the three decades-long quest to find the best 3D representation for view synthesis and related problems, culminating in the NeRF papers. We then describe new developments in terms of NeRF representations and make some observations and insights regarding the future of 3D representations.
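For reference, the volume rendering equation at the heart of the NeRF representation (the color of a ray as transmittance-weighted, view-dependent radiance) is:

```latex
% Expected color of camera ray r(t) = o + t d between near/far bounds t_n, t_f:
C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\, \sigma(\mathbf{r}(t))\,
    \mathbf{c}(\mathbf{r}(t), \mathbf{d})\, dt,
\qquad
T(t) = \exp\!\left( -\int_{t_n}^{t} \sigma(\mathbf{r}(s))\, ds \right)
```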
Exploiting On-chip Heterogeneity of Versal Architecture for GNN Inference Acceleration
results: Across different models and datasets, our implementation on the VCK5000 ACAP platform is faster than implementations on CPU, GPU, ACAP, and other custom GNN accelerators. For Graph Convolutional Network (GCN) inference, our approach is 3.9-96.7x faster than designs using the PL only on the same ACAP device.
Abstract
Graph Neural Networks (GNNs) have revolutionized many Machine Learning (ML) applications, such as social network analysis, bioinformatics, etc. GNN inference can be accelerated by exploiting data sparsity in the input graph, vertex features, and intermediate data in GNN computations. For dynamic sparsity exploitation, we leverage the heterogeneous computing capabilities of AMD Versal ACAP architecture to accelerate GNN inference. We develop a custom hardware module that executes the sparse primitives of the computation kernel on the Programmable Logic (PL) and efficiently computes the dense primitives using the AI Engine (AIE). To exploit data sparsity during inference, we devise a runtime kernel mapping strategy that dynamically assigns computation tasks to the PL and AIE based on data sparsity. Our implementation on the VCK5000 ACAP platform leads to superior performance compared with the state-of-the-art implementations on CPU, GPU, ACAP, and other custom GNN accelerators. Compared with these implementations, we achieve significant average runtime speedup across various models and datasets of 162.42x, 17.01x, 9.90x, and 27.23x, respectively. Furthermore, for Graph Convolutional Network (GCN) inference, our approach leads to a speedup of 3.9-96.7x compared to designs using PL only on the same ACAP device.
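As a software-level analogy for the runtime kernel mapping idea (sparse primitives on one engine, dense primitives on another), the sketch below dispatches a matrix product based on measured sparsity; the threshold and the NumPy/SciPy kernels merely stand in for the PL and AIE paths.

```python
import numpy as np
import scipy.sparse as sp

def spmm_dispatch(a, x, sparsity_threshold=0.7):
    """Route a matrix product to a sparse kernel (analogous to the PL) or a
    dense kernel (analogous to the AIE) based on the operand's sparsity.
    The threshold value is illustrative."""
    sparsity = 1.0 - np.count_nonzero(a) / a.size
    if sparsity > sparsity_threshold:
        return sp.csr_matrix(a) @ x   # sparse primitive
    return a @ x                      # dense primitive
```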
results: Works well in non-IID settings, outperforms the baseline algorithm even in benign settings, and is shown both theoretically and empirically to be robust against data/model poisoning attacks.
Abstract
We introduce SABRE, a novel framework for robust variational Bayesian peer-to-peer federated learning. We analyze the robustness of the known variational Bayesian peer-to-peer federated learning framework (BayP2PFL) against poisoning attacks and subsequently show that BayP2PFL is not robust against those attacks. The new SABRE aggregation methodology is then devised to overcome the limitations of the existing frameworks. SABRE works well in non-IID settings, does not require the majority of the benign nodes over the compromised ones, and even outperforms the baseline algorithm in benign settings. We theoretically prove the robustness of our algorithm against data / model poisoning attacks in a decentralized linear regression setting. Proof-of-Concept evaluations on benchmark data from image classification demonstrate the superiority of SABRE over the existing frameworks under various poisoning attacks.
Meta-Tsallis-Entropy Minimization: A New Self-Training Approach for Domain Adaptation on Text Classification
results: Experiments show that MTEM improves the adaptation performance of BERT by an average of 4 percent.
Abstract
Text classification is a fundamental task for natural language processing, and adapting text classification models across domains has broad applications. Self-training generates pseudo-examples from the model's predictions and iteratively trains on the pseudo-examples, i.e., minimizes the loss on the source domain and the Gibbs entropy on the target domain. However, Gibbs entropy is sensitive to prediction errors, and thus self-training tends to fail when the domain shift is large. In this paper, we propose Meta-Tsallis Entropy minimization (MTEM), which applies a meta-learning algorithm to optimize the instance-adaptive Tsallis entropy on the target domain. To reduce the computation cost of MTEM, we propose an approximation technique for the second-order derivatives involved in the meta-learning. To efficiently generate pseudo labels, we propose an annealing sampling mechanism for exploring the model's prediction probability. Theoretically, we prove the convergence of the meta-learning algorithm in MTEM and analyze the effectiveness of MTEM in achieving domain adaptation. Experimentally, MTEM improves the adaptation performance of BERT by an average of 4 percent on the benchmark dataset.
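For reference, the Tsallis entropy that MTEM minimizes (instance-adaptively) generalizes the Gibbs entropy used by standard self-training; a small PyTorch sketch of the plain, non-adaptive quantity:

```python
import torch

def tsallis_entropy(probs, q=2.0, eps=1e-12):
    """Tsallis entropy S_q(p) = (1 - sum_i p_i^q) / (q - 1).

    Recovers the Gibbs/Shannon entropy in the limit q -> 1; the entropic
    index q here is a free parameter, not the paper's learned setting.
    """
    if abs(q - 1.0) < 1e-6:                       # Shannon limit
        return -(probs * (probs + eps).log()).sum(dim=-1)
    return (1.0 - probs.pow(q).sum(dim=-1)) / (q - 1.0)
```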
Learning to Schedule in Non-Stationary Wireless Networks With Unknown Statistics
results: Simulations validate the theoretical results and demonstrate the advantage of MW-UCB; under non-stationarity, MW-UCB achieves a stability region arbitrarily close to that of policies with full knowledge of the channel statistics.
Abstract
The emergence of large-scale wireless networks with partially-observable and time-varying dynamics has imposed new challenges on the design of optimal control policies. This paper studies efficient scheduling algorithms for wireless networks subject to generalized interference constraint, where mean arrival and mean service rates are unknown and non-stationary. This model exemplifies realistic edge devices' characteristics of wireless communication in modern networks. We propose a novel algorithm termed MW-UCB for generalized wireless network scheduling, which is based on the Max-Weight policy and leverages the Sliding-Window Upper-Confidence Bound to learn the channels' statistics under non-stationarity. MW-UCB is provably throughput-optimal under mild assumptions on the variability of mean service rates. Specifically, as long as the total variation in mean service rates over any time period grows sub-linearly in time, we show that MW-UCB can achieve the stability region arbitrarily close to the stability region of the class of policies with full knowledge of the channel statistics. Extensive simulations validate our theoretical results and demonstrate the favorable performance of MW-UCB.
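A minimal sketch of the sliding-window UCB ingredient, assuming bounded service observations; the window length and exploration constant are illustrative, and MW-UCB's actual scheduler combines such indices with a Max-Weight rule over queue backlogs.

```python
import math
from collections import deque

class SlidingWindowUCB:
    """Sliding-window UCB estimate of a non-stationary mean service rate."""
    def __init__(self, window=100, c=2.0):
        self.obs = deque(maxlen=window)   # only the most recent observations
        self.c = c

    def update(self, reward):
        self.obs.append(reward)

    def index(self, t):
        if not self.obs:
            return float("inf")           # force initial exploration
        mean = sum(self.obs) / len(self.obs)
        bonus = math.sqrt(self.c * math.log(t + 1) / len(self.obs))
        return mean + bonus               # optimistic rate estimate
```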
Personalization of Stress Mobile Sensing using Self-Supervised Learning
results: Embeddings learned with the proposed pre-training outperform supervised baselines without self-supervised pre-training, requiring fewer than 30% of the labels to reach equivalent performance. This personalized learning approach can enable precision health systems that are tailored to each individual and require few annotations.
Abstract
Stress is widely recognized as a major contributor to a variety of health issues. Stress prediction using biosignal data recorded by wearables is a key area of study in mobile sensing research because real-time stress prediction can enable digital interventions to immediately react at the onset of stress, helping to avoid many psychological and physiological symptoms such as heart rhythm irregularities. Electrodermal activity (EDA) is often used to measure stress. However, major challenges with the prediction of stress using machine learning include the subjectivity and sparseness of the labels, a large feature space, relatively few labels, and a complex nonlinear and subjective relationship between the features and outcomes. To tackle these issues, we examine the use of model personalization: training a separate stress prediction model for each user. To allow the neural network to learn the temporal dynamics of each individual's baseline biosignal patterns, thus enabling personalization with very few labels, we pre-train a 1-dimensional convolutional neural network (CNN) using self-supervised learning (SSL). We evaluate our method using the Wearable Stress and Affect prediction (WESAD) dataset. We fine-tune the pre-trained networks to the stress prediction task and compare against equivalent models without any self-supervised pre-training. We discover that embeddings learned using our pre-training method outperform supervised baselines with significantly fewer labeled data points: the models trained with SSL require less than 30% of the labels to reach equivalent performance without personalized SSL. This personalized learning method can enable precision health systems which are tailored to each subject and require few annotations by the end user, thus allowing for the mobile sensing of increasingly complex, heterogeneous, and subjective outcomes such as stress.
Synthesizing Programmatic Policies with Actor-Critic Algorithms and ReLU Networks
results: Experiments show that this translation approach learns short and effective policies, and the translated policies are usually better than those synthesized by PIRL algorithms.
Abstract
Programmatically Interpretable Reinforcement Learning (PIRL) encodes policies in human-readable computer programs. Novel algorithms were recently introduced with the goal of handling the lack of gradient signal to guide the search in the space of programmatic policies. Most of such PIRL algorithms first train a neural policy that is used as an oracle to guide the search in the programmatic space. In this paper, we show that such PIRL-specific algorithms are not needed, depending on the language used to encode the programmatic policies. This is because one can use actor-critic algorithms to directly obtain a programmatic policy. We use a connection between ReLU neural networks and oblique decision trees to translate the policy learned with actor-critic algorithms into programmatic policies. This translation from ReLU networks allows us to synthesize policies encoded in programs with if-then-else structures, linear transformations of the input values, and PID operations. Empirical results on several control problems show that this translation approach is capable of learning short and effective policies. Moreover, the translated policies are at least competitive and often far superior to the policies PIRL algorithms synthesize.
Towards Improving Harmonic Sensitivity and Prediction Stability for Singing Melody Extraction
paper_authors: Keren Shao, Ke Chen, Taylor Berg-Kirkpatrick, Shlomo Dubnov
for: Improving the performance of singing melody extraction models.
methods: An input feature modification and a training objective modification based on two assumptions: (1) harmonics in audio spectrograms decay rapidly along the frequency axis, and (2) vocal and non-vocal segments of extremely short duration are rare.
results: The modifications are applied to several models, including MSNet, FTANet, and a newly introduced model, PianoNet, adapted from a piano transcription network; experiments show the proposed modifications effectively improve singing melody extraction.
Abstract
In deep learning research, many melody extraction models rely on redesigning neural network architectures to improve performance. In this paper, we propose an input feature modification and a training objective modification based on two assumptions. First, harmonics in the spectrograms of audio data decay rapidly along the frequency axis. To enhance the model's sensitivity on the trailing harmonics, we modify the Combined Frequency and Periodicity (CFP) representation using discrete z-transform. Second, the vocal and non-vocal segments with extremely short duration are uncommon. To ensure a more stable melody contour, we design a differentiable loss function that prevents the model from predicting such segments. We apply these modifications to several models, including MSNet, FTANet, and a newly introduced model, PianoNet, modified from a piano transcription network. Our experimental results demonstrate that the proposed modifications are empirically effective for singing melody extraction.
Fluid Property Prediction Leveraging AI and Robotics
results: Experiments and comparative analysis show that the method can accurately infer the dynamic viscosity or the category of a fluid, faster and more efficiently than traditional approaches.
Abstract
Inferring liquid properties from vision is a challenging task due to the complex nature of fluids, both in behavior and detection. Nevertheless, the ability to infer their properties directly from visual information is highly valuable for autonomous fluid handling systems, as cameras are readily available. Moreover, predicting fluid properties purely from vision can accelerate the process of fluid characterization saving considerable time and effort in various experimental environments. In this work, we present a purely vision-based approach to estimate viscosity, leveraging the fact that the behavior of the fluid oscillations is directly related to the viscosity. Specifically, we utilize a 3D convolutional autoencoder to learn latent representations of different fluid-oscillating patterns present in videos. We leverage this latent representation to visually infer the category of fluid or the dynamics viscosity of fluid from video.
Exploring the Effect of Sparse Recovery on the Quality of Image Superresolution
results: The paper empirically studies the effect of different sparse recovery algorithms on image reconstruction quality and identifies the best-performing algorithm for this purpose.
Abstract
Dictionary learning can be used for image superresolution by learning a pair of coupled dictionaries of image patches from high-resolution and low-resolution image pairs, such that corresponding pairs share the same sparse vector when represented by the coupled dictionaries. These dictionaries can then be used to reconstruct the corresponding high-resolution patches from low-resolution input images based on sparse recovery. The idea is to recover the shared sparse vector using the low-resolution dictionary and then multiply it by the high-resolution dictionary to recover the corresponding high-resolution image patch. In this work, we study the effect of the sparse recovery algorithm on the quality of the reconstructed images. We offer empirical experiments to search for the best sparse recovery algorithm that can be used for this purpose.
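A small sketch of the coupled-dictionary pipeline using orthogonal matching pursuit (one candidate sparse recovery algorithm) via scikit-learn; the random dictionaries and patch sizes below are toy stand-ins, not learned dictionaries.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
D_lr = rng.standard_normal((32, 256))    # low-resolution dictionary (toy)
D_hr = rng.standard_normal((128, 256))   # coupled high-resolution dictionary

# A low-resolution patch that is exactly sparse in D_lr.
alpha_true = np.zeros(256)
alpha_true[[3, 77, 190]] = [1.0, -0.5, 2.0]
y_lr = D_lr @ alpha_true

# Recover the shared sparse code from the LR patch ...
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=3, fit_intercept=False)
alpha = omp.fit(D_lr, y_lr).coef_

# ... and reconstruct the HR patch with the coupled HR dictionary.
x_hr = D_hr @ alpha
```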
for: computing bounds for causal queries on causal graphs with unobserved confounders and discrete valued observed variables, where identifiability does not hold.
methods: significantly pruning a linear programming (LP) formulation to compute bounds, and extending the pruning methodology to fractional LPs for incorporating additional observations.
results: significant runtime improvement compared to benchmarks in experiments, and proposal of an efficient greedy heuristic for high-quality bounds that scales to larger problems.
Abstract
We consider the problem of computing bounds for causal queries on causal graphs with unobserved confounders and discrete valued observed variables, where identifiability does not hold. Existing non-parametric approaches for computing such bounds use linear programming (LP) formulations that quickly become intractable for existing solvers because the size of the LP grows exponentially in the number of edges in the causal graph. We show that this LP can be significantly pruned, allowing us to compute bounds for significantly larger causal inference problems compared to existing techniques. This pruning procedure allows us to compute bounds in closed form for a special class of problems, including a well-studied family of problems where multiple confounded treatments influence an outcome. We extend our pruning methodology to fractional LPs which compute bounds for causal queries which incorporate additional observations about the unit. We show that our methods provide significant runtime improvement compared to benchmarks in experiments and extend our results to the finite data setting. For causal inference without additional observations, we propose an efficient greedy heuristic that produces high quality bounds, and scales to problems that are several orders of magnitude larger than those for which the pruned LP can be solved.
FPR Estimation for Fraud Detection in the Presence of Class-Conditional Label Noise
results: The paper shows that using the model to directly clean its own validation data can underestimate the true FPR and TPR even when the total cleaning error is low, indicating the need for methods that both reduce total error and de-correlate cleaning error from model scores.
Abstract
We consider the problem of estimating the false-/ true-positive-rate (FPR/TPR) for a binary classification model when there are incorrect labels (label noise) in the validation set. Our motivating application is fraud prevention where accurate estimates of FPR are critical to preserving the experience for good customers, and where label noise is highly asymmetric. Existing methods seek to minimize the total error in the cleaning process - to avoid cleaning examples that are not noise, and to ensure cleaning of examples that are. This is an important measure of accuracy but insufficient to guarantee good estimates of the true FPR or TPR for a model, and we show that using the model to directly clean its own validation data leads to underestimates even if total error is low. This indicates a need for researchers to pursue methods that not only reduce total error but also seek to de-correlate cleaning error with model scores.
Explainable Deep Learning-based Solar Flare Prediction with post hoc Attention for Operational Forecasting
paper_authors: Chetraj Pandey, Rafal A. Angryk, Manolis K. Georgoulis, Berkay Aydin
for: This paper presents a post hoc analysis of a deep learning-based full-disk solar flare prediction model for forecasting the occurrence of $\geq$M1.0-class flares within 24 hours.
methods: Custom data augmentation and sample weighting counter the inherent class-imbalance problem, with the true skill statistic (TSS) and Heidke skill score (HSS) as evaluation metrics; the model is interpreted with three post hoc attention methods.
results: Full-disk flare predictions align with characteristics of the active regions. Key findings: (1) the full-disk model can tangibly locate and predict near-limb flares, a critical feature for operational forecasting; (2) the candidate model achieves an average TSS of 0.51$\pm$0.05 and HSS of 0.38$\pm$0.08; and (3) the models can learn conspicuous features corresponding to active regions from full-disk magnetograms.
Abstract
This paper presents a post hoc analysis of a deep learning-based full-disk solar flare prediction model. We used hourly full-disk line-of-sight magnetogram images and selected binary prediction mode to predict the occurrence of $\geq$M1.0-class flares within 24 hours. We leveraged custom data augmentation and sample weighting to counter the inherent class-imbalance problem and used true skill statistic and Heidke skill score as evaluation metrics. Recent advancements in gradient-based attention methods allow us to interpret models by sending gradient signals to assign the burden of the decision on the input features. We interpret our model using three post hoc attention methods: (i) Guided Gradient-weighted Class Activation Mapping, (ii) Deep Shapley Additive Explanations, and (iii) Integrated Gradients. Our analysis shows that full-disk predictions of solar flares align with characteristics related to the active regions. The key findings of this study are: (1) We demonstrate that our full disk model can tangibly locate and predict near-limb solar flares, which is a critical feature for operational flare forecasting, (2) Our candidate model achieves an average TSS=0.51$\pm$0.05 and HSS=0.38$\pm$0.08, and (3) Our evaluation suggests that these models can learn conspicuous features corresponding to active regions from full-disk magnetograms.
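Of the three attribution methods named, Integrated Gradients is straightforward to sketch from its definition; the version below assumes a single-sample batch and a scalar class logit, and uses a Riemann approximation of the path integral.

```python
import torch

def integrated_gradients(model, x, target, baseline=None, steps=50):
    """Integrated Gradients (Sundararajan et al., 2017): average the input
    gradients along a straight path from a baseline to the input, then
    scale by (input - baseline)."""
    if baseline is None:
        baseline = torch.zeros_like(x)
    total = torch.zeros_like(x)
    for k in range(1, steps + 1):
        # point on the straight-line path from baseline to input
        point = (baseline + (k / steps) * (x - baseline)).requires_grad_(True)
        score = model(point)[0, target]          # scalar logit of the target class
        grad, = torch.autograd.grad(score, point)
        total += grad
    return (x - baseline) * total / steps
```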
A Review of Change of Variable Formulas for Generative Modeling
results: The paper reveals interesting relationships between seemingly diverse CoV formulas, highlights distinctions that are not always clear in the literature, and identifies gaps for future research.
Abstract
Change-of-variables (CoV) formulas allow to reduce complicated probability densities to simpler ones by a learned transformation with tractable Jacobian determinant. They are thus powerful tools for maximum-likelihood learning, Bayesian inference, outlier detection, model selection, etc. CoV formulas have been derived for a large variety of model types, but this information is scattered over many separate works. We present a systematic treatment from the unifying perspective of encoder/decoder architectures, which collects 28 CoV formulas in a single place, reveals interesting relationships between seemingly diverse methods, emphasizes important distinctions that are not always clear in the literature, and identifies surprising gaps for future research.
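For reference, the basic change-of-variables identity that these formulas specialize is:

```latex
% Change of variables: for an invertible, differentiable transform z = f(x),
p_X(x) = p_Z\big(f(x)\big)\, \left| \det \frac{\partial f(x)}{\partial x} \right|
```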
ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation
results: Across 22 image classification benchmarks, ReCLIP reduces the average error rate of CLIP from 30.17% to 25.06%.
Abstract
Large-scale pre-trained vision-language models such as CLIP have demonstrated outstanding performance in zero-shot classification, e.g., achieving 76.3% top-1 accuracy on ImageNet without seeing any examples, which suggests potential benefits for many tasks that have no labeled data. However, when applying CLIP to a downstream target domain, the presence of visual and text domain gaps and cross-modality misalignment can greatly impact model performance. To address such challenges, we propose ReCLIP, the first source-free domain adaptation method for vision-language models, which requires neither source data nor target labeled data. ReCLIP first learns a projection space to mitigate the misaligned visual-text embeddings and learns pseudo labels, and then deploys cross-modality self-training with the pseudo labels to update the visual and text encoders, refine labels, and reduce domain gaps and misalignments iteratively. With extensive experiments, we demonstrate that ReCLIP reduces the average error rate of CLIP from 30.17% to 25.06% on 22 image classification benchmarks.
Learning from Topology: Cosmological Parameter Estimation from the Large-scale Structure
results: In a parameter recovery test, the model estimates cosmological parameters accurately and precisely, considerably outperforming conventional Bayesian inference approaches.
Abstract
The topology of the large-scale structure of the universe contains valuable information on the underlying cosmological parameters. While persistent homology can extract this topological information, the optimal method for parameter estimation from the tool remains an open question. To address this, we propose a neural network model to map persistence images to cosmological parameters. Through a parameter recovery test, we demonstrate that our model makes accurate and precise estimates, considerably outperforming conventional Bayesian inference approaches.
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
methods: The paper defines 6 core vision-language (VL) capabilities and examines the 16 capability integrations of interest derived from their combinations, using an LLM-based evaluator to score open-ended outputs.
Abstract
We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks. Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes. Rapid model advancements pose challenges to evaluation benchmark development. Problems include: (1) How to systematically structure and evaluate the complicated multimodal tasks; (2) How to design evaluation metrics that work well across question and answer types; and (3) How to give model insights beyond a simple performance ranking. To this end, we present MM-Vet, designed based on the insight that the intriguing ability to solve complicated tasks is often achieved by a generalist model being able to integrate different core vision-language (VL) capabilities. MM-Vet defines 6 core VL capabilities and examines the 16 integrations of interest derived from the capability combination. For evaluation metrics, we propose an LLM-based evaluator for open-ended outputs. The evaluator enables the evaluation across different question types and answer styles, resulting in a unified scoring metric. We evaluate representative LMMs on MM-Vet, providing insights into the capabilities of different LMM system paradigms and models. Code and data are available at https://github.com/yuweihao/MM-Vet.
Generation of Realistic Synthetic Raw Radar Data for Automated Driving Applications using Generative Adversarial Networks
results: The study finds that this method can generate highly realistic radar data, including radar reflections and background noise. In addition, it enables data augmentation, e.g., by generating data for scenarios that are infeasible or safety-critical to reproduce in real life.
Abstract
The main approaches for simulating FMCW radar are based on ray tracing, which is usually computationally intensive and does not account for background noise. This work proposes a faster method for FMCW radar simulation capable of generating synthetic raw radar data using generative adversarial networks (GAN). The code and pre-trained weights are open-source and available on GitHub. This method generates 16 simultaneous chirps, which allows the generated data to be used for the further development of algorithms for processing radar data (filtering and clustering). This can increase the potential for data augmentation, e.g., by generating data in non-existent or safety-critical scenarios that are not reproducible in real life. In this work, the GAN was trained with radar measurements of a motorcycle and used to generate synthetic raw radar data of a motorcycle traveling in a straight line. For generating this data, the distance of the motorcycle and Gaussian noise are used as input to the neural network. The synthetic generated radar chirps were evaluated using the Frechet Inception Distance (FID). Then, the Range-Azimuth (RA) map is calculated twice: first, based on synthetic data using this GAN and, second, based on real data. Based on these RA maps, an algorithm with adaptive threshold and edge detection is used for object detection. The results have shown that the data is realistic in terms of coherent radar reflections of the motorcycle and background noise, based on the comparison of chirps, the RA maps, and the object detection results. Thus, the proposed method has been shown to minimize the simulation-to-reality gap for the generation of radar data.
results: Experimental results show that the proposed attack accurately infers labels in VFL, reaching nearly 100% accuracy in most cases. Moreover, common defense mechanisms cannot prevent the attack without degrading the model's performance on the main classification task.
Abstract
Federated learning enables collaborative training of machine learning models by keeping the raw data of the involved workers private. One of its main objectives is to improve the models' privacy, security, and scalability. Vertical Federated Learning (VFL) offers an efficient cross-silo setting where a few parties collaboratively train a model without sharing the same features. In such a scenario, classification labels are commonly considered sensitive information held exclusively by one (active) party, while other (passive) parties use only their local information. Recent works have uncovered important flaws of VFL, leading to possible label inference attacks under the assumption that the attacker has some, even limited, background knowledge on the relation between labels and data. In this work, we are the first (to the best of our knowledge) to investigate label inference attacks on VFL using a zero-background knowledge strategy. To concretely formulate our proposal, we focus on Graph Neural Networks (GNNs) as a target model for the underlying VFL. In particular, we refer to node classification tasks, which are widely studied, and GNNs have shown promising results. Our proposed attack, BlindSage, provides impressive results in the experiments, achieving nearly 100% accuracy in most cases. Even when the attacker has no information about the used architecture or the number of classes, the accuracy remained above 85% in most instances. Finally, we observe that well-known defenses cannot mitigate our attack without affecting the model's performance on the main classification task.
Universal Approximation of Linear Time-Invariant (LTI) Systems through RNNs: Power of Randomness in Reservoir Computing
results: The authors show that RC with randomly initialized recurrent weights can approximate an arbitrary LTI system, and provide a clear signal-processing interpretation of RC's behavior on the LTI system simulation problem. They further derive the optimal probability distribution for generating RC's recurrent weights and validate its optimality through extensive numerical evaluations.
Abstract
Recurrent neural networks (RNNs) are known to be universal approximators of dynamic systems under fairly mild and general assumptions, making them good tools to process temporal information. However, RNNs usually suffer from the issues of vanishing and exploding gradients in the standard RNN training. Reservoir computing (RC), a special RNN where the recurrent weights are randomized and left untrained, has been introduced to overcome these issues and has demonstrated superior empirical performance in fields as diverse as natural language processing and wireless communications especially in scenarios where training samples are extremely limited. On the contrary, the theoretical grounding to support this observed performance has not been fully developed at the same pace. In this work, we show that RNNs can provide universal approximation of linear time-invariant (LTI) systems. Specifically, we show that RC can universally approximate a general LTI system. We present a clear signal processing interpretation of RC and utilize this understanding in the problem of simulating a generic LTI system through RC. Under this setup, we analytically characterize the optimal probability distribution function for generating the recurrent weights of the underlying RNN of the RC. We provide extensive numerical evaluations to validate the optimality of the derived optimum distribution of the recurrent weights of the RC for the LTI system simulation problem. Our work results in clear signal processing-based model interpretability of RC and provides theoretical explanation for the power of randomness in setting instead of training RC's recurrent weights. It further provides a complete optimum analytical characterization for the untrained recurrent weights, marking an important step towards explainable machine learning (XML) which is extremely important for applications where training samples are limited.
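To make the reservoir-computing setup concrete, here is a minimal echo-state-network sketch in Python: the recurrent weights are drawn at random and left untrained, and only a linear readout is fit to reproduce an LTI system (a first-order low-pass filter here). The target system, reservoir size, tanh nonlinearity, and weight scaling are illustrative assumptions, not the paper's construction or its derived optimal weight distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
n_res, T = 200, 2000

# Target LTI system: first-order low-pass filter y[t] = a*y[t-1] + (1-a)*u[t]
a = 0.9
u = rng.standard_normal(T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = a * y[t - 1] + (1 - a) * u[t]

# Reservoir with random, untrained recurrent weights (spectral radius < 1)
W = rng.standard_normal((n_res, n_res))
W *= 0.8 / np.max(np.abs(np.linalg.eigvals(W)))
w_in = rng.standard_normal(n_res)

x = np.zeros((T, n_res))
for t in range(1, T):
    x[t] = np.tanh(W @ x[t - 1] + w_in * u[t])

# Only the linear readout is trained (ridge regression)
ridge = 1e-6
w_out = np.linalg.solve(x.T @ x + ridge * np.eye(n_res), x.T @ y)
y_hat = x @ w_out
print("relative error:", np.linalg.norm(y_hat - y) / np.linalg.norm(y))
```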
Fast and Accurate Reduced-Order Modeling of a MOOSE-based Additive Manufacturing Model with Operator Learning
methods: This study uses an operator learning (OL) approach to learn a family of differential equations produced by varying process parameters of the laser's Gaussian point heat source. Specifically, the authors used the Fourier neural operator (FNO) and the deep operator network (DeepONet) to develop ROMs for time-dependent responses.
results: Compared with a deep neural network (DNN)-based ROM, the OL methods deliver comparable performance and, in terms of accuracy and generalizability, even outperform the DNN. Both FNO and DeepONet can predict time-series data without resorting to dimensionality-reduction techniques.
Abstract
One predominant challenge in additive manufacturing (AM) is to achieve specific material properties by manipulating manufacturing process parameters during the runtime. Such manipulation tends to increase the computational load imposed on existing simulation tools employed in AM. The goal of the present work is to construct a fast and accurate reduced-order model (ROM) for an AM model developed within the Multiphysics Object-Oriented Simulation Environment (MOOSE) framework, ultimately reducing the time/cost of AM control and optimization processes. Our adoption of the operator learning (OL) approach enabled us to learn a family of differential equations produced by altering process variables in the laser's Gaussian point heat source. More specifically, we used the Fourier neural operator (FNO) and deep operator network (DeepONet) to develop ROMs for time-dependent responses. Furthermore, we benchmarked the performance of these OL methods against a conventional deep neural network (DNN)-based ROM. Ultimately, we found that OL methods offer comparable performance and, in terms of accuracy and generalizability, even outperform DNN at predicting scalar model responses. The DNN-based ROM afforded the fastest training time. Furthermore, all the ROMs were faster than the original MOOSE model yet still provided accurate predictions. FNO had a smaller mean prediction error than DeepONet, with a larger variance for time-dependent responses. Unlike DNN, both FNO and DeepONet were able to simulate time series data without the need for dimensionality reduction techniques. The present work can help facilitate the AM optimization process by enabling faster execution of simulation tools while still preserving evaluation accuracy.
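As context for the FNO-based ROM, the sketch below shows the core building block of a Fourier neural operator, a 1-D spectral convolution layer in PyTorch: the input is transformed with an FFT, a learned complex weight multiplies the lowest Fourier modes, and the result is transformed back. Channel counts and mode truncation are illustrative assumptions; the paper's actual architecture and training setup are not reproduced here.

```python
import torch
import torch.nn as nn

class SpectralConv1d(nn.Module):
    """One FNO layer: pointwise multiplication of the lowest Fourier modes."""
    def __init__(self, in_c, out_c, modes):
        super().__init__()
        self.modes = modes
        scale = 1 / (in_c * out_c)
        self.weight = nn.Parameter(
            scale * torch.randn(in_c, out_c, modes, dtype=torch.cfloat))

    def forward(self, x):                      # x: (batch, in_c, n)
        x_ft = torch.fft.rfft(x)               # to frequency domain
        out_ft = torch.zeros(x.size(0), self.weight.size(1), x_ft.size(-1),
                             dtype=torch.cfloat, device=x.device)
        out_ft[:, :, :self.modes] = torch.einsum(
            "bim,iom->bom", x_ft[:, :, :self.modes], self.weight)
        return torch.fft.irfft(out_ft, n=x.size(-1))  # back to physical domain

layer = SpectralConv1d(4, 4, modes=16)
print(layer(torch.randn(2, 4, 128)).shape)      # torch.Size([2, 4, 128])
```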
Nonprehensile Planar Manipulation through Reinforcement Learning with Multimodal Categorical Exploration
results: The authors validate the approach in simulation and on physical hardware, showing that it achieves accurate manipulation for arbitrary initial and target object poses, and remains robust under external disturbances and observation noise.
Abstract
Developing robot controllers capable of achieving dexterous nonprehensile manipulation, such as pushing an object on a table, is challenging. The underactuated and hybrid-dynamics nature of the problem, further complicated by the uncertainty resulting from the frictional interactions, requires sophisticated control behaviors. Reinforcement Learning (RL) is a powerful framework for developing such robot controllers. However, previous RL literature addressing the nonprehensile pushing task achieves low accuracy, non-smooth trajectories, and only simple motions, i.e. without rotation of the manipulated object. We conjecture that previously used unimodal exploration strategies fail to capture the inherent hybrid-dynamics of the task, arising from the different possible contact interaction modes between the robot and the object, such as sticking, sliding, and separation. In this work, we propose a multimodal exploration approach through categorical distributions, which enables us to train planar pushing RL policies for arbitrary starting and target object poses, i.e. positions and orientations, and with improved accuracy. We show that the learned policies are robust to external disturbances and observation noise, and scale to tasks with multiple pushers. Furthermore, we validate the transferability of the learned policies, trained entirely in simulation, to a physical robot hardware using the KUKA iiwa robot arm. See our supplemental video: https://youtu.be/vTdva1mgrk4.
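The multimodal categorical exploration idea can be illustrated with a short PyTorch sketch: each continuous action dimension is discretized into bins and sampled from its own categorical distribution, which can place probability mass on several contact modes at once, unlike a unimodal Gaussian. The bin range, bin count, and two-dimensional pusher action are assumptions for illustration.

```python
import torch
from torch.distributions import Categorical

n_action_dims, n_bins = 2, 11            # e.g., planar pusher velocity (vx, vy)
bin_centers = torch.linspace(-1.0, 1.0, n_bins)

# A policy network would output one logit vector per action dimension;
# random logits stand in for that output here.
logits = torch.randn(n_action_dims, n_bins)

dist = Categorical(logits=logits)         # one categorical per action dimension
idx = dist.sample()                       # shape: (n_action_dims,)
action = bin_centers[idx]                 # map bin indices back to continuous actions
log_prob = dist.log_prob(idx).sum()       # summed log-prob for the policy-gradient loss
print(action, log_prob)
```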
Uncertainty Estimation and Propagation in Accelerated MRI Reconstruction
results: The paper shows that PHiRec produces high-quality reconstructions together with well-calibrated uncertainty estimates. It further demonstrates how uncertainties arising in MR reconstruction can be propagated to a downstream segmentation task, with PHiRec providing well-calibrated estimates of segmentation uncertainties that originate in the reconstruction process.
Abstract
MRI reconstruction techniques based on deep learning have led to unprecedented reconstruction quality, especially in highly accelerated settings. However, deep learning techniques are also known to fail unexpectedly and hallucinate structures. This is particularly problematic if reconstructions are directly used for downstream tasks such as real-time treatment guidance or automated extraction of clinical parameters (e.g. via segmentation). Well-calibrated uncertainty quantification will be a key ingredient for safe use of this technology in clinical practice. In this paper we propose a novel probabilistic reconstruction technique (PHiRec) building on the idea of conditional hierarchical variational autoencoders. We demonstrate that our proposed method produces high-quality reconstructions as well as uncertainty quantification that is substantially better calibrated than several strong baselines. We furthermore demonstrate how uncertainties arising in the MR reconstruction can be propagated to a downstream segmentation task, and show that PHiRec also allows well-calibrated estimation of segmentation uncertainties that originated in the MR reconstruction process.
Generative Modelling of Lévy Area for High Order SDE Simulation
results: Across several metrics, the model achieves excellent performance for four-dimensional Brownian motion; a numerical experiment on the log-Heston model from mathematical finance shows that high-quality synthetic Lévy area leads to high-order weak convergence and variance reduction.
Abstract
It is well known that, when numerically simulating solutions to SDEs, achieving a strong convergence rate better than O(\sqrt{h}) (where h is the step size) requires the use of certain iterated integrals of Brownian motion, commonly referred to as its "Lévy areas". However, these stochastic integrals are difficult to simulate due to their non-Gaussian nature, and for a d-dimensional Brownian motion with d > 2, no fast almost-exact sampling algorithm is known. In this paper, we propose LévyGAN, a deep-learning-based model for generating approximate samples of Lévy area conditional on a Brownian increment. Due to our "Bridge-flipping" operation, the output samples match all joint and conditional odd moments exactly. Our generator employs a tailored GNN-inspired architecture, which enforces the correct dependency structure between the output distribution and the conditioning variable. Furthermore, we incorporate a mathematically principled characteristic-function based discriminator. Lastly, we introduce a novel training mechanism termed "Chen-training", which circumvents the need for expensive-to-generate training data-sets. This new training procedure is underpinned by our two main theoretical results. For 4-dimensional Brownian motion, we show that LévyGAN exhibits state-of-the-art performance across several metrics which measure both the joint and marginal distributions. We conclude with a numerical experiment on the log-Heston model, a popular SDE in mathematical finance, demonstrating that high-quality synthetic Lévy area can lead to high order weak convergence and variance reduction when using multilevel Monte Carlo (MLMC).
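For reference, the quantity being generated has a standard definition: for a two-dimensional Brownian motion $(W^1, W^2)$, the Lévy area over $[s, t]$ is

```latex
% Levy area of a 2-D Brownian motion (W^1, W^2) over [s, t]; standard definition
A_{s,t} \;=\; \frac{1}{2}\int_s^t \bigl(W^1_r - W^1_s\bigr)\,\mathrm{d}W^2_r
         \;-\; \frac{1}{2}\int_s^t \bigl(W^2_r - W^2_s\bigr)\,\mathrm{d}W^1_r .
```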
results: Pruning guided by Bayesian inference achieves the desired levels of sparsity while maintaining competitive accuracy.
Abstract
Neural network pruning is a highly effective technique aimed at reducing the computational and memory demands of large neural networks. In this research paper, we present a novel approach to pruning neural networks utilizing Bayesian inference, which can seamlessly integrate into the training procedure. Our proposed method leverages the posterior probabilities of the neural network prior to and following pruning, enabling the calculation of Bayes factors. The calculated Bayes factors guide the iterative pruning. Through comprehensive evaluations conducted on multiple benchmarks, we demonstrate that our method achieves desired levels of sparsity while maintaining competitive accuracy.
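The abstract does not spell out how the Bayes factors are computed, so the sketch below swaps in one concrete, standard choice: a Savage-Dickey density ratio under a Gaussian posterior and prior for each weight, pruning where the evidence favors zero. The posterior statistics, prior scale, threshold, and the helper name `savage_dickey_bf` are all assumptions for illustration, not the paper's estimator.

```python
import numpy as np
from scipy.stats import norm

def savage_dickey_bf(mu_post, sigma_post, sigma_prior):
    # Bayes factor in favour of w = 0: posterior density at 0 over prior density at 0
    return norm.pdf(0.0, mu_post, sigma_post) / norm.pdf(0.0, 0.0, sigma_prior)

rng = np.random.default_rng(1)
mu = rng.normal(0, 0.5, size=1000)     # stand-in posterior means for 1000 weights
sd = np.full(1000, 0.1)                # stand-in posterior std devs
bf = savage_dickey_bf(mu, sd, sigma_prior=1.0)

keep = bf < 3.0                        # prune where evidence favours w = 0
print(f"pruned {np.mean(~keep):.1%} of weights")
```

In an iterative scheme, this test would be re-run after each fine-tuning round, so that the posterior statistics reflect the already-pruned network.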
From Military to Healthcare: Adopting and Expanding Ethical Principles for Generative Artificial Intelligence
paper_authors: David Oniani, Jordan Hilsman, Yifan Peng, COL, Ronald K. Poropatich, COL Jeremy C. Pamplin, LTC Gary L. Legault, Yanshan Wang
For: The paper proposes ethical principles for the use of generative AI in healthcare, addressing concerns about transparency, bias, and ethical dilemmas.
Methods: The paper uses a comprehensive review of existing literature on generative AI and healthcare to identify key ethical challenges and proposes the GREAT PLEA ethical principles as a framework for addressing these challenges.
Results: The paper provides a proactive approach to addressing the ethical dilemmas posed by the integration of generative AI in healthcare, with the goal of ensuring that the technology is used in a responsible and equitable manner.
Abstract
In 2020, the U.S. Department of Defense officially disclosed a set of ethical principles to guide the use of Artificial Intelligence (AI) technologies on future battlefields. Despite stark differences, there are core similarities between the military and medical service. Warriors on battlefields often face life-altering circumstances that require quick decision-making. Medical providers experience similar challenges in a rapidly changing healthcare environment, such as in the emergency department or during surgery treating a life-threatening condition. Generative AI, an emerging technology designed to efficiently generate valuable information, holds great promise. As computing power becomes more accessible and the abundance of health data, such as electronic health records, electrocardiograms, and medical images, increases, it is inevitable that healthcare will be revolutionized by this technology. Recently, generative AI has captivated the research community, leading to debates about its application in healthcare, mainly due to concerns about transparency and related issues. Meanwhile, concerns about the potential exacerbation of health disparities due to modeling biases have raised notable ethical concerns regarding the use of this technology in healthcare. However, the ethical principles for generative AI in healthcare have been understudied, and decision-makers often fail to consider the significance of generative AI. In this paper, we propose GREAT PLEA ethical principles, encompassing governance, reliability, equity, accountability, traceability, privacy, lawfulness, empathy, and autonomy, for generative AI in healthcare. We aim to proactively address the ethical dilemmas and challenges posed by the integration of generative AI in healthcare.
Adaptive Preferential Attached kNN Graph with Distribution-Awareness
methods: An adaptive-k kNN graph construction that incorporates distribution information, "pulling" ambiguous samples toward their original classes to improve overall generalization.
results: On a variety of benchmark datasets, paNNG outperforms state-of-the-art algorithms in adaptability and effectiveness, demonstrating its reliability across diverse real-world scenarios.
Abstract
Graph-based kNN algorithms have garnered widespread popularity for machine learning tasks due to their simplicity and effectiveness. However, as factual data often inherit complex distributions, the conventional kNN graph's reliance on a unified k-value can hinder its performance. A crucial factor behind this challenge is the presence of ambiguous samples along decision boundaries that are inevitably more prone to incorrect classifications. To address the situation, we propose the Preferential Attached k-Nearest Neighbors Graph (paNNG), which adopts distribution-aware adaptive-k into graph construction. By incorporating distribution information as a cohesive entity, paNNG can significantly improve performance on ambiguous samples by "pulling" them towards their original classes and hence enhance overall generalization capability. Through rigorous evaluations on diverse datasets, paNNG outperforms state-of-the-art algorithms, showcasing its adaptability and efficacy across various real-world scenarios.
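A minimal Python sketch of distribution-aware adaptive-k graph construction follows. The specific rule, scaling each sample's k with the fraction of opposite-label neighbors so that ambiguous boundary samples get connected more strongly to their class, is an illustrative stand-in and not paNNG's actual formulation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adaptive_knn_graph(X, y, k_min=3, k_max=15):
    """Per-sample k chosen from local label ambiguity (illustrative rule)."""
    nbrs = NearestNeighbors(n_neighbors=k_max + 1).fit(X)
    _, idx = nbrs.kneighbors(X)
    idx = idx[:, 1:]                                   # drop self-neighbor
    # Ambiguity: fraction of the k_max neighbours with a different label
    ambiguity = (y[idx] != y[:, None]).mean(axis=1)
    # Ambiguous samples get more neighbours to "pull" them toward their class
    k_per_sample = np.round(k_min + ambiguity * (k_max - k_min)).astype(int)
    edges = [(i, j) for i, k in enumerate(k_per_sample) for j in idx[i, :k]]
    return edges

X = np.random.rand(100, 5)
y = np.random.randint(0, 3, 100)
print(len(adaptive_knn_graph(X, y)))
```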
results: Experiments on two- and three-dimensional porous media show that the FSMA algorithm performs well regardless of the pore network's topological structure or the positions of pore and throat centers, and handles both closed- and open-boundary cases. The algorithm can also search dead-end pores, which is of great significance for studying multiphase flow in porous media.
Abstract
Pore-network models (PNMs) have become an important tool in the study of fluid flow in porous media over the last few decades, and the accuracy of their results highly depends on the extraction of pore networks. Traditional methods of pore-network extraction are based on pixels and require images with high quality. Here, a pixel-free method called the flashlight search medial axis (FSMA) algorithm is proposed for pore-network extraction in a continuous space. The search domain in a two-dimensional space is a line, whereas a surface domain is searched in a three-dimensional scenario. Thus, the FSMA algorithm follows the dimensionality reduction idea; the medial axis can be identified using only a few points instead of calculating every point in the void space. In this way, computational complexity of this method is greatly reduced compared to that of traditional pixel-based extraction methods, thus enabling large-scale pore-network extraction. Based on cases featuring two- and three-dimensional porous media, the FSMA algorithm performs well regardless of the topological structure of the pore network or the positions of the pore and throat centers. This algorithm can also be used to examine both closed- and open-boundary cases. Finally, the FSMA algorithm can search dead-end pores, which is of great significance in the study of multiphase flow in porous media.
Landmark Detection using Transformer Toward Robot-assisted Nasal Airway Intubation
results: Experimental results show that the solution achieves competitive detection accuracy for robot-assisted intubation applications.
Abstract
Robot-assisted airway intubation applications need high accuracy in locating targets and organs. Two vital landmarks, nostrils and glottis, can be detected during the intubation to accommodate the stages of nasal intubation. Automated landmark detection can provide accurate localization and quantitative evaluation. The Detection Transformer (DeTR) leads object detectors to a new paradigm with long-range dependence. However, current DeTR requires long iterations to converge, and does not perform well in detecting small objects. This paper proposes a transformer-based landmark detection solution with deformable DeTR and a semantic-aligned-matching module for detecting landmarks in robot-assisted intubation. The semantics aligner can effectively align the semantics of object queries and image features in the same embedding space using the most discriminative features. To evaluate the performance of our solution, we utilize a publicly accessible glottis dataset and automatically annotate a nostril detection dataset. The experimental results demonstrate our competitive performance in detection accuracy. Our code is publicly accessible.
Non-line-of-sight reconstruction via structure sparsity regularization
paper_authors: Duolan Huang, Quan Chen, Zhun Wei, Rui Chen
For: The paper aims to improve the quality of non-line-of-sight (NLOS) imaging, specifically the denoising and reconstruction of occluded objects.
Methods: The paper proposes a regularization method called structure sparsity (SS) regularization, which incorporates nuclear norm penalization into the cost function of the directional light-cone transform (DLCT) model for NLOS imaging systems. The method utilizes prior knowledge of structure sparseness to facilitate denoising and improve reconstruction quality.
Results: The proposed approach yields high-quality reconstructions, surpassing state-of-the-art reconstruction algorithms, especially in scenarios involving short exposure and low signal-noise-ratio (SNR) measurements. The approach is effective in denoising and reconstructing occluded objects, and is shown to be robust in various scenarios.
Abstract
Non-line-of-sight (NLOS) imaging allows for the imaging of objects around a corner, which enables potential applications in various fields such as autonomous driving, robotic vision, medical imaging, security monitoring, etc. However, the quality of reconstruction is challenged by low signal-noise-ratio (SNR) measurements. In this study, we present a regularization method, referred to as structure sparsity (SS) regularization, for denoising in NLOS reconstruction. By exploiting the prior knowledge of structure sparseness, we incorporate nuclear norm penalization into the cost function of directional light-cone transform (DLCT) model for NLOS imaging system. This incorporation effectively integrates the neighborhood information associated with the directional albedo, thereby facilitating the denoising process. Subsequently, the reconstruction is achieved by optimizing a directional albedo model with SS regularization using fast iterative shrinkage-thresholding algorithm. Notably, the robust reconstruction of occluded objects is observed. Through comprehensive evaluations conducted on both synthetic and experimental datasets, we demonstrate that the proposed approach yields high-quality reconstructions, surpassing the state-of-the-art reconstruction algorithms, especially in scenarios involving short exposure and low SNR measurements.
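The nuclear-norm penalty mentioned above is typically handled inside an iterative shrinkage-thresholding scheme via its proximal operator, singular-value soft-thresholding; the NumPy sketch below shows that standard operator in isolation. How it is applied to the directional albedo within the DLCT model is specific to the paper and not reproduced here.

```python
import numpy as np

def svt(M, tau):
    """Singular-value soft-thresholding: the proximal operator of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

A = np.random.randn(32, 32)
A_low_rank = svt(A, tau=2.0)
print(np.linalg.matrix_rank(A_low_rank))   # shrinkage suppresses small singular values
```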
Dual Degradation-Inspired Deep Unfolding Network for Low-Light Image Enhancement
results: Extensive experiments on multiple popular low-light image datasets demonstrate that DASUNet is more effective than canonical low-light image enhancement methods.
Abstract
Although low-light image enhancement has made great strides with deep enhancement models, most of these models focus on enhancement performance via elaborate black-box networks and rarely explore the physical significance of the enhancement model. To address this issue, we propose a Dual degrAdation-inSpired deep Unfolding network, termed DASUNet, for low-light image enhancement. Specifically, we construct a dual degradation model (DDM) to explicitly simulate the deterioration mechanism of low-light images. It learns two distinct image priors by considering degradation specificity between the luminance and chrominance spaces. To make the proposed scheme tractable, we design an alternating optimization solution to solve the proposed DDM. Further, the designed solution is unfolded into a specified deep network, imitating the iterative updating rules, to form DASUNet. Local and long-range information are obtained by a prior modeling module (PMM), inheriting the advantages of convolution and Transformer, to enhance the representation capability of the dual degradation priors. Additionally, a space aggregation module (SAM) is presented to boost the interaction of the two degradation models. Extensive experiments on multiple popular low-light image datasets validate the effectiveness of DASUNet compared to canonical state-of-the-art low-light image enhancement methods. Our source code and pretrained model will be publicly available.
Incorporation of Eye-Tracking and Gaze Feedback to Characterize and Improve Radiologist Search Patterns of Chest X-rays: A Randomized Controlled Clinical Trial
paper_authors: Carolina Ramirez-Tamayo, Syed Hasib Akhter Faruqui, Stanford Martinez, Angel Brisco, Nicholas Czarnek, Adel Alaeddini, Jeffrey R. Mock, Edward J. Golob, Kal L. Clark
results: With the automated feedback-driven educational framework, radiologists' accuracy in detecting suspicious-for-cancer nodules improved by 38.89% in absolute terms, far exceeding the control group's improvement (5.56%, p-value = 0.006). Improvement was also significantly more rapid over the four training sessions (p-value = 0.0001). However, other metrics such as speed, search-pattern heterogeneity, distractions, and coverage showed no statistically significant changes.
Abstract
Diagnostic errors in radiology often occur due to incomplete visual assessments by radiologists, despite their knowledge of predicting disease classes. This insufficiency is possibly linked to the absence of required training in search patterns. Additionally, radiologists lack consistent feedback on their visual search patterns, relying on ad-hoc strategies and peer input to minimize errors and enhance efficiency, leading to suboptimal patterns and potential false negatives. This study aimed to use eye-tracking technology to analyze radiologist search patterns, quantify performance using established metrics, and assess the impact of an automated feedback-driven educational framework on detection accuracy. Ten residents participated in a controlled trial focused on detecting suspicious pulmonary nodules. They were divided into an intervention group (received automated feedback) and a control group. Results showed that the intervention group exhibited a 38.89% absolute improvement in detecting suspicious-for-cancer nodules, surpassing the control group's improvement (5.56%, p-value=0.006). Improvement was more rapid over the four training sessions (p-value=0.0001). However, other metrics such as speed, search pattern heterogeneity, distractions, and coverage did not show significant changes. In conclusion, implementing an automated feedback-driven educational framework improved radiologist accuracy in detecting suspicious nodules. The study underscores the potential of such systems in enhancing diagnostic performance and reducing errors. Further research and broader implementation are needed to consolidate these promising results and develop effective training strategies for radiologists, ultimately benefiting patient outcomes.
results: On the Voice Bank + DEMAND dataset, the proposed model achieves results comparable to or better than state-of-the-art models while using far fewer parameters (0.58M).
Abstract
Speech enhancement is a demanding task in automated speech processing pipelines, focusing on separating clean speech from noisy channels. Transformer based models have recently bested RNN and CNN models in speech enhancement, however at the same time they are much more computationally expensive and require much more high quality training data, which is always hard to come by. In this paper, we present an improvement for speech enhancement models that maintains the expressiveness of self-attention while significantly reducing model complexity, which we have termed Spectrum Attention Fusion. We carefully construct a convolutional module to replace several self-attention layers in a speech Transformer, allowing the model to more efficiently fuse spectral features. Our proposed model is able to achieve comparable or better results against SOTA models but with significantly smaller parameters (0.58M) on the Voice Bank + DEMAND dataset.
Emo-DNA: Emotion Decoupling and Alignment Learning for Cross-Corpus Speech Emotion Recognition
for: This work targets cross-corpus speech emotion recognition (SER), i.e., generalizing the ability to infer emotion from a well-labeled corpus to an unlabeled one.
methods: The work proposes a novel Emotion Decoupling aNd Alignment framework (EMO-DNA), an unsupervised domain adaptation (UDA) method that learns emotion-relevant corpus-invariant features. EMO-DNA has two key components: contrastive emotion decoupling and dual-level emotion alignment.
results: Experimental results show that EMO-DNA outperforms prior methods in several cross-corpus scenarios, achieving higher accuracy and better consistency.
Abstract
Cross-corpus speech emotion recognition (SER) seeks to generalize the ability of inferring speech emotion from a well-labeled corpus to an unlabeled one, which is a rather challenging task due to the significant discrepancy between two corpora. Existing methods, typically based on unsupervised domain adaptation (UDA), struggle to learn corpus-invariant features by global distribution alignment, but unfortunately, the resulting features are mixed with corpus-specific features or not class-discriminative. To tackle these challenges, we propose a novel Emotion Decoupling aNd Alignment learning framework (EMO-DNA) for cross-corpus SER, a novel UDA method to learn emotion-relevant corpus-invariant features. The novelties of EMO-DNA are two-fold: contrastive emotion decoupling and dual-level emotion alignment. On one hand, our contrastive emotion decoupling achieves decoupling learning via a contrastive decoupling loss to strengthen the separability of emotion-relevant features from corpus-specific ones. On the other hand, our dual-level emotion alignment introduces an adaptive threshold pseudo-labeling to select confident target samples for class-level alignment, and performs corpus-level alignment to jointly guide model for learning class-discriminative corpus-invariant features across corpora. Extensive experimental results demonstrate the superior performance of EMO-DNA over the state-of-the-art methods in several cross-corpus scenarios. Source code is available at https://github.com/Jiaxin-Ye/Emo-DNA.
Capturing Spectral and Long-term Contextual Information for Speech Emotion Recognition Using Deep Learning Techniques
results: The results indicate that GCNs capture long-term contextual relationships and semantics in textual data, while HuBERT captures the temporal dynamics and subtle variations in speech; combining the two improves emotion recognition accuracy.
Abstract
Traditional approaches in speech emotion recognition, such as LSTM, CNN, RNN, SVM, and MLP, have limitations: difficulty capturing long-term dependencies in sequential data, difficulty modeling temporal dynamics, and difficulty capturing complex patterns and relationships in multimodal data. This research addresses these shortcomings by proposing an ensemble model that combines Graph Convolutional Networks (GCN) for processing textual data and the HuBERT transformer for analyzing audio signals. We found that GCNs excel at capturing long-term contextual dependencies and relationships within textual data by leveraging graph-based representations of text and thus detecting the contextual meaning and semantic relationships between words. HuBERT, on the other hand, utilizes self-attention mechanisms to capture long-range dependencies, enabling the modeling of temporal dynamics present in speech and capturing subtle nuances and variations that contribute to emotion recognition. By combining GCN and HuBERT, our ensemble model can leverage the strengths of both approaches. This allows for the simultaneous analysis of multimodal data, and the fusion of these modalities enables the extraction of complementary information, enhancing the discriminative power of the emotion recognition system. The results indicate that the combined model can overcome the limitations of traditional methods, leading to enhanced accuracy in recognizing emotions from speech.
N-gram Boosting: Improving Contextual Biasing with Normalized N-gram Targets
results: Keyword recognition rate improves by 26% relative on the proprietary in-domain dataset and by 2% on LibriSpeech.
Abstract
Accurate transcription of proper names and technical terms is particularly important in speech-to-text applications for business conversations. These words, which are essential to understanding the conversation, are often rare and therefore likely to be under-represented in text and audio training data, creating a significant challenge in this domain. We present a two-step keyword boosting mechanism that successfully works on normalized unigrams and n-grams rather than just single tokens, which eliminates missing hits issues with boosting raw targets. In addition, we show how adjusting the boosting weight logic avoids over-boosting multi-token keywords. This improves our keyword recognition rate by 26% relative on our proprietary in-domain dataset and 2% on LibriSpeech. This method is particularly useful on targets that involve non-alphabetic characters or have non-standard pronunciations.
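A toy Python sketch of contextual biasing over normalized n-grams follows: the hypothesis is normalized, partial prefixes of each boosted n-gram earn a bonus during decoding (so multi-token keywords are not missed mid-word), and the bonus is scaled by the matched fraction to avoid over-boosting multi-token keywords. The scoring rule, the weight, and the function name `boost_score` are illustrative assumptions, not the paper's exact logic.

```python
def boost_score(hyp_tokens, keywords, weight=2.0):
    """Bonus when the normalized hypothesis ends with a prefix of a boosted
    n-gram; dividing by n-gram length avoids over-boosting long keywords."""
    text = " ".join(t.lower() for t in hyp_tokens)
    bonus = 0.0
    for kw in keywords:                      # keywords are pre-normalized n-grams
        toks = kw.split()
        for p in range(1, len(toks) + 1):    # match partial prefixes while decoding
            if text.endswith(" ".join(toks[:p])):
                bonus = max(bonus, weight * p / len(toks))
    return bonus

print(boost_score(["call", "doctor", "wen"], ["doctor wen"]))  # -> 2.0
```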
results: Numerical experiments show that the proposed algorithm is convergent and that the positivity constraint can control the range of relative volume without compromising registration accuracy. Moreover, the model produces diffeomorphic maps for large deformations and outperforms several existing registration models.
Abstract
Diffeomorphic registration has become a powerful approach for seeking a smooth and invertible spatial transformation between two coordinate systems measured via template and reference images. While the pointwise volume-preserving constraint is effective for some problems, it is too stringent for many others, especially when the local deformations are relatively large, because it may lead to poor local matching under large deformation. In this paper, we propose a novel bi-variant diffeomorphic image registration model with a soft constraint on the Jacobian equation, which allows local deformations to shrink and grow within a flexible range. The Jacobian determinant of the transformation is explicitly controlled by optimizing a relaxation function. To prevent deformation folding and enhance the smoothness of the deformation, we not only impose a positivity constraint when optimizing the relaxation function, but also employ a regularizer to ensure its smoothness. Furthermore, the positivity constraint keeps the Jacobian determinant as close to one as possible, which helps to obtain a volume-preserving transformation on average. We further analyze the existence of a minimizer for the variational model and propose a penalty splitting method with a multilevel strategy to solve it. Numerical experiments show that the proposed algorithm is convergent, and the positivity constraint can control the range of relative volume without compromising registration accuracy. Moreover, the proposed model produces diffeomorphic maps for large deformations, and achieves better performance compared to several existing registration models.
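The abstract does not display the model, but one plausible reading of the bi-variant formulation with a soft Jacobian constraint is the following functional, where $\varphi$ is the transformation, $f > 0$ the relaxation function, $\mathcal{D}$ a similarity measure, and $\alpha, \beta, \gamma$ weights; all of this notation is assumed for illustration and may differ from the paper's exact model.

```latex
% A plausible relaxed model: f stands in for det(grad phi) via a soft penalty.
\min_{\varphi,\; f>0}\;
\mathcal{D}\bigl(T\circ\varphi,\, R\bigr)
+ \alpha\,\mathcal{R}(\varphi)
+ \beta \int_\Omega \bigl(\det\nabla\varphi - f\bigr)^2 \,\mathrm{d}x
+ \gamma \int_\Omega |\nabla f|^2 \,\mathrm{d}x
```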
Color Image Recovery Using Generalized Matrix Completion over Higher-Order Finite Dimensional Algebra
paper_authors: Liang Liao, Zhuang Guo, Qi Gao, Yan Wang, Fajun Yu, Qifeng Zhao, Stephen John Maybank
for: High-accuracy recovery of color images with missing entries.
methods: A generalized higher-order matrix ("t-matrix") model that incorporates a pixel-neighborhood expansion strategy to characterize local pixel constraints.
results: Extensive experiments on various algorithms, using simulated data and publicly available images, show that the generalized matrix completion model and the corresponding algorithm compare favorably with their lower-order tensor and conventional matrix counterparts.
Abstract
To improve the accuracy of color image completion with missing entries, we present a recovery method based on generalized higher-order scalars. We extend the traditional second-order matrix model to a more comprehensive higher-order matrix equivalent, called the "t-matrix" model, which incorporates a pixel neighborhood expansion strategy to characterize the local pixel constraints. This "t-matrix" model is then used to extend some commonly used matrix and tensor completion algorithms to their higher-order versions. We perform extensive experiments on various algorithms using simulated data and publicly available images and compare their performance. The results show that our generalized matrix completion model and the corresponding algorithm compare favorably with their lower-order tensor and conventional matrix counterparts.
Frequency Disentangled Features in Neural Image Compression
paper_authors: Ali Zafari, Atefeh Khoshkhahtinat, Piyush Mehta, Mohammad Saeed Ebrahimi Saadabadi, Mohammad Akyash, Nasser M. Nasrabadi
for: Improving neural image compression network design so that the entropy model better matches the true distribution of the latent code.
methods: Feature-level frequency disentanglement to aid relaxed scalar quantization, an augmented Hadamard-product-based self-attention score calculation, and a channel-wise autoregressive entropy model.
results: The proposed frequency-disentangled network lowers bit rates and outperforms both hand-engineered codecs and neural codecs built on computation-heavy spatially autoregressive entropy models.
Abstract
The design of a neural image compression network is governed by how well the entropy model matches the true distribution of the latent code. Apart from model capacity, this ability is indirectly affected by how closely the relaxed quantization approximates the actual hard quantization, since optimizing the parameters of a rate-distortion variational autoencoder (R-D VAE) is governed by this approximated quantization scheme. In this paper, we propose a feature-level frequency disentanglement to help the relaxed scalar quantization achieve lower bit rates by guiding the high-entropy latent features to include most of the low-frequency texture of the image. In addition, to strengthen the de-correlating power of the transformer-based analysis/synthesis transform, an augmented self-attention score calculation based on the Hadamard product is utilized during both encoding and decoding. Channel-wise autoregressive entropy modeling takes advantage of the proposed frequency separation as it inherently directs high-informational low-frequency channels to the first chunks and conditions the future chunks on them. The proposed network not only outperforms hand-engineered codecs, but also neural network-based codecs built on computation-heavy spatially autoregressive entropy models.
Brain MRI Segmentation using Template-Based Training and Visual Perception Augmentation
results: Using this approach, 3D U-Net models were trained for mouse, rat, marmoset, rhesus, and human brain MRI, achieving high accuracy on segmentation and inference tasks. The tool effectively addresses the shortage of training data in deep learning applications and holds significant potential for expanding deep learning in image analysis.
Abstract
Deep learning models usually require sufficient training data to achieve high accuracy, but obtaining labeled data can be time-consuming and labor-intensive. Here we introduce a template-based training method to train a 3D U-Net model from scratch using only one population-averaged brain MRI template and its associated segmentation label. The process incorporated visual perception augmentation to enhance the model's robustness in handling diverse image inputs and mitigating overfitting. Leveraging this approach, we trained 3D U-Net models for mouse, rat, marmoset, rhesus, and human brain MRI to achieve segmentation tasks such as skull-stripping, brain segmentation, and tissue probability mapping. This tool effectively addresses the limited availability of training data and holds significant potential for expanding deep learning applications in image analysis, providing researchers with a unified solution to train deep neural networks with only one image sample.
T-UNet: Triplet UNet for Change Detection in High-Resolution Remote Sensing Images
paper_authors: Huan Zhong, Chen Wu
for: This paper aims to improve remote sensing image change detection by proposing a novel network called Triplet UNet (T-UNet), which can simultaneously extract object features and change features between pre- and post-time-phase images.
methods: The proposed T-UNet uses a three-branch encoder and a multi-branch spatial-spectral cross-attention module (MBSSCA) to interact and fuse the features extracted from the three branches. The network also uses a channel attention mechanism (CAM) and a spatial attention mechanism (SAM) in the decoder stage to fully mine and integrate detailed texture information and semantic localization information.
results: The proposed T-UNet can accurately detect changes between remote sensing images acquired at different times and can effectively discern the edges of changed objects. Its triplet encoder and MBSSCA module allow it to simultaneously extract object features and change features, leading to more accurate results.
Remote sensing image change detection aims to identify the differences between images acquired at different times in the same area. It is widely used in land management, environmental monitoring, disaster assessment and other fields. Currently, most change detection methods are based on Siamese network structure or early fusion structure. Siamese structure focuses on extracting object features at different times but lacks attention to change information, which leads to false alarms and missed detections. Early fusion (EF) structure focuses on extracting features after the fusion of images of different phases but ignores the significance of object features at different times for detecting change details, making it difficult to accurately discern the edges of changed objects. To address these issues and obtain more accurate results, we propose a novel network, Triplet UNet(T-UNet), based on a three-branch encoder, which is capable to simultaneously extract the object features and the change features between the pre- and post-time-phase images through triplet encoder. To effectively interact and fuse the features extracted from the three branches of triplet encoder, we propose a multi-branch spatial-spectral cross-attention module (MBSSCA). In the decoder stage, we introduce the channel attention mechanism (CAM) and spatial attention mechanism (SAM) to fully mine and integrate detailed textures information at the shallow layer and semantic localization information at the deep layer.
Multi-attacks: Many images $+$ the same adversarial attack $\to$ many target labels
results: The paper identifies several immediate consequences, including adversarial attacks that change the resulting class based on their intensity, and scale-independent adversarial examples. It also demonstrates the redundancy and richness of class decision boundaries in pixel space, and shows that ensembling classifiers reduces susceptibility to multi-attacks.
Abstract
We show that we can easily design a single adversarial perturbation $P$ that changes the class of $n$ images $X_1,X_2,\dots,X_n$ from their original, unperturbed classes $c_1, c_2,\dots,c_n$ to desired (not necessarily all the same) classes $c^*_1,c^*_2,\dots,c^*_n$ for up to hundreds of images and target classes at once. We call these "multi-attacks". Characterizing the maximum $n$ we can achieve under different conditions such as image resolution, we estimate the number of regions of high class confidence around a particular image in the space of pixels to be around $10^{\mathcal{O}(100)}$, posing a significant problem for exhaustive defense strategies. We show several immediate consequences of this: adversarial attacks that change the resulting class based on their intensity, and scale-independent adversarial examples. To demonstrate the redundancy and richness of class decision boundaries in the pixel space, we look for its two-dimensional sections that trace images and spell words using particular classes. We also show that ensembling reduces susceptibility to multi-attacks, and that classifiers trained on random labels are more susceptible. Our code is available on GitHub.
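The multi-attack itself reduces to a small optimization problem: one shared perturbation trained with a cross-entropy loss over per-image targets. Below is a hedged PyTorch sketch; the step count, learning rate, and the optional infinity-norm budget `eps` are assumptions, and the paper's exact objective and constraints may differ.

```python
import torch
import torch.nn.functional as F

def multi_attack(model, images, target_labels, steps=200, lr=1e-2, eps=8/255):
    """Optimize ONE perturbation P that simultaneously maps each image in the
    batch to its own target class (illustrative re-implementation of the idea)."""
    P = torch.zeros_like(images[0], requires_grad=True)
    opt = torch.optim.Adam([P], lr=lr)
    for _ in range(steps):
        logits = model((images + P).clamp(0, 1))        # P broadcasts over the batch
        loss = F.cross_entropy(logits, target_labels)   # one target per image
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            P.clamp_(-eps, eps)                         # optional norm budget (assumed)
    return P.detach()
```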
A Parameter-efficient Multi-subject Model for Predicting fMRI Activity
for: This paper is written for the Algonauts 2023 submission, and it describes the team “BlobGPT”‘s model and its components.
methods: The paper uses a multi-subject linear encoding head attached to a pretrained trunk model, which consists of three components: a shared multi-layer feature projection, shared plus subject-specific low-dimension linear transformations, and a shared PCA fMRI embedding.
results: The paper presents experimental results for the team's model, which demonstrate the effectiveness of its components.
Abstract
This is the Algonauts 2023 submission report for team "BlobGPT". Our model consists of a multi-subject linear encoding head attached to a pretrained trunk model. The multi-subject head consists of three components: (1) a shared multi-layer feature projection, (2) shared plus subject-specific low-dimension linear transformations, and (3) a shared PCA fMRI embedding. In this report, we explain these components in more detail and present some experimental results. Our code is available at https://github.com/cmi-dair/algonauts23.
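The three head components map naturally onto a short PyTorch module, sketched below with assumed dimensions and a random stand-in for the fixed PCA basis; see the team's repository for the actual implementation.

```python
import torch
import torch.nn as nn

class MultiSubjectHead(nn.Module):
    """Shared feature projection, shared + subject-specific linear maps into a
    low-dimensional space, and a fixed PCA basis back to voxels (a sketch)."""
    def __init__(self, feat_dim, hidden, pca_dim, n_subjects, n_voxels):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        self.shared = nn.Linear(hidden, pca_dim)
        self.per_subject = nn.ModuleList(
            [nn.Linear(hidden, pca_dim) for _ in range(n_subjects)])
        # Random stand-in for the shared PCA fMRI embedding (fixed, not trained)
        self.register_buffer("pca_basis", torch.randn(pca_dim, n_voxels))

    def forward(self, features, subject_id):
        h = self.proj(features)
        z = self.shared(h) + self.per_subject[subject_id](h)
        return z @ self.pca_basis

head = MultiSubjectHead(512, 256, 100, n_subjects=8, n_voxels=20000)
print(head(torch.randn(4, 512), subject_id=0).shape)   # torch.Size([4, 20000])
```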
RobustMQ: Benchmarking Robustness of Quantized Models
results: The study finds that quantized models exhibit higher adversarial robustness than their floating-point counterparts, but are more vulnerable to natural corruptions and systematic noises. Increasing the quantization bit-width decreases adversarial robustness while increasing natural and systematic robustness. Among corruption methods, impulse noise and glass blur are the most harmful to quantized models, while brightness has the least impact; among systematic noises, nearest-neighbor interpolation has the highest impact, while bilinear, cubic, and area interpolation are the three least harmful.
Abstract
Quantization has emerged as an essential technique for deploying deep neural networks (DNNs) on devices with limited resources. However, quantized models exhibit vulnerabilities when exposed to various noises in real-world applications. Despite the importance of evaluating the impact of quantization on robustness, existing research on this topic is limited and often disregards established principles of robustness evaluation, resulting in incomplete and inconclusive findings. To address this gap, we thoroughly evaluated the robustness of quantized models against various noises (adversarial attacks, natural corruptions, and systematic noises) on ImageNet. The comprehensive evaluation results empirically provide valuable insights into the robustness of quantized models in various scenarios, for example: (1) quantized models exhibit higher adversarial robustness than their floating-point counterparts, but are more vulnerable to natural corruptions and systematic noises; (2) in general, increasing the quantization bit-width results in a decrease in adversarial robustness, an increase in natural robustness, and an increase in systematic robustness; (3) among corruption methods, \textit{impulse noise} and \textit{glass blur} are the most harmful to quantized models, while \textit{brightness} has the least impact; (4) among systematic noises, the \textit{nearest neighbor interpolation} has the highest impact, while bilinear interpolation, cubic interpolation, and area interpolation are the three least harmful. Our research contributes to advancing the robust quantization of models and their deployment in real-world scenarios.
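As a rough illustration of the kind of evaluation such a benchmark performs, the sketch below fake-quantizes a model's weights with uniform symmetric per-tensor quantization and measures top-1 accuracy under an arbitrary corruption function. This is a minimal stand-in (the function names and weight-only PTQ scheme are our simplifications), not RobustMQ's actual pipeline:

```python
import copy
import torch

def quantize_weights(model, bits=8):
    """Uniform symmetric per-tensor fake quantization of the weights only --
    a simplified stand-in for the PTQ settings the benchmark covers."""
    q = copy.deepcopy(model)
    qmax = 2 ** (bits - 1) - 1
    with torch.no_grad():
        for p in q.parameters():
            scale = p.abs().max() / qmax
            if scale > 0:
                p.copy_((p / scale).round().clamp(-qmax, qmax) * scale)
    return q

@torch.no_grad()
def accuracy_under(model, loader, corrupt=None):
    """Top-1 accuracy, optionally applying a corruption to each input batch
    (e.g. impulse noise, glass blur, or an interpolation-based resize)."""
    model.eval()
    correct = total = 0
    for x, y in loader:
        if corrupt is not None:
            x = corrupt(x)
        correct += (model(x).argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total
```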
Class Incremental Learning with Self-Supervised Pre-Training and Prototype Learning
results: Under the 10-phase incremental setting, the method significantly outperforms state-of-the-art exemplar-based methods that reserve 5 exemplars per class, by 18.24% on CIFAR-100 and 9.37% on ImageNet100.
Abstract
Deep Neural Networks (DNNs) have achieved great success on datasets with a closed class set. However, new classes, like new categories of social media topics, are continuously added in the real world, making it necessary to learn incrementally. This is hard for DNNs because they tend to focus on fitting to new classes while ignoring old classes, a phenomenon known as catastrophic forgetting. State-of-the-art methods rely on knowledge distillation and data replay techniques but still have limitations. In this work, we analyze the causes of catastrophic forgetting in class incremental learning and attribute it to three factors: representation drift, representation confusion, and classifier distortion. Based on this view, we propose a two-stage learning framework with a fixed encoder and an incrementally updated prototype classifier. The encoder is trained with self-supervised learning to generate a feature space with high intrinsic dimensionality, thus improving its transferability and generality. The classifier incrementally learns new prototypes while retaining the prototypes of previously learned data, which is crucial in preserving the decision boundary. Our method does not rely on preserved samples of old classes and is thus a non-exemplar-based CIL method. Experiments on public datasets show that our method significantly outperforms state-of-the-art exemplar-based methods that reserve 5 exemplars per class, under the incremental setting of 10 phases, by 18.24% on CIFAR-100 and 9.37% on ImageNet100.
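The incrementally updated prototype classifier can be sketched compactly: for each new class, the mean feature of its training samples under the frozen encoder becomes the class prototype, and inference is a nearest-prototype rule. The class and method names below are hypothetical, and the paper's distance metric and prototype update may differ:

```python
import torch

class PrototypeClassifier:
    """Non-exemplar prototype classifier on top of a frozen encoder: new
    class prototypes are added incrementally, old ones are kept untouched."""
    def __init__(self):
        self.prototypes = {}                        # class id -> mean feature

    @torch.no_grad()
    def add_classes(self, encoder, loader):
        feats, labels = [], []
        for x, y in loader:                         # data of the *new* classes only
            feats.append(encoder(x))
            labels.append(y)
        feats, labels = torch.cat(feats), torch.cat(labels)
        for c in labels.unique().tolist():
            self.prototypes[c] = feats[labels == c].mean(0)

    @torch.no_grad()
    def predict(self, encoder, x):
        classes = sorted(self.prototypes)
        protos = torch.stack([self.prototypes[c] for c in classes])
        d = torch.cdist(encoder(x), protos)         # nearest-prototype rule
        return torch.tensor(classes)[d.argmin(1)]
```

Because old prototypes are never rewritten, the decision boundary of previously learned classes is preserved by construction, which is the point the abstract makes.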
Generative Image Priors for MRI Reconstruction Trained from Magnitude-Only Images
paper_authors: Guanxiong Luo, Xiaoqing Wang, Moritz Blumenthal, Martin Schilling, Erik Hans Ulrich Rauf, Raviteja Kotikalapudi, Niels Focke, Martin Uecker
results: Experiments show that priors trained on complex images outperform priors trained only on magnitude images, and that a prior trained on a larger dataset is more robust. The generative priors are also superior to L1-wavelet regularization for compressed sensing parallel imaging with high undersampling.
Abstract
Purpose: In this work, we present a workflow to construct generic and robust generative image priors from magnitude-only images. The priors can then be used for regularization in reconstruction to improve image quality. Methods: The workflow begins with the preparation of training datasets from magnitude-only MR images. This dataset is then augmented with phase information and used to train generative priors of complex images. Finally, trained priors are evaluated using both linear and nonlinear reconstruction for compressed sensing parallel imaging with various undersampling schemes. Results: The results of our experiments demonstrate that priors trained on complex images outperform priors trained only on magnitude images. Additionally, a prior trained on a larger dataset exhibits higher robustness. Finally, we show that the generative priors are superior to L1-wavelet regularization for compressed sensing parallel imaging with high undersampling. Conclusion: These findings stress the importance of incorporating phase information and leveraging large datasets to raise the performance and reliability of the generative priors for MRI reconstruction. Phase augmentation makes it possible to use existing image databases for training.
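A minimal version of the phase-augmentation step might look as follows: a magnitude image is combined with a randomly drawn smooth phase map to produce a complex-valued training sample. The low-order polynomial phase model is our assumption; the abstract does not specify the phase distribution used:

```python
import numpy as np

def phase_augment(magnitude, max_order=4, rng=None):
    """Create a synthetic complex-valued MR image from a magnitude-only one
    by multiplying with a random smooth phase map (a low-order polynomial
    here; the workflow's actual phase model is an assumption on our part)."""
    rng = rng or np.random.default_rng()
    h, w = magnitude.shape
    y, x = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w), indexing="ij")
    phase = np.zeros((h, w))
    for i in range(max_order):
        for j in range(max_order - i):
            # lower weight for higher-order terms keeps the phase smooth
            phase += rng.normal(scale=1.0 / (1 + i + j)) * x**i * y**j
    return magnitude * np.exp(1j * np.pi * phase)   # complex training sample
```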
Improving Scene Graph Generation with Superpixel-Based Interaction Learning
results: Extensive experiments show that SIL achieves state-of-the-art performance on the Visual Genome and Open Image V6 benchmarks, and that it can be combined with existing box-level methods to boost their performance.
Abstract
Recent advances in Scene Graph Generation (SGG) typically model the relationships among entities utilizing box-level features from pre-defined detectors. We argue that an overlooked problem in SGG is the coarse-grained interactions between boxes, which inadequately capture contextual semantics for relationship modeling, practically limiting the development of the field. In this paper, we take the initiative to explore and propose a generic paradigm termed Superpixel-based Interaction Learning (SIL) to remedy coarse-grained interactions at the box level. It allows us to model fine-grained interactions at the superpixel level in SGG. Specifically, (i) we treat a scene as a set of points and cluster them into superpixels representing sub-regions of the scene. (ii) We explore intra-entity and cross-entity interactions among the superpixels to enrich fine-grained interactions between entities at an earlier stage. Extensive experiments on two challenging benchmarks (Visual Genome and Open Image V6) prove that our SIL enables fine-grained interaction at the superpixel level above previous box-level methods, and significantly outperforms previous state-of-the-art methods across all metrics. More encouragingly, the proposed method can be applied to boost the performance of existing box-level approaches in a plug-and-play fashion. In particular, SIL brings an average improvement of 2.0% mR (even up to 3.4%) of baselines for the PredCls task on Visual Genome, which facilitates its integration into any existing box-level method.
Diffusion-Augmented Depth Prediction with Sparse Annotations
results: Evaluated on three driving benchmarks, the DADP framework achieves significant improvements in depth structures and robustness, offering a new perspective on depth estimation with sparse annotations in autonomous driving scenes.
Abstract
Depth estimation aims to predict dense depth maps. In autonomous driving scenes, sparsity of annotations makes the task challenging. Supervised models produce concave objects due to insufficient structural information. They overfit to valid pixels and fail to restore spatial structures. Self-supervised methods are proposed for the problem. Their robustness is limited by pose estimation, leading to erroneous results in natural scenes. In this paper, we propose a supervised framework termed Diffusion-Augmented Depth Prediction (DADP). We leverage the structural characteristics of diffusion model to enforce depth structures of depth models in a plug-and-play manner. An object-guided integrality loss is also proposed to further enhance regional structure integrality by fetching objective information. We evaluate DADP on three driving benchmarks and achieve significant improvements in depth structures and robustness. Our work provides a new perspective on depth estimation with sparse annotations in autonomous driving scenes.
SURE-Val: Safe Urban Relevance Extension and Validation
results: The relevance validation procedure is verified using criteria specifically designed to exclude relevant objects, and is successfully applied to the relevance criteria proposed in this work, supporting their validity.
Abstract
To evaluate perception components of an automated driving system, it is necessary to define the relevant objects. While the urban domain is popular among perception datasets, relevance is insufficiently specified for this domain. Therefore, this work adopts an existing method to define relevance in the highway domain and expands it to the urban domain. While different conceptualizations and definitions of relevance are present in literature, there is a lack of methods to validate these definitions. Therefore, this work presents a novel relevance validation method leveraging a motion prediction component. The validation leverages the idea that removing irrelevant objects should not influence a prediction component which reflects human driving behavior. The influence on the prediction is quantified by considering the statistical distribution of prediction performance across a large-scale dataset. The validation procedure is verified using criteria specifically designed to exclude relevant objects. The validation method is successfully applied to the relevance criteria from this work, thus supporting their validity.
On the Calibration of Uncertainty Estimation in LiDAR-based Semantic Segmentation
results: The method can also be used to automatically find label problems, improving the quality of hand- or auto-annotated datasets.
Abstract
The confidence calibration of deep learning-based perception models plays a crucial role in their reliability. Especially in the context of autonomous driving, downstream tasks like prediction and planning depend on accurate confidence estimates. In point-wise multiclass classification tasks like semantic segmentation, the model has to deal with heavy class imbalances. Due to their underrepresentation, the confidence calibration of classes with fewer instances is challenging but essential, not only for safety reasons. We propose a metric to measure the confidence calibration quality of a semantic segmentation model with respect to individual classes. It is calculated by computing sparsification curves for each class based on the uncertainty estimates. We use the classification calibration metric to evaluate uncertainty estimation methods with respect to their confidence calibration of underrepresented classes. We furthermore suggest a double use for the method to automatically find label problems to improve the quality of hand- or auto-annotated datasets.
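The per-class metric can be sketched as follows: within one class, pixels are removed in order of decreasing estimated uncertainty, and the mean error of the remaining pixels is compared against an oracle that removes the truly erroneous pixels first; the area between the two curves then summarizes calibration quality for that class. The NumPy sketch below illustrates this construction (the step count, error definition, and area summary are our assumptions, not the paper's exact metric):

```python
import numpy as np

def class_sparsification_area(errors, uncertainty, labels, cls, n_steps=20):
    """Sparsification analysis restricted to one class: remove the most
    uncertain pixels first and track the mean error of what remains. Lower
    area between the curve and its oracle = better-calibrated uncertainty."""
    mask = labels == cls
    err, unc = errors[mask].ravel(), uncertainty[mask].ravel()
    order = np.argsort(-unc)       # most uncertain pixels first
    oracle = np.argsort(-err)      # truly worst pixels first (oracle ranking)
    fracs = np.linspace(0.0, 0.99, n_steps)
    curve, best = [], []
    for f in fracs:
        keep = int(len(err) * (1 - f))
        curve.append(err[order][-keep:].mean())   # least-uncertain pixels kept
        best.append(err[oracle][-keep:].mean())   # least-erroneous pixels kept
    return np.trapz(np.array(curve) - np.array(best), fracs)
```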
Improving Human-Object Interaction Detection via Virtual Image Learning
results: The method achieves significant improvements on two benchmarks and sets new state-of-the-art results.
Abstract
Human-Object Interaction (HOI) detection aims to understand the interactions between humans and objects, which plays a crucial role in high-level semantic understanding tasks. However, most works pursue designing better architectures to learn overall features more efficiently, while ignoring the long-tail nature of interaction-object pair categories. In this paper, we propose to alleviate the impact of such an unbalanced distribution via Virtual Image Learning (VIL). Firstly, a novel label-to-image approach, Multiple Steps Image Creation (MUSIC), is proposed to create a high-quality dataset that has a consistent distribution with real images. In this stage, virtual images are generated based on prompts with specific characterizations and selected by multi-filtering processes. Secondly, we use both virtual and real images to train the model with the teacher-student framework. Considering the initial labels of some virtual images are inaccurate and inadequate, we devise an Adaptive Matching-and-Filtering (AMF) module to construct pseudo-labels. Our method is independent of the internal structure of HOI detectors, so it can be combined with off-the-shelf methods by training merely 10 additional epochs. With the assistance of our method, multiple methods obtain significant improvements, and new state-of-the-art results are achieved on two benchmarks.
MSECNet: Accurate and Robust Normal Estimation for 3D Point Clouds by Multi-Scale Edge Conditioning
for: surface normal estimation from 3D point clouds, especially in regions with rapidly changing normals
methods: MSECNet, a novel approach that treats normal variation modeling as an edge detection problem, consisting of a backbone network and a multi-scale edge conditioning (MSEC) stream
results: outperforms existing methods on both synthetic and real-world datasets while running significantly faster, and demonstrates effectiveness in surface reconstruction
Abstract
Estimating surface normals from 3D point clouds is critical for various applications, including surface reconstruction and rendering. While existing methods for normal estimation perform well in regions where normals change slowly, they tend to fail where normals vary rapidly. To address this issue, we propose a novel approach called MSECNet, which improves estimation in normal varying regions by treating normal variation modeling as an edge detection problem. MSECNet consists of a backbone network and a multi-scale edge conditioning (MSEC) stream. The MSEC stream achieves robust edge detection through multi-scale feature fusion and adaptive edge detection. The detected edges are then combined with the output of the backbone network using the edge conditioning module to produce edge-aware representations. Extensive experiments show that MSECNet outperforms existing methods on both synthetic (PCPNet) and real-world (SceneNN) datasets while running significantly faster. We also conduct various analyses to investigate the contribution of each component in the MSEC stream. Finally, we demonstrate the effectiveness of our approach in surface reconstruction.
FB-BEV: BEV Representation from Forward-Backward View Transformations
results: The proposed forward-backward view transformation module, instantiated as FB-BEV, achieves a new state-of-the-art result of 62.4% NDS on the nuScenes test set. Code and models are available at https://github.com/NVlabs/FB-BEV.
Abstract
View Transformation Module (VTM), where transformations happen between multi-view image features and Bird-Eye-View (BEV) representation, is a crucial step in camera-based BEV perception systems. Currently, the two most prominent VTM paradigms are forward projection and backward projection. Forward projection, represented by Lift-Splat-Shoot, leads to sparsely projected BEV features without post-processing. Backward projection, with BEVFormer being an example, tends to generate false-positive BEV features from incorrect projections due to the lack of utilization on depth. To address the above limitations, we propose a novel forward-backward view transformation module. Our approach compensates for the deficiencies in both existing methods, allowing them to enhance each other to obtain higher quality BEV representations mutually. We instantiate the proposed module with FB-BEV, which achieves a new state-of-the-art result of 62.4% NDS on the nuScenes test set. Code and models are available at https://github.com/NVlabs/FB-BEV.
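The backward ("pull") half of such a view transformation is easy to sketch: 3D points of the BEV grid are projected into the image plane and features are bilinearly sampled there. The PyTorch sketch below covers only this backward direction for a single camera, with no depth weighting; `K` is assumed to be scaled to the feature-map resolution, and none of this is the paper's full forward-backward module:

```python
import torch
import torch.nn.functional as F

def backward_project(img_feats, bev_points, K, T_cam_from_ego):
    """Backward ('pull') view transformation: project 3D BEV grid points into
    the image and bilinearly sample features there. Single-camera sketch."""
    B, C, H, W = img_feats.shape
    # ego -> camera coordinates, then pinhole projection
    pts = bev_points @ T_cam_from_ego[:3, :3].T + T_cam_from_ego[:3, 3]
    uvz = pts @ K.T
    uv = uvz[:, :2] / uvz[:, 2:3].clamp(min=1e-5)       # pixel coordinates
    # normalize to [-1, 1] as expected by grid_sample
    grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1,
                        uv[:, 1] / (H - 1) * 2 - 1], dim=-1)
    grid = grid.view(1, 1, -1, 2).expand(B, -1, -1, -1)
    sampled = F.grid_sample(img_feats, grid, align_corners=True)  # (B, C, 1, N)
    valid = (uvz[:, 2] > 0).view(1, 1, 1, -1)           # points in front of camera
    return sampled * valid
```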
Painterly Image Harmonization using Diffusion Model
results: Compared with state-of-the-art models from related fields, PHDiffusion stylizes the foreground more sufficiently while simultaneously retaining finer content.
Abstract
Painterly image harmonization aims to insert photographic objects into paintings and obtain artistically coherent composite images. Previous methods for this task mainly rely on inference optimization or generative adversarial network, but they are either very time-consuming or struggling at fine control of the foreground objects (e.g., texture and content details). To address these issues, we propose a novel Painterly Harmonization stable Diffusion model (PHDiffusion), which includes a lightweight adaptive encoder and a Dual Encoder Fusion (DEF) module. Specifically, the adaptive encoder and the DEF module first stylize foreground features within each encoder. Then, the stylized foreground features from both encoders are combined to guide the harmonization process. During training, besides the noise loss in diffusion model, we additionally employ content loss and two style losses, i.e., AdaIN style loss and contrastive style loss, aiming to balance the trade-off between style migration and content preservation. Compared with the state-of-the-art models from related fields, our PHDiffusion can stylize the foreground more sufficiently and simultaneously retain finer content. Our code and model are available at https://github.com/bcmi/PHDiffusion-Painterly-Image-Harmonization.
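Of the losses listed, the AdaIN style loss has a standard formulation: match the per-channel mean and standard deviation of the generated image's encoder features to those of the style image across several layers. The sketch below uses this common form; whether it matches PHDiffusion's exact implementation is an assumption:

```python
import torch
import torch.nn.functional as F

def adain_style_loss(gen_feats, style_feats):
    """Match per-channel feature statistics (mean/std) of the generated image
    to the style image over several encoder layers -- the usual AdaIN style
    loss; treating it as PHDiffusion's exact formulation is an assumption."""
    loss = 0.0
    for fg, fs in zip(gen_feats, style_feats):   # lists of (B, C, H, W) maps
        loss = loss + F.mse_loss(fg.mean(dim=(2, 3)), fs.mean(dim=(2, 3)))
        loss = loss + F.mse_loss(fg.std(dim=(2, 3)), fs.std(dim=(2, 3)))
    return loss
```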
Deep Semantic Model Fusion for Ancient Agricultural Terrace Detection
results: The proposed method won first prize in the International AI Archaeology Challenge, demonstrating its effectiveness and reliability.
Abstract
Discovering ancient agricultural terraces in desert regions is important for the monitoring of long-term climate changes on the Earth's surface. However, traditional ground surveys are both costly and limited in scale. With the increasing accessibility of aerial and satellite data, machine learning techniques bear large potential for the automatic detection and recognition of archaeological landscapes. In this paper, we propose a deep semantic model fusion method for ancient agricultural terrace detection. The input data includes aerial images and LiDAR generated terrain features in the Negev desert. Two deep semantic segmentation models, namely DeepLabv3+ and UNet, with EfficientNet backbone, are trained and fused to provide segmentation maps of ancient terraces and walls. The proposed method won the first prize in the International AI Archaeology Challenge. Codes are available at https://github.com/wangyi111/international-archaeology-ai-challenge.
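One simple reading of "deep semantic model fusion" is late fusion: average the softmax probability maps of the trained DeepLabv3+ and UNet models and take the argmax. The sketch below shows this baseline-style fusion; the winning entry's exact fusion rule may be more elaborate:

```python
import torch

@torch.no_grad()
def fuse_segmentations(models, image):
    """Late fusion of semantic segmentation models by averaging their softmax
    probabilities -- one plausible reading of 'deep semantic model fusion';
    the actual fusion rule used by the entry may differ."""
    probs = [m(image).softmax(dim=1) for m in models]  # each (B, classes, H, W)
    mean = torch.stack(probs).mean(dim=0)
    return mean.argmax(dim=1)                          # fused terrace/wall mask
```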
Balanced Classification: A Unified Framework for Long-Tailed Object Detection
results: On the challenging LVIS benchmark, BACL surpasses vanilla Faster R-CNN with ResNet-50-FPN by 5.8% AP overall and 16.1% AP on tail categories, and consistently improves performance across different datasets and architectures.
Abstract
Conventional detectors suffer from performance degradation when dealing with long-tailed data due to a classification bias towards the majority head categories. In this paper, we contend that the learning bias originates from two factors: 1) the unequal competition arising from the imbalanced distribution of foreground categories, and 2) the lack of sample diversity in tail categories. To tackle these issues, we introduce a unified framework called BAlanced CLassification (BACL), which enables adaptive rectification of inequalities caused by disparities in category distribution and dynamic intensification of sample diversities in a synchronized manner. Specifically, a novel foreground classification balance loss (FCBL) is developed to ameliorate the domination of head categories and shift attention to difficult-to-differentiate categories by introducing pairwise class-aware margins and auto-adjusted weight terms, respectively. This loss prevents the over-suppression of tail categories in the context of unequal competition. Moreover, we propose a dynamic feature hallucination module (FHM), which enhances the representation of tail categories in the feature space by synthesizing hallucinated samples to introduce additional data variances. In this divide-and-conquer approach, BACL sets a new state-of-the-art on the challenging LVIS benchmark with a decoupled training pipeline, surpassing vanilla Faster R-CNN with ResNet-50-FPN by 5.8% AP and 16.1% AP for overall and tail categories. Extensive experiments demonstrate that BACL consistently achieves performance improvements across various datasets with different backbones and architectures. Code and models are available at https://github.com/Tianhao-Qi/BACL.
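The flavor of a class-aware margin loss can be conveyed with a logit-adjustment sketch: per-class margins derived from class frequencies are added to the logits before cross-entropy, so the network must learn larger raw margins for tail classes. This is only in the spirit of FCBL, which additionally uses pairwise class-aware margins and auto-adjusted weight terms; the function below is a hypothetical simplification:

```python
import torch
import torch.nn.functional as F

def margin_balanced_ce(logits, targets, class_freq, tau=1.0):
    """Logit-adjustment sketch in the spirit of FCBL: adding the log-prior
    gives head classes a bonus inside the softmax, forcing the network to
    learn larger raw margins for tail classes. FCBL itself is more elaborate;
    this is a simplification."""
    log_prior = torch.log(class_freq / class_freq.sum())  # (num_classes,)
    return F.cross_entropy(logits + tau * log_prior, targets)
```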
Paired Competing Neurons Improving STDP Supervised Local Learning In Spiking Neural Networks
results: The methods outperform the current supervised-STDP state of the art on MNIST, Fashion-MNIST, and CIFAR-10 for comparable architectures and numbers of neurons. PCN further enhances S2-STDP regardless of configuration and without introducing any hyperparameters, and the methods show improved hyperparameter robustness, reducing the need for tuning.
Abstract
Direct training of Spiking Neural Networks (SNNs) on neuromorphic hardware has the potential to significantly reduce the high energy consumption of Artificial Neural Networks (ANNs) training on modern computers. The biological plausibility of SNNs allows them to benefit from bio-inspired plasticity rules, such as Spike Timing-Dependent Plasticity (STDP). STDP offers gradient-free and unsupervised local learning, which can be easily implemented on neuromorphic hardware. However, relying solely on unsupervised STDP to perform classification tasks is not enough. In this paper, we propose Stabilized Supervised STDP (S2-STDP), a supervised STDP learning rule to train the classification layer of an SNN equipped with unsupervised STDP. S2-STDP integrates error-modulated weight updates that align neuron spikes with desired timestamps derived from the average firing time within the layer. Then, we introduce a training architecture called Paired Competing Neurons (PCN) to further enhance the learning capabilities of our classification layer trained with S2-STDP. PCN associates each class with paired neurons and encourages neuron specialization through intra-class competition. We evaluated our proposed methods on image recognition datasets, including MNIST, Fashion-MNIST, and CIFAR-10. Results showed that our methods outperform current supervised STDP-based state of the art, for comparable architectures and numbers of neurons. Also, the use of PCN enhances the performance of S2-STDP, regardless of the configuration, and without introducing any hyperparameters. Further analysis demonstrated that our methods exhibited improved hyperparameter robustness, which reduces the need for tuning.
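An error-modulated STDP update can be sketched as the classic exponential STDP kernel scaled by a timing error, so that a neuron firing later than its desired timestamp has its causal inputs strengthened. All constants and the exact error term below are assumptions; the rule is illustrative, not the paper's:

```python
import numpy as np

def s2_stdp_like_update(w, pre_t, post_t, target_t, lr=0.01, tau=10.0):
    """Illustrative error-modulated STDP: the exponential STDP kernel between
    pre- and post-synaptic spike times, scaled by the neuron's timing error.
    Constants and the error term are assumptions, not the paper's rule."""
    error = post_t - target_t            # > 0: the neuron fired later than desired
    dt = post_t - pre_t                  # > 0: causal (pre-before-post) synapse
    kernel = np.sign(dt) * np.exp(-np.abs(dt) / tau)
    # a late spike (error > 0) potentiates causal inputs, pulling the spike earlier
    return w + lr * error * kernel
```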
ES-MVSNet: Efficient Framework for End-to-end Self-supervised Multi-View Stereo
results: Extensive experiments on the DTU and Tanks&Temples benchmarks demonstrate that ES-MVSNet achieves state-of-the-art performance among end-to-end self-supervised MVS methods and competitive performance with many supervised and multi-stage self-supervised methods.
Abstract
Compared to the multi-stage self-supervised multi-view stereo (MVS) method, the end-to-end (E2E) approach has received more attention due to its concise and efficient training pipeline. Recent E2E self-supervised MVS approaches have integrated third-party models (such as optical flow models, semantic segmentation models, NeRF models, etc.) to provide additional consistency constraints, which grows GPU memory consumption and complicates the model's structure and training pipeline. In this work, we propose an efficient framework for end-to-end self-supervised MVS, dubbed ES-MVSNet. To alleviate the high memory consumption of current E2E self-supervised MVS frameworks, we present a memory-efficient architecture that reduces memory usage by 43% without compromising model performance. Furthermore, with the novel design of asymmetric view selection policy and region-aware depth consistency, we achieve state-of-the-art performance among E2E self-supervised MVS methods, without relying on third-party models for additional consistency signals. Extensive experiments on DTU and Tanks&Temples benchmarks demonstrate that the proposed ES-MVSNet approach achieves state-of-the-art performance among E2E self-supervised MVS methods and competitive performance to many supervised and multi-stage self-supervised methods.
Synthetic outlier generation for anomaly detection in autonomous driving
paper_authors: Martin Bikandi, Gorka Velez, Naiara Aginako, Itziar Irigoien
for: Improving anomaly detection performance in autonomous driving to prevent safety-critical incidents.
methods: Modifying the training stage of the state-of-the-art DenseHybrid model, and proposing a simplified detector.
results: Significant performance improvements in anomaly detection; the simplified detector achieves results comparable to the modified DenseHybrid approach while also surpassing the original DenseHybrid model.
Abstract
Anomaly detection, or outlier detection, is a crucial task in various domains to identify instances that significantly deviate from established patterns or the majority of data. In the context of autonomous driving, the identification of anomalies is particularly important to prevent safety-critical incidents, as deep learning models often exhibit overconfidence in anomalous or outlier samples. In this study, we explore different strategies for training an image semantic segmentation model with an anomaly detection module. By introducing modifications to the training stage of the state-of-the-art DenseHybrid model, we achieve significant performance improvements in anomaly detection. Moreover, we propose a simplified detector that achieves comparable results to our modified DenseHybrid approach, while also surpassing the performance of the original DenseHybrid model. These findings demonstrate the efficacy of our proposed strategies for enhancing anomaly detection in the context of autonomous driving.
Scene-aware Human Pose Generation using Transformer
paper_authors: Jieteng Yao, Junjie Chen, Li Niu, Bin Sheng
for: scene understanding and intelligent robotics
methods: template-based human pose generation, interaction between query embeddings and scene feature map, knowledge distillation
results: effective prediction of the scale and offsets for each pose template, with effectiveness demonstrated on the Sitcom dataset
Abstract
Affordance learning considers the interaction opportunities for an actor in the scene and thus has wide application in scene understanding and intelligent robotics. In this paper, we focus on contextual affordance learning, i.e., using affordance as context to generate a reasonable human pose in a scene. Existing scene-aware human pose generation methods could be divided into two categories depending on whether using pose templates. Our proposed method belongs to the template-based category, which benefits from the representative pose templates. Moreover, inspired by recent transformer-based methods, we associate each query embedding with a pose template, and use the interaction between query embeddings and scene feature map to effectively predict the scale and offsets for each pose template. In addition, we employ knowledge distillation to facilitate the offset learning given the predicted scale. Comprehensive experiments on Sitcom dataset demonstrate the effectiveness of our method.
Efficient Labelling of Affective Video Datasets via Few-Shot & Multi-Task Contrastive Learning
results: Experiments show that valence and arousal predictions via MT-CLAR on the AFEW-VA dataset are comparable to the state of the art, and that the method significantly outperforms it with a support set roughly 6% the size of the video dataset.
Abstract
Whilst deep learning techniques have achieved excellent emotion prediction, they still require large amounts of labelled training data, which are (a) onerous and tedious to compile, and (b) prone to errors and biases. We propose Multi-Task Contrastive Learning for Affect Representation (\textbf{MT-CLAR}) for few-shot affect inference. MT-CLAR combines multi-task learning with a Siamese network trained via contrastive learning to infer from a pair of expressive facial images (a) the (dis)similarity between the facial expressions, and (b) the difference in valence and arousal levels of the two faces. We further extend the image-based MT-CLAR framework for automated video labelling where, given one or a few labelled video frames (termed \textit{support-set}), MT-CLAR labels the remainder of the video for valence and arousal. Experiments are performed on the AFEW-VA dataset with multiple support-set configurations; moreover, supervised learning on representations learnt via MT-CLAR are used for valence, arousal and categorical emotion prediction on the AffectNet and AFEW-VA datasets. The results show that valence and arousal predictions via MT-CLAR are very comparable to the state-of-the-art (SOTA), and we significantly outperform SOTA with a support-set $\approx$6\% the size of the video dataset.
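The Siamese pairwise heads are easy to sketch: both images pass through a shared backbone, and the concatenated embeddings feed two small heads, one for expression (dis)similarity and one for the valence/arousal difference. The backbone choice and head sizes in the sketch are assumptions:

```python
import torch
import torch.nn as nn

class SiameseAffect(nn.Module):
    """Sketch of the pairwise heads: from two face embeddings, predict
    (a) expression (dis)similarity and (b) the valence/arousal difference.
    Backbone and head dimensions are assumptions, not the paper's."""
    def __init__(self, backbone, dim=512):
        super().__init__()
        self.backbone = backbone               # shared weights for both images
        self.sim_head = nn.Linear(2 * dim, 1)  # (dis)similarity logit
        self.delta_va = nn.Linear(2 * dim, 2)  # difference in valence, arousal

    def forward(self, img_a, img_b):
        za, zb = self.backbone(img_a), self.backbone(img_b)
        pair = torch.cat([za, zb], dim=1)
        return self.sim_head(pair), self.delta_va(pair)
```

Given a labelled support frame, nearby frames can then be labelled by adding the predicted valence/arousal difference to the support frame's annotation, which is how the automated video labelling above proceeds.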
Learning Referring Video Object Segmentation from Weak Annotation
results: The method achieves competitive segmentation performance without requiring dense mask annotations, as demonstrated by extensive experiments and ablative analyses.
Abstract
Referring video object segmentation (RVOS) is a task that aims to segment the target object in all video frames based on a sentence describing the object. Previous RVOS methods have achieved significant performance with densely-annotated datasets, whose construction is expensive and time-consuming. To relieve the burden of data annotation while maintaining sufficient supervision for segmentation, we propose a new annotation scheme, in which we label the frame where the object first appears with a mask and use bounding boxes for the subsequent frames. Based on this scheme, we propose a method to learn from this weak annotation. Specifically, we design a cross frame segmentation method, which uses the language-guided dynamic filters to thoroughly leverage the valuable mask annotation and bounding boxes. We further develop a bi-level contrastive learning method to encourage the model to learn discriminative representation at the pixel level. Extensive experiments and ablative analyses show that our method is able to achieve competitive performance without the demand of dense mask annotation. The code will be available at https://github.com/wangbo-zhao/WRVOS/.
M2Former: Multi-Scale Patch Selection for Fine-Grained Visual Recognition
methods: Multi-scale patch selection (MSPS), class token transfer (CTT), and multi-scale cross-attention (MSCA) to improve the model's multi-scale representation capabilities.
results: Richer object representations and consistently better performance than the previous single-scale patch selection (SSPS), with new state-of-the-art results on several widely used FGVR benchmarks.
Abstract
Recently, vision Transformers (ViTs) have been actively applied to fine-grained visual recognition (FGVR). ViT can effectively model the interdependencies between patch-divided object regions through an inherent self-attention mechanism. In addition, patch selection is used with ViT to remove redundant patch information and highlight the most discriminative object patches. However, existing ViT-based FGVR models are limited to single-scale processing, and their fixed receptive fields hinder representational richness and exacerbate vulnerability to scale variability. Therefore, we propose multi-scale patch selection (MSPS) to improve the multi-scale capabilities of existing ViT-based models. Specifically, MSPS selects salient patches of different scales at different stages of a multi-scale vision Transformer (MS-ViT). In addition, we introduce class token transfer (CTT) and multi-scale cross-attention (MSCA) to model cross-scale interactions between selected multi-scale patches and fully reflect them in model decisions. Compared to previous single-scale patch selection (SSPS), our proposed MSPS encourages richer object representations based on feature hierarchy and consistently improves performance from small-sized to large-sized objects. As a result, we propose M2Former, which outperforms CNN-/ViT-based models on several widely used FGVR benchmarks.
CTP-Net: Character Texture Perception Network for Document Image Forgery Localization
results: Experiments show that CTP-Net can accurately localize multi-scale forged regions in document images and outperforms state-of-the-art forgery localization methods, even when post-processing operations are applied. To address the scarcity of forged document images, a data generation strategy is also designed to construct a Fake Chinese Trademark dataset (FCTM).
Abstract
Due to the progression of information technology in recent years, document images have been widely disseminated on social networks. With the help of powerful image editing tools, document images are easily forged without leaving visible manipulation traces, which leads to severe issues if significant information is falsified for malicious use. Therefore, the research of document image forensics is worth further exploring. In this paper, we propose a Character Texture Perception Network (CTP-Net) to localize the forged regions in document images. Specifically, considering the characters with semantics in a document image are highly vulnerable, capturing the forgery traces is the key to localize the forged regions. We design a Character Texture Stream (CTS) based on optical character recognition to capture features of text areas that are essential components of a document image. Meanwhile, texture features of the whole document image are exploited by an Image Texture Stream (ITS). Combining the features extracted from the CTS and the ITS, the CTP-Net can reveal more subtle forgery traces from document images. Moreover, to overcome the challenge caused by the lack of fake document images, we design a data generation strategy that is utilized to construct a Fake Chinese Trademark dataset (FCTM). Experimental results on different datasets demonstrate that the proposed CTP-Net is able to localize multi-scale forged areas in document images, and outperform the state-of-the-art forgery localization methods, even though post-processing operations are applied.
SDDM: Score-Decomposed Diffusion Models on Manifolds for Unpaired Image-to-Image Translation
for: SDDM, a new score-decomposed diffusion model that explicitly optimizes the tangled intermediate distributions during image generation.
methods: SDDM derives manifolds to decompose the score function or energy guidance into an image "denoising" part and a content "refinement" part.
results: SDDM outperforms existing SBDM-based methods with far fewer diffusion steps on several I2I benchmarks.
Abstract
Recent score-based diffusion models (SBDMs) show promising results in unpaired image-to-image translation (I2I). However, existing methods, either energy-based or statistically-based, provide no explicit form of the interfered intermediate generative distributions. This work presents a new score-decomposed diffusion model (SDDM) on manifolds to explicitly optimize the tangled distributions during image generation. SDDM derives manifolds to make the distributions of adjacent time steps separable and decompose the score function or energy guidance into an image ``denoising" part and a content ``refinement" part. To refine the image in the same noise level, we equalize the refinement parts of the score function and energy guidance, which permits multi-objective optimization on the manifold. We also leverage the block adaptive instance normalization module to construct manifolds with lower dimensions but still concentrated with the perturbed reference image. SDDM outperforms existing SBDM-based methods with much fewer diffusion steps on several I2I benchmarks.
paper_authors: Takayuki Kanai, Igor Vasiljevic, Vitor Guizilini, Adrien Gaidon, Rares Ambrus
for: Autonomous vehicles and robots need to operate over a wide variety of scenarios to complete tasks efficiently and safely.
methods: A novel self-supervised method for extrinsic calibration that needs no additional sensors: monocular depth and pose estimators with velocity supervision on driving videos are used in a curriculum learning strategy to estimate extrinsics, which are then learned jointly with depth and pose.
results: Experiments on a multi-camera dataset (DDAD) show robust and efficient self-calibration across scenes without dedicated sensors, and extrinsic self-calibration further improves depth prediction through joint optimization.
Abstract
Autonomous vehicles and robots need to operate over a wide variety of scenarios in order to complete tasks efficiently and safely. Multi-camera self-supervised monocular depth estimation from videos is a promising way to reason about the environment, as it generates metrically scaled geometric predictions from visual data without requiring additional sensors. However, most works assume well-calibrated extrinsics to fully leverage this multi-camera setup, even though accurate and efficient calibration is still a challenging problem. In this work, we introduce a novel method for extrinsic calibration that builds upon the principles of self-supervised monocular depth and ego-motion learning. Our proposed curriculum learning strategy uses monocular depth and pose estimators with velocity supervision to estimate extrinsics, and then jointly learns extrinsic calibration along with depth and pose for a set of overlapping cameras rigidly attached to a moving vehicle. Experiments on a benchmark multi-camera dataset (DDAD) demonstrate that our method enables self-calibration in various scenes robustly and efficiently compared to a traditional vision-based pose estimation pipeline. Furthermore, we demonstrate the benefits of extrinsics self-calibration as a way to improve depth prediction via joint optimization.
Attention-Driven Lightweight Model for Pigmented Skin Lesion Detection
paper_authors: Mingzhe Hu, Xiaofeng Yang
for: This paper presents a lightweight pipeline for skin lesion detection, addressing the challenges of imbalanced class distribution and subtle or atypical appearances of some lesions.
methods: The pipeline uses a lightweight model that leverages ghosted features and the DFC attention mechanism to reduce computational complexity while maintaining high performance. The model was trained on the HAM10000 dataset, which includes various types of skin lesions, and incorporates a knowledge-based loss weighting technique to address class imbalance.
results: The model achieved an accuracy of 92.4%, a precision of 84.2%, a recall of 86.9%, and an f1-score of 85.4%, with particularly strong performance in identifying Benign Keratosis-like lesions (BKL) and Nevus (NV). Despite its superior performance, the model's computational cost is considerably lower than some models with less accuracy, making it an optimal solution for real-world applications where both accuracy and efficiency are essential.
Abstract
This study presents a lightweight pipeline for skin lesion detection, addressing the challenges posed by imbalanced class distribution and subtle or atypical appearances of some lesions. The pipeline is built around a lightweight model that leverages ghosted features and the DFC attention mechanism to reduce computational complexity while maintaining high performance. The model was trained on the HAM10000 dataset, which includes various types of skin lesions. To address the class imbalance in the dataset, the synthetic minority over-sampling technique and various image augmentation techniques were used. The model also incorporates a knowledge-based loss weighting technique, which assigns different weights to the loss function at the class level and the instance level, helping the model focus on minority classes and challenging samples. By applying appropriate loss weights, the model pays more attention to the minority classes and challenging samples, thus improving its ability to correctly detect and classify different skin lesions. The model achieved an accuracy of 92.4%, a precision of 84.2%, a recall of 86.9%, and an f1-score of 85.4%, with particularly strong performance in identifying Benign Keratosis-like lesions (BKL) and Nevus (NV). Despite its superior performance, the model's computational cost is considerably lower than some models with less accuracy, making it an optimal solution for real-world applications where both accuracy and efficiency are essential.
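The two-level loss weighting can be illustrated with a short sketch: class-level weights (for example, inverse class frequency) scale the cross-entropy per class, while an instance-level factor up-weights hard samples. The focal-style instance term below is our stand-in; the paper derives its "knowledge-based" weights differently:

```python
import torch
import torch.nn.functional as F

def weighted_lesion_loss(logits, targets, class_weights, gamma=2.0):
    """Two-level loss weighting sketch: class-level weights counter class
    imbalance, while an instance-level (1 - p)^gamma factor up-weights hard
    samples. The paper's knowledge-based weights are derived differently."""
    ce = F.cross_entropy(logits, targets, weight=class_weights, reduction="none")
    p_true = logits.softmax(dim=1).gather(1, targets[:, None]).squeeze(1)
    instance_w = (1.0 - p_true) ** gamma     # hard samples get larger weight
    return (instance_w * ce).mean()
```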
Rethinking Class Activation Maps for Segmentation: Revealing Semantic Information in Shallow Layers by Reducing Noise
results: Extensive experiments validate that the method improves performance on the weakly-supervised semantic segmentation task.
Abstract
Class activation maps are widely used for explaining deep neural networks. Due to its ability to highlight regions of interest, it has evolved in recent years as a key step in weakly supervised learning. A major limitation to the performance of the class activation maps is the small spatial resolution of the feature maps in the last layer of the convolutional neural network. Therefore, we expect to generate high-resolution feature maps that result in high-quality semantic information. In this paper, we rethink the properties of semantic information in shallow feature maps. We find that the shallow feature maps still have fine-grained non-discriminative features while mixing considerable non-target noise. Furthermore, we propose a simple gradient-based denoising method to filter the noise by truncating the positive gradient. Our proposed scheme can be easily deployed in other CAM-related methods, facilitating these methods to obtain higher-quality class activation maps. We evaluate the proposed approach through a weakly-supervised semantic segmentation task, and a large number of experiments demonstrate the effectiveness of our approach.
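A sketch of the gradient-truncation idea applied to a shallow layer: compute a Grad-CAM-style map but keep only the positive part of the gradient before channel-weighting the features. Reading "truncating the positive gradient" as discarding non-positive gradients is our interpretation of the abstract:

```python
import torch
import torch.nn.functional as F

def denoised_shallow_cam(features, logits, class_idx):
    """Grad-CAM-style map from a shallow feature map, keeping only the
    positive part of the gradient (our reading of 'truncating the positive
    gradient'). `features` must be a non-leaf tensor from the forward pass."""
    score = logits[:, class_idx].sum()
    grads, = torch.autograd.grad(score, features, retain_graph=True)
    grads = grads.clamp(min=0)                       # the truncation step
    weights = grads.mean(dim=(2, 3), keepdim=True)   # channel importance
    cam = F.relu((weights * features).sum(dim=1))    # (B, H, W), shallow resolution
    return cam / cam.amax(dim=(1, 2), keepdim=True).clamp(min=1e-8)
```

Because shallow layers keep a higher spatial resolution, the resulting map is finer-grained than one computed from the last convolutional layer, which is the motivation stated above.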
Breast Ultrasound Tumor Classification Using a Hybrid Multitask CNN-Transformer Network
results: On a dataset of 3,320 breast ultrasound images, Hybrid-MT-ESTAN achieved the highest accuracy, sensitivity, and F1 score among the compared methods, at 82.7%, 86.4%, and 86.0%, respectively.
Abstract
Capturing global contextual information plays a critical role in breast ultrasound (BUS) image classification. Although convolutional neural networks (CNNs) have demonstrated reliable performance in tumor classification, they have inherent limitations for modeling global and long-range dependencies due to the localized nature of convolution operations. Vision Transformers have an improved capability of capturing global contextual information but may distort the local image patterns due to the tokenization operations. In this study, we proposed a hybrid multitask deep neural network called Hybrid-MT-ESTAN, designed to perform BUS tumor classification and segmentation using a hybrid architecture composed of CNNs and Swin Transformer components. The proposed approach was compared to nine BUS classification methods and evaluated using seven quantitative metrics on a dataset of 3,320 BUS images. The results indicate that Hybrid-MT-ESTAN achieved the highest accuracy, sensitivity, and F1 score of 82.7%, 86.4%, and 86.0%, respectively.
CT Reconstruction from Few Planar X-rays with Application towards Low-resource Radiotherapy
results: 2-field opposed, palliative radiotherapy plans generated on thoracic CTs reconstructed by the method show <1% isocenter dose error relative to plans computed on clinically acquired CTs using <=4 X-ray views. The method also outperforms recent sparse CT reconstruction baselines in standard pixel- and structure-level metrics (PSNR, SSIM, Dice score) on the public LIDC lung CT dataset.
Abstract
CT scans are the standard-of-care for many clinical ailments, and are needed for treatments like external beam radiotherapy. Unfortunately, CT scanners are rare in low and mid-resource settings due to their costs. Planar X-ray radiography units, in comparison, are far more prevalent, but can only provide limited 2D observations of the 3D anatomy. In this work, we propose a method to generate CT volumes from few (<5) planar X-ray observations using a prior data distribution, and perform the first evaluation of such a reconstruction algorithm for a clinical application: radiotherapy planning. We propose a deep generative model, building on advances in neural implicit representations to synthesize volumetric CT scans from few input planar X-ray images at different angles. To focus the generation task on clinically-relevant features, our model can also leverage anatomical guidance during training (via segmentation masks). We generated 2-field opposed, palliative radiotherapy plans on thoracic CTs reconstructed by our method, and found that isocenter radiation dose on reconstructed scans have <1% error with respect to the dose calculated on clinically acquired CTs using <=4 X-ray views. In addition, our method is better than recent sparse CT reconstruction baselines in terms of standard pixel and structure-level metrics (PSNR, SSIM, Dice score) on the public LIDC lung CT dataset. Code is available at: https://github.com/wanderinrain/Xray2CT.
Multi-interactive Feature Learning and a Full-time Multi-modality Benchmark for Image Fusion and Segmentation
results: Experiments show that the method outputs visually appealing fused images and improves segmentation mIoU in real-world scenes compared with previous methods.
Abstract
Multi-modality image fusion and segmentation play a vital role in autonomous driving and robotic operation. Early efforts focus on boosting the performance for only one task, \emph{e.g.,} fusion or segmentation, making it hard to reach~`Best of Both Worlds'. To overcome this issue, in this paper, we propose a \textbf{M}ulti-\textbf{i}nteractive \textbf{F}eature learning architecture for image fusion and \textbf{Seg}mentation, namely SegMiF, and exploit dual-task correlation to promote the performance of both tasks. The SegMiF is of a cascade structure, containing a fusion sub-network and a commonly used segmentation sub-network. By slickly bridging intermediate features between two components, the knowledge learned from the segmentation task can effectively assist the fusion task. Also, the benefited fusion network supports the segmentation one to perform more precisely. Besides, a hierarchical interactive attention block is established to ensure fine-grained mapping of all the vital information between two tasks, so that the modality/semantic features can be fully mutual-interactive. In addition, a dynamic weight factor is introduced to automatically adjust the corresponding weights of each task, which can balance the interactive feature correspondence and break through the limitation of laborious tuning. Furthermore, we construct a smart multi-wave binocular imaging system and collect a full-time multi-modality benchmark with 15 annotated pixel-level categories for image fusion and segmentation. Extensive experiments on several public datasets and our benchmark demonstrate that the proposed method outputs visually appealing fused images and achieves on average $7.66\%$ higher segmentation mIoU in real-world scenes than the state-of-the-art approaches. The source code and benchmark are available at \url{https://github.com/JinyuanLiu-CV/SegMiF}.
Diffusion Models for Counterfactual Generation and Anomaly Detection in Brain Images
results: On the IST-3 and BraTS2021 datasets, the method is compared with other weakly supervised approaches and improves the DICE score of the best competing method from 0.6534 to 0.7056, indicating that it generates better healthy counterfactual images.
Abstract
Segmentation masks of pathological areas are useful in many medical applications, such as brain tumour and stroke management. Moreover, healthy counterfactuals of diseased images can be used to enhance radiologists' training files and to improve the interpretability of segmentation models. In this work, we present a weakly supervised method to generate a healthy version of a diseased image and then use it to obtain a pixel-wise anomaly map. To do so, we start by considering a saliency map that approximately covers the pathological areas, obtained with ACAT. Then, we propose a technique that allows us to perform targeted modifications to these regions, while preserving the rest of the image. In particular, we employ a diffusion model trained on healthy samples and combine Denoising Diffusion Probabilistic Model (DDPM) and Denoising Diffusion Implicit Model (DDIM) at each step of the sampling process. DDPM is used to modify the areas affected by a lesion within the saliency map, while DDIM guarantees reconstruction of the normal anatomy outside of it. The two parts are also fused at each timestep, to guarantee the generation of a sample with a coherent appearance and a seamless transition between edited and unedited parts. We verify that when our method is applied to healthy samples, the input images are reconstructed without significant modifications. We compare our approach with alternative weakly supervised methods on IST-3 for stroke lesion segmentation and on BraTS2021 for brain tumour segmentation, where we improve the DICE score of the best competing method from $0.6534$ to $0.7056$.
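To make the per-timestep fusion concrete, here is a minimal sketch of one reverse sampling step that applies a stochastic DDPM-style update inside the saliency mask and a deterministic DDIM-style update outside it, assuming a standard noise-prediction network and precomputed cumulative alphas (all names are illustrative, not the authors' code):

```python
import torch

@torch.no_grad()
def masked_reverse_step(x_t, t, eps_model, mask, alphas_cumprod):
    """One reverse diffusion step: DDPM inside the (binary) saliency mask,
    DDIM outside it, fused at every timestep."""
    a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t - 1]
    eps = eps_model(x_t, t)                                  # predicted noise
    x0_hat = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()     # estimate of x_0
    # DDIM (eta=0): deterministic, reconstructs the normal anatomy faithfully.
    x_ddim = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps
    # DDPM (eta=1): stochastic, lets the model "heal" the lesion region.
    sigma = ((1 - a_prev) / (1 - a_t) * (1 - a_t / a_prev)).sqrt()
    x_ddpm = (a_prev.sqrt() * x0_hat
              + (1 - a_prev - sigma**2).sqrt() * eps
              + sigma * torch.randn_like(x_t))
    return mask * x_ddpm + (1 - mask) * x_ddim
```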
results: The method detects anomalous objects on the road with high accuracy, achieving an AP of 80.08% and 88.98% on the Fishyscapes Lost and Found and RoadAnomaly validation sets, respectively. Project page: https://vision.rwth-aachen.de/ugains
Abstract
A single unexpected object on the road can cause an accident or may lead to injuries. To prevent this, we need a reliable mechanism for finding anomalous objects on the road. This task, called anomaly segmentation, can be a stepping stone to safe and reliable autonomous driving. Current approaches tackle anomaly segmentation by assigning an anomaly score to each pixel and by grouping anomalous regions using simple heuristics. However, pixel grouping is a limiting factor when it comes to evaluating the segmentation performance of individual anomalous objects. To address the issue of grouping multiple anomaly instances into one, we propose an approach that produces accurate anomaly instance masks. Our approach centers on an out-of-distribution segmentation model for identifying uncertain regions and a strong generalist segmentation model for anomaly instance segmentation. We investigate ways to use uncertain regions to guide such a segmentation model to perform segmentation of anomalous instances. By incorporating strong object priors from a generalist model, we additionally improve the per-pixel anomaly segmentation performance. Our approach outperforms current pixel-level anomaly segmentation methods, achieving an AP of 80.08% and 88.98% on the Fishyscapes Lost and Found and the RoadAnomaly validation sets respectively. Project page: https://vision.rwth-aachen.de/ugains
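One plausible way to hand the uncertain regions to a promptable generalist segmenter is to sample point prompts from high-anomaly pixels; the sketch below illustrates this idea (the threshold and number of points are illustrative assumptions, not the paper's exact procedure):

```python
import numpy as np

def point_prompts_from_anomaly(anomaly_map, thresh=0.9, k=5, seed=0):
    """Sample up to k (x, y) point prompts from pixels whose anomaly
    score exceeds `thresh`, to query a generalist instance segmenter."""
    rng = np.random.default_rng(seed)
    ys, xs = np.nonzero(anomaly_map > thresh)
    if len(ys) == 0:
        return np.empty((0, 2), dtype=int)
    idx = rng.choice(len(ys), size=min(k, len(ys)), replace=False)
    return np.stack([xs[idx], ys[idx]], axis=1)
```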
results: Extensive experiments on four benchmarks and two tasks show that ETran outperforms previous works on object detection and classification benchmarks by an average of 21% and 12%, respectively, and achieves state-of-the-art performance in transferability assessment.
Abstract
This paper addresses the problem of ranking pre-trained models for object detection and image classification. Selecting the best pre-trained model by fine-tuning is an expensive and time-consuming task. Previous works have proposed transferability estimation based on features extracted by the pre-trained models. We argue that quantifying whether the target dataset is in-distribution (IND) or out-of-distribution (OOD) for the pre-trained model is an important factor in the transferability estimation. To this end, we propose ETran, an energy-based transferability assessment metric, which includes three scores: 1) energy score, 2) classification score, and 3) regression score. We use energy-based models to determine whether the target dataset is OOD or IND for the pre-trained model. In contrast to prior works, ETran is applicable to a wide range of tasks including classification, regression, and object detection (classification+regression). This is the first work that proposes transferability estimation for the object detection task. Our extensive experiments on four benchmarks and two tasks show that ETran outperforms previous works on object detection and classification benchmarks by an average of 21% and 12%, respectively, and achieves SOTA in transferability assessment.
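For reference, an energy score is typically the negative log-sum-exp of logits computed from the pre-trained features, with lower energy indicating a more in-distribution sample; a minimal sketch (the linear head and its fitting on target labels are assumptions, and ETran's exact formulation may differ):

```python
import torch

def energy_score(features, linear_head):
    """Per-sample energy: E(x) = -logsumexp(f(x)); lower energy suggests
    the sample is more in-distribution for the pre-trained backbone."""
    logits = linear_head(features)            # (N, num_classes)
    return -torch.logsumexp(logits, dim=-1)   # (N,)
```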
Explainable unsupervised multi-modal image registration using deep networks
results: In previous work, the authors reached performance superior to the current standard Syn method; in this work, they show that their DL model becomes fully explainable, laying the groundwork for generalizing the approach to further medical imaging data.
Abstract
Clinical decision making from magnetic resonance imaging (MRI) combines complementary information from multiple MRI sequences (defined as 'modalities'). MRI image registration aims to geometrically 'pair' diagnoses from different modalities, time points and slices. Both intra- and inter-modality MRI registration are essential components in clinical MRI settings. Further, an MRI image processing pipeline that can address both affine and non-rigid registration is critical, as both types of deformations may be occurring in real MRI data scenarios. Unlike image classification, explainability is not commonly addressed in image registration deep learning (DL) methods, as it is challenging to interpret model-data behaviours against transformation fields. To properly address this, we incorporate Grad-CAM-based explainability frameworks in each major component of our unsupervised multi-modal and multi-organ image registration DL methodology. We previously demonstrated that we were able to reach superior performance (against the current standard Syn method). In this work, we show that our DL model becomes fully explainable, setting the framework to generalise our approach on further medical imaging data.
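For readers unfamiliar with Grad-CAM, the following generic sketch computes a heatmap from a chosen layer's activations weighted by the gradients of a scalar score; attaching it to a term of the registration loss via `score_fn` is one way to apply it here (the hook-based setup is a common pattern, not the authors' exact code):

```python
import torch

def grad_cam(model, layer, x, score_fn):
    """Grad-CAM: ReLU of activation maps weighted by spatially averaged
    gradients of a scalar score (e.g. a similarity term of the loss)."""
    acts, grads = [], []
    h1 = layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    score_fn(model(x)).backward()
    h1.remove(); h2.remove()
    a, g = acts[0], grads[0]                       # (B, C, H, W)
    weights = g.mean(dim=(2, 3), keepdim=True)     # global-average-pooled grads
    cam = torch.relu((weights * a).sum(dim=1))     # (B, H, W)
    return cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-8)
```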
Predicting Ki67, ER, PR, and HER2 Statuses from H&E-stained Breast Cancer Images
paper_authors: Amir Akbarnejad, Nilanjan Ray, Penny J. Barnes, Gilbert Bigras
for: The goal of this study is to determine whether machine learning methods can accurately predict molecular information from histomorphology images alone.
methods: We built a large-scale dataset (185,538 images) with reliable measurements for Ki67, ER, PR, and HER2 statuses. The dataset consists of H&E images and corresponding immunohistochemistry (IHC) assay images (Ki67, ER, PR, and HER2), mirrored through registration. To increase reliability, we manually inspected each pair of images and discarded those with artifacts (such as tissue folding or bubbles). Measurements for Ki67, ER, and PR were determined by calculating the H-Score from image analysis, while the HER2 measurement was based on binary classification.
results: We find that a standard ViT-based pipeline can achieve prediction performance of around 90% AUC when trained with a proper labeling protocol. We also show that the trained classifiers correctly localize relevant regions, which motivates future work on improving the localizations. Our proposed dataset is publicly available: https://ihc4bc.github.io/
Abstract
Despite the advances in machine learning and digital pathology, it is not yet clear if machine learning methods can accurately predict molecular information merely from histomorphology. In a quest to answer this question, we built a large-scale dataset (185,538 images) with reliable measurements for Ki67, ER, PR, and HER2 statuses. The dataset is composed of mirrored images of H\&E and corresponding images of immunohistochemistry (IHC) assays (Ki67, ER, PR, and HER2). These images are mirrored through registration. To increase reliability, individual pairs were inspected and discarded if artifacts were present (tissue folding, bubbles, etc.). Measurements for Ki67, ER, and PR were determined by calculating the H-Score from image analysis. The HER2 measurement is based on binary classification: 0 and 1+ (IHC scores representing the negative subset) vs. 3+ (the IHC-score-positive subset). Cases with an equivocal IHC score (2+) were excluded. We show that a standard ViT-based pipeline can achieve prediction performances around 90% in terms of Area Under the Curve (AUC) when trained with a proper labeling protocol. Finally, we shed light on the ability of the trained classifiers to localize relevant regions, which encourages future work to improve the localizations. Our proposed dataset is publicly available: https://ihc4bc.github.io/
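For reference, the H-Score used for Ki67, ER, and PR combines the percentages of weakly (1+), moderately (2+), and strongly (3+) stained cells into a value between 0 and 300; a minimal sketch of the standard formula (how per-cell intensities are extracted from the IHC images is the paper's image-analysis pipeline, not shown here):

```python
import numpy as np

def h_score(cell_intensities):
    """H-Score = 1*(% cells at 1+) + 2*(% at 2+) + 3*(% at 3+), in [0, 300]."""
    c = np.asarray(cell_intensities)
    pct = lambda level: 100.0 * np.mean(c == level)
    return 1 * pct(1) + 2 * pct(2) + 3 * pct(3)

# e.g. h_score([0, 1, 1, 2, 3, 3]) ~= 166.7
```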
CartiMorph: a framework for automated knee articular cartilage morphometrics
results: The study finds that the framework produces accurate cartilage thickness maps and full-thickness cartilage loss (FCL) estimates: compared with manual segmentation, the root-mean-squared deviation of the FCL measurements is below 8% and strong correlations are observed; moreover, the FCL measurements deviate less from the ground truth than those reported in a previous study.
Abstract
We introduce CartiMorph, a framework for automated knee articular cartilage morphometrics. It takes an image as input and generates quantitative metrics for cartilage subregions, including the percentage of full-thickness cartilage loss (FCL), mean thickness, surface area, and volume. CartiMorph leverages the power of deep learning models for hierarchical image feature representation. Deep learning models were trained and validated for tissue segmentation, template construction, and template-to-image registration. We established methods for surface-normal-based cartilage thickness mapping, FCL estimation, and rule-based cartilage parcellation. Our cartilage thickness map showed less error in thin and peripheral regions. We evaluated the effectiveness of the adopted segmentation model by comparing the quantitative metrics obtained from model segmentation and those from manual segmentation. The root-mean-squared deviation of the FCL measurements was less than 8%, and strong correlations were observed for the mean thickness (Pearson's correlation coefficient $\rho \in [0.82,0.97]$), surface area ($\rho \in [0.82,0.98]$) and volume ($\rho \in [0.89,0.98]$) measurements. We compared our FCL measurements with those from a previous study and found that our measurements deviated less from the ground truths. We observed superior performance of the proposed rule-based cartilage parcellation method compared with the atlas-based approach. CartiMorph has the potential to promote imaging biomarkers discovery for knee osteoarthritis.
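As a simple illustration of the FCL metric, the percentage of full-thickness cartilage loss over a subregion can be read off a thickness map as the fraction of the (template) cartilage surface with near-zero thickness; a sketch under that simplification (CartiMorph's actual pipeline computes thickness along surface normals and parcellates subregions with rules):

```python
import numpy as np

def fcl_percentage(thickness_map, region_mask, eps=1e-3):
    """FCL over one subregion: % of surface points whose mapped cartilage
    thickness is (near) zero. `eps` is an illustrative tolerance."""
    region = thickness_map[region_mask]
    return 100.0 * np.mean(region < eps)
```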
Unmasking Parkinson’s Disease with Smile: An AI-enabled Screening Framework
methods: The study uses the largest video dataset containing micro-expressions, and extracts features relevant to hypomimia from facial landmarks and action units. An ensemble of AI models trained on these features achieved an accuracy of 89.7% and an AUROC of 89.3%.
results: The study finds that micro-expression-based screening for Parkinson's disease can be highly reliable and free from detectable bias, achieving comparable performance across sex and ethnicity subgroups.
Abstract
Parkinson's disease (PD) diagnosis remains challenging due to lacking a reliable biomarker and limited access to clinical care. In this study, we present an analysis of the largest video dataset containing micro-expressions to screen for PD. We collected 3,871 videos from 1,059 unique participants, including 256 self-reported PD patients. The recordings are from diverse sources encompassing participants' homes across multiple countries, a clinic, and a PD care facility in the US. Leveraging facial landmarks and action units, we extracted features relevant to Hypomimia, a prominent symptom of PD characterized by reduced facial expressions. An ensemble of AI models trained on these features achieved an accuracy of 89.7% and an Area Under the Receiver Operating Characteristic (AUROC) of 89.3% while being free from detectable bias across population subgroups based on sex and ethnicity on held-out data. Further analysis reveals that features from the smiling videos alone lead to comparable performance, even on two external test sets the model has never seen during training, suggesting the potential for PD risk assessment from smiling selfie videos.
RealCQA: Scientific Chart Question Answering as a Test-bed for First-Order Logic
results: The paper experimentally evaluates large-scale pre-trained models on the chart visual question-answering task, and provides a benchmark for both chart visual QA and formal logic verification.
Abstract
We present a comprehensive study of chart visual question-answering(QA) task, to address the challenges faced in comprehending and extracting data from chart visualizations within documents. Despite efforts to tackle this problem using synthetic charts, solutions are limited by the shortage of annotated real-world data. To fill this gap, we introduce a benchmark and dataset for chart visual QA on real-world charts, offering a systematic analysis of the task and a novel taxonomy for template-based chart question creation. Our contribution includes the introduction of a new answer type, 'list', with both ranked and unranked variations. Our study is conducted on a real-world chart dataset from scientific literature, showcasing higher visual complexity compared to other works. Our focus is on template-based QA and how it can serve as a standard for evaluating the first-order logic capabilities of models. The results of our experiments, conducted on a real-world out-of-distribution dataset, provide a robust evaluation of large-scale pre-trained models and advance the field of chart visual QA and formal logic verification for neural networks in general.
Synthesising Rare Cataract Surgery Samples with Guided Diffusion Models
paper_authors: Yannik Frisch, Moritz Fuchs, Antoine Sanner, Felix Anton Ucar, Marius Frenzel, Joana Wasielica-Poslednik, Adrian Gericke, Felix Mathias Wagner, Thomas Dratsch, Anirban Mukhopadhyay
for: The paper addresses the challenge of gathering and annotating data for training automated assistance systems for cataract surgery, by analyzing cataract surgery video data and synthesizing diverse, high-quality examples of surgical phases and tool usage.
methods: The authors use a conditional generative model based on Denoising Diffusion Implicit Models (DDIM) and Classifier-Free Guidance (CFG) to synthesize realistic examples of surgical phases and tool usage, alleviating the data sparsity problem for the downstream task of tool classification.
results: The model can generate valuable unseen examples that allow the tool classifier to improve by up to 10% for rare cases, and the synthetically extended data provides a reliable source of realistic synthetic data to facilitate the development of automated assistance systems for cataract surgery.
Abstract
Cataract surgery is a frequently performed procedure that demands automation and advanced assistance systems. However, gathering and annotating data for training such systems is resource intensive. The publicly available data also comprises severe imbalances inherent to the surgical process. Motivated by this, we analyse cataract surgery video data for the worst-performing phases of a pre-trained downstream tool classifier. The analysis demonstrates that imbalances deteriorate the classifier's performance on underrepresented cases. To address this challenge, we utilise a conditional generative model based on Denoising Diffusion Implicit Models (DDIM) and Classifier-Free Guidance (CFG). Our model can synthesise diverse, high-quality examples based on complex multi-class multi-label conditions, such as surgical phases and combinations of surgical tools. We affirm that the synthesised samples display tools that the classifier recognises. These samples are hard to differentiate from real images, even for clinical experts with more than five years of experience. Further, our synthetically extended data can alleviate the data sparsity problem for the downstream task of tool classification. The evaluations demonstrate that the model can generate valuable unseen examples, allowing the tool classifier to improve by up to 10% for rare cases. Overall, our approach can facilitate the development of automated assistance systems for cataract surgery by providing a reliable source of realistic synthetic data, which we make available for everyone.
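The conditioning mechanism relies on standard classifier-free guidance, which blends conditional and unconditional noise predictions at each sampling step; a minimal sketch (the model signature, null-condition convention, and guidance scale are illustrative assumptions):

```python
def cfg_noise(eps_model, x_t, t, cond, guidance_scale=3.0):
    """Classifier-free guidance: push the noise prediction toward the
    condition (e.g. surgical phase + tool combination) by extrapolating
    from the unconditional prediction; cond=None denotes the null condition."""
    eps_uncond = eps_model(x_t, t, None)
    eps_cond = eps_model(x_t, t, cond)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```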
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World
results: The project develops the All-Seeing Model (ASM), a unified framework that handles a variety of vision and language tasks, including region-text retrieval, region recognition, captioning, and question answering. The model shows remarkable zero-shot performance and generalizes to diverse vision and language tasks.
Abstract
We present the All-Seeing (AS) project: a large-scale data and model for recognizing and understanding everything in the open world. Using a scalable data engine that incorporates human feedback and efficient models in the loop, we create a new dataset (AS-1B) with over 1 billion regions annotated with semantic tags, question-answering pairs, and detailed captions. It covers a wide range of 3.5 million common and rare concepts in the real world, and has 132.2 billion tokens that describe the concepts and their attributes. Leveraging this new dataset, we develop the All-Seeing model (ASM), a unified framework for panoptic visual recognition and understanding. The model is trained with open-ended language prompts and locations, which allows it to generalize to various vision and language tasks with remarkable zero-shot performance, including region-text retrieval, region recognition, captioning, and question-answering. We hope that this project can serve as a foundation for vision-language artificial general intelligence research. Models and the dataset shall be released at https://github.com/OpenGVLab/All-Seeing, and demo can be seen at https://huggingface.co/spaces/OpenGVLab/all-seeing.
results: By incorporating these technologies and recent advancements in training and problem formation, the improved "plain" DETR showed exceptional improvements over the original DETR detector. Using the Object365 dataset for pre-training, it achieved 63.9 mAP accuracy with a Swin-L backbone, which is highly competitive with state-of-the-art detectors that all heavily rely on multi-scale feature maps and region-based feature extraction.
Abstract
This paper presents an improved DETR detector that maintains a "plain" nature: using a single-scale feature map and global cross-attention calculations without specific locality constraints, in contrast to previous leading DETR-based detectors that reintroduce architectural inductive biases of multi-scale and locality into the decoder. We show that two simple technologies are surprisingly effective within a plain design to compensate for the lack of multi-scale feature maps and locality constraints. The first is a box-to-pixel relative position bias (BoxRPB) term added to the cross-attention formulation, which well guides each query to attend to the corresponding object region while also providing encoding flexibility. The second is masked image modeling (MIM)-based backbone pre-training which helps learn representation with fine-grained localization ability and proves crucial for remedying dependencies on the multi-scale feature maps. By incorporating these technologies and recent advancements in training and problem formation, the improved "plain" DETR showed exceptional improvements over the original DETR detector. By leveraging the Object365 dataset for pre-training, it achieved 63.9 mAP accuracy using a Swin-L backbone, which is highly competitive with state-of-the-art detectors which all heavily rely on multi-scale feature maps and region-based feature extraction. Code is available at https://github.com/impiga/Plain-DETR .
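The BoxRPB term enters as an additive bias on the cross-attention logits, computed from each query's current box estimate and the pixel coordinates; a minimal sketch of where the bias is applied (the bias itself, produced in the paper from box-to-pixel relative positions, is taken as a precomputed input here):

```python
import torch

def cross_attention_with_boxrpb(q, k, v, rpb):
    """Single-head cross-attention with an additive box-to-pixel relative
    position bias; rpb has shape (num_queries, num_pixels)."""
    attn = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5 + rpb
    return attn.softmax(dim=-1) @ v
```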
results: Experiments show that UniSim can simulate realistic sensor data with a small domain gap on downstream tasks, enabling closed-loop evaluation of autonomy systems on safety-critical scenarios.
Abstract
Rigorously testing autonomy systems is essential for making safe self-driving vehicles (SDV) a reality. It requires one to generate safety critical scenarios beyond what can be collected safely in the world, as many scenarios happen rarely on public roads. To accurately evaluate performance, we need to test the SDV on these scenarios in closed-loop, where the SDV and other actors interact with each other at each timestep. Previously recorded driving logs provide a rich resource to build these new scenarios from, but for closed loop evaluation, we need to modify the sensor data based on the new scene configuration and the SDV's decisions, as actors might be added or removed and the trajectories of existing actors and the SDV will differ from the original log. In this paper, we present UniSim, a neural sensor simulator that takes a single recorded log captured by a sensor-equipped vehicle and converts it into a realistic closed-loop multi-sensor simulation. UniSim builds neural feature grids to reconstruct both the static background and dynamic actors in the scene, and composites them together to simulate LiDAR and camera data at new viewpoints, with actors added or removed and at new placements. To better handle extrapolated views, we incorporate learnable priors for dynamic objects, and leverage a convolutional network to complete unseen regions. Our experiments show UniSim can simulate realistic sensor data with small domain gap on downstream tasks. With UniSim, we demonstrate closed-loop evaluation of an autonomy system on safety-critical scenarios as if it were in the real world.
DualCoOp++: Fast and Effective Adaptation to Multi-Label Recognition with Limited Annotations
methods: We leverage the strong alignment between textual and visual features pretrained on millions of auxiliary image-text pairs. We propose an efficient and effective framework called Evidence-guided Dual Context Optimization (DualCoOp++), which serves as a unified approach to partial-label and zero-shot multi-label recognition. In DualCoOp++, we separately encode evidential, positive, and negative contexts for target classes as parametric components of the linguistic input (i.e., prompts).
results: Experiments on standard multi-label recognition benchmarks across two challenging low-label settings demonstrate the superior performance of our approach compared to state-of-the-art methods.
Abstract
Multi-label image recognition in the low-label regime is a task of great challenge and practical significance. Previous works have focused on learning the alignment between textual and visual spaces to compensate for limited image labels, yet may suffer from reduced accuracy due to the scarcity of high-quality multi-label annotations. In this research, we leverage the powerful alignment between textual and visual features pretrained with millions of auxiliary image-text pairs. We introduce an efficient and effective framework called Evidence-guided Dual Context Optimization (DualCoOp++), which serves as a unified approach for addressing partial-label and zero-shot multi-label recognition. In DualCoOp++ we separately encode evidential, positive, and negative contexts for target classes as parametric components of the linguistic input (i.e., prompts). The evidential context aims to discover all the related visual content for the target class, and serves as guidance to aggregate positive and negative contexts from the spatial domain of the image, enabling better distinguishment between similar categories. Additionally, we introduce a Winner-Take-All module that promotes inter-class interaction during training, while avoiding the need for extra parameters and costs. As DualCoOp++ imposes minimal additional learnable overhead on the pretrained vision-language framework, it enables rapid adaptation to multi-label recognition tasks with limited annotations and even unseen classes. Experiments on standard multi-label recognition benchmarks across two challenging low-label settings demonstrate the superior performance of our approach compared to state-of-the-art methods.
results: Two lightweight components, imitation loss and delayed adversarial training, further enhance the robustness of the object detector, and extensive experiments on the MS-COCO and Pascal VOC datasets demonstrate the effectiveness of the proposal.
Abstract
Object detection is a vital task in computer vision and has become an integral component of numerous critical systems. However, state-of-the-art object detectors, similar to their classification counterparts, are susceptible to small adversarial perturbations that can significantly alter their normal behavior. Unlike classification, the robustness of object detectors has not been thoroughly explored. In this work, we take the initial step towards bridging the gap between the robustness of classification and object detection by leveraging adversarially trained classification models. Merely utilizing adversarially trained models as backbones for object detection does not result in robustness. We propose effective modifications to the classification-based backbone to instill robustness in object detection without incurring any computational overhead. To further enhance the robustness achieved by the proposed modified backbone, we introduce two lightweight components: imitation loss and delayed adversarial training. Extensive experiments on the MS-COCO and Pascal VOC datasets are conducted to demonstrate the effectiveness of our proposed approach.
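The abstract names an imitation loss without giving its form; one plausible reading, sketched below as an assumption rather than the paper's exact loss, is a feature-matching term that pulls the detector backbone's features toward those of the frozen adversarially trained classifier on the same images:

```python
import torch.nn.functional as F

def imitation_loss(det_feats, robust_feats):
    """Feature-matching term: align detector backbone features with those
    of the (frozen) adversarially pre-trained classification backbone."""
    return F.mse_loss(det_feats, robust_feats.detach())
```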
ConceptLab: Creative Generation using Diffusion Prior Constraints
results: The study shows that adaptively adding constraints from a question-answering model to the optimization keeps the generated concept from converging to existing members and encourages increasingly unique creations. The prior constraints also serve as a strong mixing mechanism, allowing hybrids between generated concepts and adding flexibility to the creative process.
Abstract
Recent text-to-image generative models have enabled us to transform our words into vibrant, captivating imagery. The surge of personalization techniques that has followed has also allowed us to imagine unique concepts in new scenes. However, an intriguing question remains: How can we generate a new, imaginary concept that has never been seen before? In this paper, we present the task of creative text-to-image generation, where we seek to generate new members of a broad category (e.g., generating a pet that differs from all existing pets). We leverage the under-studied Diffusion Prior models and show that the creative generation problem can be formulated as an optimization process over the output space of the diffusion prior, resulting in a set of "prior constraints". To keep our generated concept from converging into existing members, we incorporate a question-answering model that adaptively adds new constraints to the optimization problem, encouraging the model to discover increasingly more unique creations. Finally, we show that our prior constraints can also serve as a strong mixing mechanism allowing us to create hybrids between generated concepts, introducing even more flexibility into the creative process.
Reconstructing Three-Dimensional Models of Interacting Humans
paper_authors: Mihai Fieraru, Mihai Zanfir, Elisabeta Oneata, Alin-Ionut Popa, Vlad Olaru, Cristian Sminchisescu
for: This paper focuses on improving the accuracy of 3D human interactions, which is essential for fine-grained scene analysis and behavioral modeling.
methods: The authors introduce several new models for interaction signature estimation (ISP), including contact detection, segmentation, and 3D contact signature prediction. They also show how these components can be used to ensure contact consistency during 3D reconstruction.
results: The authors construct several large datasets for learning and evaluating 3D contact prediction and reconstruction methods, including CHI3D and FlickrCI3D. They also propose a methodology for recovering the ground-truth pose and shape of interacting people in a controlled setup and annotate all 3D interaction motions in CHI3D with textual descriptions.
Abstract
Understanding 3d human interactions is fundamental for fine-grained scene analysis and behavioural modeling. However, most of the existing models predict incorrect, lifeless 3d estimates, that miss the subtle human contact aspects--the essence of the event--and are of little use for detailed behavioral understanding. This paper addresses such issues with several contributions: (1) we introduce models for interaction signature estimation (ISP) encompassing contact detection, segmentation, and 3d contact signature prediction; (2) we show how such components can be leveraged to ensure contact consistency during 3d reconstruction; (3) we construct several large datasets for learning and evaluating 3d contact prediction and reconstruction methods; specifically, we introduce CHI3D, a lab-based accurate 3d motion capture dataset with 631 sequences containing $2,525$ contact events, $728,664$ ground truth 3d poses, as well as FlickrCI3D, a dataset of $11,216$ images, with $14,081$ processed pairs of people, and $81,233$ facet-level surface correspondences. Finally, (4) we propose methodology for recovering the ground-truth pose and shape of interacting people in a controlled setup and (5) annotate all 3d interaction motions in CHI3D with textual descriptions. Motion data in multiple formats (GHUM and SMPLX parameters, Human3.6m 3d joints) is made available for research purposes at \url{https://ci3d.imar.ro}, together with an evaluation server and a public benchmark.
Is your data alignable? Principled and interpretable alignability testing and integration of single-cell data
methods: The spectral manifold alignment and inference (SMAI) framework provides a principled statistical test for whether datasets are alignable, and preserves the structural integrity of the data during integration.
results: SMAI outperforms commonly used alignment methods on a diverse range of real and simulated benchmark datasets, and improves various downstream analyses such as the identification of differentially expressed genes and the imputation of single-cell spatial transcriptomics.
Abstract
Single-cell data integration can provide a comprehensive molecular view of cells, and many algorithms have been developed to remove unwanted technical or biological variations and integrate heterogeneous single-cell datasets. Despite their wide usage, existing methods suffer from several fundamental limitations. In particular, we lack a rigorous statistical test for whether two high-dimensional single-cell datasets are alignable (and therefore should even be aligned). Moreover, popular methods can substantially distort the data during alignment, making the aligned data and downstream analysis difficult to interpret. To overcome these limitations, we present a spectral manifold alignment and inference (SMAI) framework, which enables principled and interpretable alignability testing and structure-preserving integration of single-cell data. SMAI provides a statistical test to robustly determine the alignability between datasets to avoid misleading inference, and is justified by high-dimensional statistical theory. On a diverse range of real and simulated benchmark datasets, it outperforms commonly used alignment methods. Moreover, we show that SMAI improves various downstream analyses such as identification of differentially expressed genes and imputation of single-cell spatial transcriptomics, providing further biological insights. SMAI's interpretability also enables quantification and a deeper understanding of the sources of technical confounders in single-cell data.
results: The study finds that the differential dataflow model achieves comparable performance for additions and deletions and distributes the workload evenly.
Abstract
The core reasoning task for datalog engines is materialization, the evaluation of a datalog program over a database alongside its physical incorporation into the database itself. The de facto method of computing it is the recursive application of inference rules. Because it is a costly operation, datalog engines must provide incremental materialization, that is, adjust the computation to new data instead of restarting from scratch. A major caveat is that deleting data is notoriously more involved than adding it, since one has to take into account all the data that has been entailed from what is being deleted. Differential Dataflow is a computational model that provides efficient incremental maintenance of iterative dataflows, notably with equal performance for additions and deletions, and distribution of work. In this paper we investigate the performance of materialization with three reference datalog implementations, one of which is built on top of a lightweight relational engine, while the other two are differential-dataflow and non-differential versions of the same rewrite algorithm, with the same optimizations.
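To ground the terminology, the sketch below shows semi-naive materialization of a transitive-closure program: each round joins only the newly derived delta against the base relation, which is the incremental pattern that Differential Dataflow generalizes (including to deletions). Plain Python, purely illustrative:

```python
def transitive_closure(edges):
    """Materialize path(x,y) :- edge(x,y).
                   path(x,z) :- path(x,y), edge(y,z).
    Semi-naive: only the per-round delta is joined against edge."""
    path, delta = set(edges), set(edges)
    while delta:
        delta = {(x, z) for (x, y) in delta
                        for (y2, z) in edges if y == y2} - path
        path |= delta
    return path

# e.g. transitive_closure({(1, 2), (2, 3)}) == {(1, 2), (2, 3), (1, 3)}
```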
Scaling Survival Analysis in Healthcare with Federated Survival Forests: A Comparative Study on Heart Failure and Breast Cancer Genomics
results: Experimental results show that FedSurF++ achieves significant improvements over existing methods in both efficiency and privacy preservation, and is successfully applied to two real-world datasets.
Abstract
Survival analysis is a fundamental tool in medicine, modeling the time until an event of interest occurs in a population. However, in real-world applications, survival data are often incomplete, censored, distributed, and confidential, especially in healthcare settings where privacy is critical. The scarcity of data can severely limit the scalability of survival models to distributed applications that rely on large data pools. Federated learning is a promising technique that enables machine learning models to be trained on multiple datasets without compromising user privacy, making it particularly well-suited for addressing the challenges of survival data and large-scale survival applications. Despite significant developments in federated learning for classification and regression, many directions remain unexplored in the context of survival analysis. In this work, we propose an extension of the Federated Survival Forest algorithm, called FedSurF++. This federated ensemble method constructs random survival forests in heterogeneous federations. Specifically, we investigate several new tree sampling methods from client forests and compare the results with state-of-the-art survival models based on neural networks. The key advantage of FedSurF++ is its ability to achieve comparable performance to existing methods while requiring only a single communication round to complete. The extensive empirical investigation results in a significant improvement from the algorithmic and privacy preservation perspectives, making the original FedSurF algorithm more efficient, robust, and private. We also present results on two real-world datasets demonstrating the success of FedSurF++ in real-world healthcare studies. Our results underscore the potential of FedSurF++ to improve the scalability and effectiveness of survival analysis in distributed settings while preserving user privacy.
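The single-communication-round construction can be sketched as pooling locally trained survival trees and sampling a global forest from them; the uniform sampling below is only one of the tree-sampling schemes the paper compares (names are illustrative):

```python
import random

def sample_global_forest(client_forests, n_trees, seed=0):
    """One round of federation: clients send locally trained survival
    trees; the server samples a global forest from the pooled trees."""
    pool = [tree for forest in client_forests for tree in forest]
    return random.Random(seed).sample(pool, min(n_trees, len(pool)))
```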
Vulnerabilities in AI Code Generators: Exploring Targeted Data Poisoning Attacks
results: Our analysis shows that AI code generators are susceptible to targeted data poisoning: even a small amount of poisoned code leads the generator to produce vulnerable code. Moreover, the attack does not affect the correctness of the generated code, making it harder to detect.
Abstract
In this work, we assess the security of AI code generators via data poisoning, i.e., an attack that injects malicious samples into the training data to generate vulnerable code. We poison the training data by injecting increasing amounts of code containing security vulnerabilities and assess the attack's success on different state-of-the-art models for code generation. Our analysis shows that AI code generators are vulnerable to even a small amount of data poisoning. Moreover, the attack does not impact the correctness of code generated by pre-trained models, making it hard to detect.
Harnessing the Web and Knowledge Graphs for Automated Impact Investing Scoring
paper_authors: Qingzhi Hu, Daniel Daza, Laurens Swinkels, Kristina Ūsaitė, Robbert-Jan ’t Hoen, Paul Groth
for: The paper aims to automate the process of creating an SDG framework for the finance industry.
methods: The proposed system uses a data-driven approach, collecting and filtering a dataset of texts from different web sources and a knowledge graph relevant to a set of companies. Classifiers trained with this data are used to predict scores of alignment with SDGs for a given company.
results: The best performing model achieved a micro average F1 score of 0.89, demonstrating the effectiveness of the proposed solution. The system also provides explanations in the form of data relevant to a predicted score, facilitating its use by humans, and enables accurate prediction of SDG scores at a fraction of the cost of traditional methods.
Abstract
The Sustainable Development Goals (SDGs) were introduced by the United Nations in order to encourage policies and activities that help guarantee human prosperity and sustainability. SDG frameworks produced in the finance industry are designed to provide scores that indicate how well a company aligns with each of the 17 SDGs. This scoring enables a consistent assessment of investments that have the potential of building an inclusive and sustainable economy. As a result of the high quality and reliability required by such frameworks, the process of creating and maintaining them is time-consuming and requires extensive domain expertise. In this work, we describe a data-driven system that seeks to automate the process of creating an SDG framework. First, we propose a novel method for collecting and filtering a dataset of texts from different web sources and a knowledge graph relevant to a set of companies. We then implement and deploy classifiers trained with this data for predicting scores of alignment with SDGs for a given company. Our results indicate that our best performing model can accurately predict SDG scores with a micro average F1 score of 0.89, demonstrating the effectiveness of the proposed solution. We further describe how the integration of the models for its use by humans can be facilitated by providing explanations in the form of data relevant to a predicted score. We find that our proposed solution enables access to a large amount of information that analysts would normally not be able to process, resulting in an accurate prediction of SDG scores at a fraction of the cost.
A Machine Learning Method for Predicting Traffic Signal Timing from Probe Vehicle Data
results: Our results show an error of less than 0.56 sec for cycle length predictions and red time predictions within 7.2 sec error on average.
Abstract
Traffic signals play an important role in transportation by enabling traffic flow management and ensuring safety at intersections. In addition, knowing the traffic signal phase and timing data can allow optimal vehicle routing for time and energy efficiency, eco-driving, and the accurate simulation of signalized road networks. In this paper, we present a machine learning (ML) method for estimating traffic signal timing information from vehicle probe data. To the authors' best knowledge, very few works have presented ML techniques for determining traffic signal timing parameters from vehicle probe data. In this work, we develop an Extreme Gradient Boosting (XGBoost) model to estimate signal cycle lengths and a neural network model to determine the corresponding red times per phase from probe data. The green times are then derived from the cycle length and red times. Our results show an error of less than 0.56 sec for cycle length, and red time predictions within 7.2 sec error on average.
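The final step is simple arithmetic: once the cycle length and per-phase red times are predicted, each phase's green time follows by subtraction (a simplification that folds yellow and all-red intervals into the red time):

```python
def green_times(cycle_length, red_times):
    """Green time per phase = cycle length - red time for that phase."""
    return {phase: cycle_length - red for phase, red in red_times.items()}

# e.g. green_times(90.0, {"NS": 52.8, "EW": 41.5}) -> {"NS": 37.2, "EW": 48.5}
```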
Universal Defensive Underpainting Patch: Making Your Text Invisible to Optical Character Recognition
results: The method effectively prevents unauthorized OCR across various screenshot ranges and complex backgrounds, is insensitive to text image size, color, and language, and transfers across multiple OCR systems.
Abstract
Optical Character Recognition (OCR) enables automatic text extraction from scanned or digitized text images, but it also makes it easy to pirate valuable or sensitive text from these images. Previous methods to prevent OCR piracy by distorting characters in text images are impractical in real-world scenarios, as pirates can capture arbitrary portions of the text images, rendering the defenses ineffective. In this work, we propose a novel and effective defense mechanism termed the Universal Defensive Underpainting Patch (UDUP) that modifies the underpainting of text images instead of the characters. UDUP is created through an iterative optimization process to craft a small, fixed-size defensive patch that can generate non-overlapping underpainting for text images of any size. Experimental results show that UDUP effectively defends against unauthorized OCR under the setting of any screenshot range or complex image background. It is agnostic to the content, size, colors, and languages of characters, and is robust to typical image operations such as scaling and compressing. In addition, the transferability of UDUP is demonstrated by evading several off-the-shelf OCRs. The code is available at https://github.com/QRICKDD/UDUP.
methods: This study tests GPT-3.5 on the GTFS specification and tasks it with information extraction from a filtered GTFS feed, comparing zero-shot prompting with program synthesis.
results: GPT-3.5 answers 77% of multiple-choice questions (MCQ) correctly, and achieves ~90% accuracy on simple questions and ~40% on complex questions.
Abstract
The General Transit Feed Specification (GTFS) standard for publishing transit data is ubiquitous. GTFS being tabular data, with information spread across different files, necessitates specialized tools or packages to retrieve information. Concurrently, the use of Large Language Models for text and information retrieval is growing. The idea of this research is to see if the current widely adopted LLMs (ChatGPT) are able to retrieve information from GTFS using natural language instructions. We first test whether ChatGPT (GPT-3.5) understands the GTFS specification. GPT-3.5 answers 77% of our multiple-choice questions (MCQ) correctly. Next, we task the LLM with information extractions from a filtered GTFS feed with 4 routes. For information retrieval, we compare zero-shot and program synthesis. Program synthesis works better, achieving ~90% accuracy on simple questions and ~40% accuracy on complex questions.
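To illustrate why GTFS needs cross-file handling, the sketch below joins two of its standard tables, routes.txt and trips.txt, to count trips per route with pandas (the feed directory is an assumption; file and column names follow the GTFS spec):

```python
import pandas as pd

def routes_with_trip_counts(gtfs_dir):
    """Join routes.txt with trips.txt on route_id and count trips per route,
    the kind of multi-file lookup the LLM is asked to perform."""
    routes = pd.read_csv(f"{gtfs_dir}/routes.txt")
    trips = pd.read_csv(f"{gtfs_dir}/trips.txt")
    counts = trips.groupby("route_id").size().rename("n_trips")
    return routes.merge(counts, on="route_id", how="left")
```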
Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text
results: Using two baseline models, Vicuna-13B and Alpaca-LoRA-13B, with automatically generated prompts from test cases, the study shows that language models can generate KGs from text guided by an ontology, but that there is considerable room for improvement, particularly by combining Semantic Web and natural language processing techniques.
Abstract
The recent advances in large language models (LLM) and foundation models with emergent capabilities have been shown to improve the performance of many NLP tasks. LLMs and Knowledge Graphs (KG) can complement each other such that LLMs can be used for KG construction or completion while existing KGs can be used for different tasks such as making LLM outputs explainable or fact-checking in Neuro-Symbolic manner. In this paper, we present Text2KGBench, a benchmark to evaluate the capabilities of language models to generate KGs from natural language text guided by an ontology. Given an input ontology and a set of sentences, the task is to extract facts from the text while complying with the given ontology (concepts, relations, domain/range constraints) and being faithful to the input sentences. We provide two datasets (i) Wikidata-TekGen with 10 ontologies and 13,474 sentences and (ii) DBpedia-WebNLG with 19 ontologies and 4,860 sentences. We define seven evaluation metrics to measure fact extraction performance, ontology conformance, and hallucinations by LLMs. Furthermore, we provide results for two baseline models, Vicuna-13B and Alpaca-LoRA-13B using automatic prompt generation from test cases. The baseline results show that there is room for improvement using both Semantic Web and Natural Language Processing techniques.
Adapting to Change: Robust Counterfactual Explanations in Dynamic Data Landscapes
paper_authors: Bardh Prenkaj, Mario Villaizan-Vallelado, Tobias Leemann, Gjergji Kasneci
for: This paper proposes a novel semi-supervised graph counterfactual explainer (GCE) method called Dynamic GRAph Counterfactual Explainer (DyGRACE), which learns the representation of each class in a binary classification scenario and identifies counterfactuals without relying on an underlying black-box oracle.
methods: DyGRACE leverages two graph autoencoders (GAEs) to learn graph representations and optimizes a parametric density function (implemented as a logistic regression function) to identify counterfactuals by maximizing the factual autoencoder's reconstruction error, while minimizing the counterfactual autoencoder's error and maximizing the similarity between the factual and counterfactual graphs.
results: DyGRACE is effective at identifying counterfactuals and can act as a drift detector, identifying distributional drift from differences in reconstruction errors between iterations. It avoids reliance on the oracle's predictions in successive iterations, increasing the efficiency of counterfactual discovery, and its capacity for contrastive learning and drift detection opens new avenues for semi-supervised learning and explanation generation.
Abstract
We introduce a novel semi-supervised Graph Counterfactual Explainer (GCE) methodology, Dynamic GRAph Counterfactual Explainer (DyGRACE). It leverages initial knowledge about the data distribution to search for valid counterfactuals while avoiding using information from potentially outdated decision functions in subsequent time steps. Employing two graph autoencoders (GAEs), DyGRACE learns the representation of each class in a binary classification scenario. The GAEs minimise the reconstruction error between the original graph and its learned representation during training. The method involves (i) optimising a parametric density function (implemented as a logistic regression function) to identify counterfactuals by maximising the factual autoencoder's reconstruction error, (ii) minimising the counterfactual autoencoder's error, and (iii) maximising the similarity between the factual and counterfactual graphs. This semi-supervised approach is independent of an underlying black-box oracle. A logistic regression model is trained on a set of graph pairs to learn weights that aid in finding counterfactuals. At inference, for each unseen graph, the logistic regressor identifies the best counterfactual candidate using these learned weights, while the GAEs can be iteratively updated to represent the continual adaptation of the learned graph representation over iterations. DyGRACE is quite effective and can act as a drift detector, identifying distributional drift based on differences in reconstruction errors between iterations. It avoids reliance on the oracle's predictions in successive iterations, thereby increasing the efficiency of counterfactual discovery. DyGRACE, with its capacity for contrastive learning and drift detection, will offer new avenues for semi-supervised learning and explanation generation.
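A minimal sketch of how the three optimization terms above might combine into a single candidate score, assuming the two trained GAEs are available as reconstruction-error functions and the logistic-regression weights have already been learned; the exact parameterization in DyGRACE may differ.

```python
import numpy as np

def score_counterfactual(g, factual_g, recon_err_factual, recon_err_cf,
                         similarity, w):
    # A good counterfactual should (i) be reconstructed poorly by the factual
    # GAE, (ii) be reconstructed well by the counterfactual GAE, and
    # (iii) stay similar to the factual graph.
    feats = np.array([recon_err_factual(g),
                      -recon_err_cf(g),
                      similarity(g, factual_g)])
    return 1.0 / (1.0 + np.exp(-w @ feats))  # logistic-regression score

# At inference, the best candidate for an unseen factual graph would be
# max(candidates, key=lambda g: score_counterfactual(g, factual_g, ...)).
```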
Vehicles Control: Collision Avoidance using Federated Deep Reinforcement Learning
results: The results show that the FDDPG algorithm controls vehicles more effectively, preventing collisions and raising average speed; compared with the local DDPG model, FDDPG achieves greater reductions in travel time and larger gains in average speed.
Abstract
In the face of growing urban populations and the escalating number of vehicles on the roads, managing transportation efficiently and ensuring safety have become critical challenges. To tackle these issues, the development of intelligent control systems for vehicles is paramount. This paper presents a comprehensive study on vehicle control for collision avoidance, leveraging the power of Federated Deep Reinforcement Learning (FDRL) techniques. Our main goal is to minimize travel delays and enhance the average speed of vehicles while prioritizing safety and preserving data privacy. To accomplish this, we conducted a comparative analysis between the local model, Deep Deterministic Policy Gradient (DDPG), and the global model, Federated Deep Deterministic Policy Gradient (FDDPG), to determine their effectiveness in optimizing vehicle control for collision avoidance. The results obtained indicate that the FDDPG algorithm outperforms DDPG in terms of effectively controlling vehicles and preventing collisions. Significantly, the FDDPG-based algorithm demonstrates substantial reductions in travel delays and notable improvements in average speed compared to the DDPG algorithm.
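The federated step underlying FDDPG can be pictured as standard federated averaging over locally trained DDPG actor/critic weights. The sketch below shows only that aggregation step, with placeholder layer names and shapes, as one way to read the abstract.

```python
import numpy as np

def fed_avg(local_models):
    """Average each layer across clients (FedAvg).
    local_models: list of dicts mapping layer name -> weight array."""
    return {name: np.mean([m[name] for m in local_models], axis=0)
            for name in local_models[0]}

# Three vehicles with locally trained (placeholder) actor weights:
clients = [{"actor_w": np.random.randn(4, 2)} for _ in range(3)]
global_model = fed_avg(clients)          # broadcast back to all vehicles
print(global_model["actor_w"].shape)     # (4, 2)
```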
RAHNet: Retrieval Augmented Hybrid Network for Long-tailed Graph Classification
results: Experiments show that RAHNet outperforms current state-of-the-art methods on a variety of popular benchmarks.
Abstract
Graph classification is a crucial task in many real-world multimedia applications, where graphs can represent various multimedia data types such as images, videos, and social networks. Previous efforts have applied graph neural networks (GNNs) in balanced situations where the class distribution is balanced. However, real-world data typically exhibit long-tailed class distributions, resulting in a bias towards the head classes when using GNNs and limited generalization ability over the tail classes. Recent approaches mainly focus on re-balancing different classes during model training, which fails to explicitly introduce new knowledge and sacrifices the performance of the head classes. To address these drawbacks, we propose a novel framework called Retrieval Augmented Hybrid Network (RAHNet) to jointly learn a robust feature extractor and an unbiased classifier in a decoupled manner. In the feature extractor training stage, we develop a graph retrieval module to search for relevant graphs that directly enrich the intra-class diversity for the tail classes. Moreover, we innovatively optimize a category-centered supervised contrastive loss to obtain discriminative representations, which is more suitable for long-tailed scenarios. In the classifier fine-tuning stage, we balance the classifier weights with two weight regularization techniques, i.e., Max-norm and weight decay. Experiments on various popular benchmarks verify the superiority of the proposed method against state-of-the-art approaches.
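Of the two classifier regularizers named above, Max-norm is the easier to picture: after each optimizer step, every class weight vector is rescaled so its L2 norm does not exceed a cap. A minimal PyTorch sketch, with arbitrary dimensions and cap value (weight decay would be passed to the optimizer directly):

```python
import torch

def apply_max_norm(linear: torch.nn.Linear, max_norm: float = 1.0):
    """Clip each class's weight vector to an L2 norm of at most max_norm."""
    with torch.no_grad():
        norms = linear.weight.norm(p=2, dim=1, keepdim=True)
        linear.weight.mul_((max_norm / norms).clamp(max=1.0))

classifier = torch.nn.Linear(128, 10)   # placeholder feature dim / classes
# ... loss.backward(); optimizer.step() ...
apply_max_norm(classifier, max_norm=1.0)
```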
Interoperable synthetic health data with SyntHIR to enable the development of CDSS tools
results: The paper develops a machine learning-based CDSS tool using data from the Norwegian Patient Register (NPR) and the Norwegian Patient Prescriptions (NorPD), and tests the tool on the SyntHIR system.
Abstract
There is a great opportunity to use high-quality patient journals and health registers to develop machine learning-based Clinical Decision Support Systems (CDSS). To implement a CDSS tool in a clinical workflow, there is a need to integrate, validate and test this tool on the Electronic Health Record (EHR) systems used to store and manage patient data. However, it is often not possible to get the necessary access to an EHR system due to legal compliance. We propose an architecture for generating and using synthetic EHR data for CDSS tool development. The architecture is implemented in a system called SyntHIR. The SyntHIR system uses the Fast Healthcare Interoperability Resources (FHIR) standards for data interoperability, the Gretel framework for generating synthetic data, the Microsoft Azure FHIR server as the FHIR-based EHR system and SMART on FHIR framework for tool transportability. We demonstrate the usefulness of SyntHIR by developing a machine learning-based CDSS tool using data from the Norwegian Patient Register (NPR) and Norwegian Patient Prescriptions (NorPD). We demonstrate the development of the tool on the SyntHIR system and then lift it to the Open DIPS environment. In conclusion, SyntHIR provides a generic architecture for CDSS tool development using synthetic FHIR data and a testing environment before implementing it in a clinical setting. However, there is scope for improvement in terms of the quality of the synthetic data generated. The code is open source and available at https://github.com/potter-coder89/SyntHIR.git.
A Controllable Co-Creative Agent for Game System Design
paper_authors: Rohan Agarwal, Zhiyu Lin, Mark Riedl
for: The paper aims to model game systems abstractly enough to enable controllable co-creation in games of any genre.
methods: The paper models games with state-machine-like components and resource flows, and evaluates designs with controllable metrics and a design evaluator that simulates playthroughs.
results: The study finds that the system can express a wide range of games and can be human-controlled for future co-creative applications.
Abstract
Many advancements have been made in procedural content generation for games, and with mixed-initiative co-creativity, have the potential for great benefits to human designers. However, co-creative systems for game generation are typically limited to specific genres, rules, or games, limiting the creativity of the designer. We seek to model games abstractly enough to apply to any genre, focusing on designing game systems and mechanics, and create a controllable, co-creative agent that can collaborate on these designs. We present a model of games using state-machine-like components and resource flows, a set of controllable metrics, a design evaluator simulating playthroughs with these metrics, and an evolutionary design balancer and generator. We find this system to be both able to express a wide range of games and able to be human-controllable for future co-creative applications.
Who Answers It Better? An In-Depth Analysis of ChatGPT and Stack Overflow Answers to Software Engineering Questions
paper_authors: Samia Kabir, David N. Udo-Imeh, Bonan Kou, Tianyi Zhang
For: The paper aims to investigate the quality and usability of ChatGPT's responses to software engineering queries on Stack Overflow.
Methods: The authors analyzed 517 questions from Stack Overflow and assessed the correctness, consistency, comprehensiveness, and conciseness of ChatGPT's responses. They also conducted an extensive linguistic analysis and a user study to gain insights into the linguistic and human aspects of ChatGPT's answers.
Results: The authors found that 52% of ChatGPT's answers contain inaccuracies and 77% are verbose. However, users still prefer ChatGPT's responses 39.34% of the time due to their comprehensiveness and articulate language style. The findings highlight the need for meticulous error correction in ChatGPT while also raising awareness among users about the potential risks associated with seemingly accurate answers.
Abstract
Over the last decade, Q&A platforms have played a crucial role in how programmers seek help online. The emergence of ChatGPT, however, is causing a shift in this pattern. Despite ChatGPT's popularity, there hasn't been a thorough investigation into the quality and usability of its responses to software engineering queries. To address this gap, we undertook a comprehensive analysis of ChatGPT's replies to 517 questions from Stack Overflow (SO). We assessed the correctness, consistency, comprehensiveness, and conciseness of these responses. Additionally, we conducted an extensive linguistic analysis and a user study to gain insights into the linguistic and human aspects of ChatGPT's answers. Our examination revealed that 52% of ChatGPT's answers contain inaccuracies and 77% are verbose. Nevertheless, users still prefer ChatGPT's responses 39.34% of the time due to their comprehensiveness and articulate language style. These findings underscore the need for meticulous error correction in ChatGPT while also raising awareness among users about the potential risks associated with seemingly accurate answers.
results: The output of this study is 81 responsibility strings, which give people across disciplines precise and specific language for the different ways actors can be responsible when discussing responsibility for complex AI systems.
Abstract
To reason about where responsibility does and should lie in complex situations involving AI-enabled systems, we first need a sufficiently clear and detailed cross-disciplinary vocabulary for talking about responsibility. Responsibility is a triadic relation involving an actor, an occurrence, and a way of being responsible. As part of a conscious effort towards 'unravelling' the concept of responsibility to support practical reasoning about responsibility for AI, this paper takes the three-part formulation, 'Actor A is responsible for Occurrence O' and identifies valid combinations of subcategories of A, is responsible for, and O. These valid combinations - which we term "responsibility strings" - are grouped into four senses of responsibility: role-responsibility; causal responsibility; legal liability-responsibility; and moral responsibility. They are illustrated with two running examples, one involving a healthcare AI-based system and another the fatal collision of an AV with a pedestrian in Tempe, Arizona in 2018. The output of the paper is 81 responsibility strings. The aim is that these strings provide the vocabulary for people across disciplines to be clear and specific about the different ways that different actors are responsible for different occurrences within a complex event for which responsibility is sought, allowing for precise and targeted interdisciplinary normative deliberations.
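The combinatorial structure of the responsibility strings can be illustrated as filtering a Cartesian product of subcategories. The actor and occurrence subcategories and the validity rule below are invented placeholders, not the paper's taxonomy, which yields the 81 valid strings.

```python
from itertools import product

actors = ["natural person", "organization", "AI system"]          # placeholders
senses = ["role-responsibility", "causal responsibility",
          "legal liability-responsibility", "moral responsibility"]
occurrences = ["event", "omission", "state of affairs"]           # placeholders

def is_valid(actor, sense, occurrence):
    # Invented example constraint: only natural persons bear moral responsibility.
    return not (sense == "moral responsibility" and actor != "natural person")

strings = [f"{a} bears {s} for {o}"
           for a, s, o in product(actors, senses, occurrences)
           if is_valid(a, s, o)]
print(len(strings))   # 30 here; the paper's own taxonomy yields 81
```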
Learning to Select the Relevant History Turns in Conversational Question Answering
paper_authors: Munazza Zaib, Wei Emma Zhang, Quan Z. Sheng, Subhash Sagar, Adnan Mahmood, Yang Zhang
for: The paper proposes a Dynamic History Selection (DHS) framework for conversational question answering (ConvQA) that optimizes the selection of conversational history turns to improve question-answering performance.
methods: The paper uses a similarity-based approach that first generates context and question entities for all history turns, then prunes and re-ranks them according to their similarity with the question at hand; it also proposes an attention-based mechanism to refine the terms in the re-ranked context.
results: Experiments show that selecting relevant history turns works better than rewriting the original question, and that adding irrelevant history turns harms model performance; the paper also discusses research challenges that demand more attention from the IR community.
Abstract
The increasing demand for the web-based digital assistants has given a rapid rise in the interest of the Information Retrieval (IR) community towards the field of conversational question answering (ConvQA). However, one of the critical aspects of ConvQA is the effective selection of conversational history turns to answer the question at hand. The dependency between relevant history selection and correct answer prediction is an intriguing but under-explored area. The selected relevant context can better guide the system so as to where exactly in the passage to look for an answer. Irrelevant context, on the other hand, brings noise to the system, thereby resulting in a decline in the model's performance. In this paper, we propose a framework, DHS-ConvQA (Dynamic History Selection in Conversational Question Answering), that first generates the context and question entities for all the history turns, which are then pruned on the basis of similarity they share in common with the question at hand. We also propose an attention-based mechanism to re-rank the pruned terms based on their calculated weights of how useful they are in answering the question. In the end, we further aid the model by highlighting the terms in the re-ranked conversational history using a binary classification task and keeping the useful terms (predicted as 1) and ignoring the irrelevant terms (predicted as 0). We demonstrate the efficacy of our proposed framework with extensive experimental results on CANARD and QuAC -- the two popularly utilized datasets in ConvQA. We demonstrate that selecting relevant turns works better than rewriting the original question. We also investigate how adding the irrelevant history turns negatively impacts the model's performance and discuss the research challenges that demand more attention from the IR community.
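The first stage of the framework, as we read it, amounts to similarity-based pruning of history turns. The sketch below uses cosine similarity over placeholder embeddings and an arbitrary threshold, leaving out the paper's attention-based re-ranking and binary term highlighting.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def select_relevant_turns(question_vec, turn_vecs, threshold=0.5):
    """Keep history turns similar to the question, most similar first."""
    scores = [cosine(question_vec, t) for t in turn_vecs]
    keep = [i for i, s in enumerate(scores) if s >= threshold]
    return sorted(keep, key=lambda i: -scores[i])

q = np.random.randn(16)                         # placeholder embeddings
turns = [np.random.randn(16) for _ in range(5)]
print(select_relevant_turns(q, turns, threshold=0.0))
```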
A stochastic optimization approach to train non-linear neural networks with a higher-order variation regularization
for: This study aims to address overfitting in highly expressive parametric models, such as deep neural networks, by adding a $(k,q)$th order variation regularization ($(k,q)$-VR) term to the loss function.
methods: The study proposes a stochastic optimization algorithm that can efficiently train general parametric models with the $(k,q)$-VR term without conducting explicit numerical integration. The approach applies even to deep neural networks of arbitrary structure, as it requires only a simple stochastic gradient descent algorithm and automatic differentiation.
results: Numerical experiments demonstrate that neural networks trained with the $(k,q)$-VR terms are more "resilient" than those with conventional parameter regularization. The proposed algorithm can also be extended to the physics-informed training of neural networks (PINNs).
Abstract
While highly expressive parametric models including deep neural networks have an advantage to model complicated concepts, training such highly non-linear models is known to yield a high risk of notorious overfitting. To address this issue, this study considers a $(k,q)$th order variation regularization ($(k,q)$-VR), which is defined as the $q$th-powered integral of the absolute $k$th order derivative of the parametric models to be trained; penalizing the $(k,q)$-VR is expected to yield a smoother function, which is expected to avoid overfitting. Particularly, $(k,q)$-VR encompasses the conventional (general-order) total variation with $q=1$. While the $(k,q)$-VR terms applied to general parametric models are computationally intractable due to the integration, this study provides a stochastic optimization algorithm, that can efficiently train general models with the $(k,q)$-VR without conducting explicit numerical integration. The proposed approach can be applied to the training of even deep neural networks whose structure is arbitrary, as it can be implemented by only a simple stochastic gradient descent algorithm and automatic differentiation. Our numerical experiments demonstrate that the neural networks trained with the $(k,q)$-VR terms are more ``resilient'' than those with the conventional parameter regularization. The proposed algorithm also can be extended to the physics-informed training of neural networks (PINNs).
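For $k=1$, the penalty and its stochastic estimate are easy to sketch: sample points, take the derivative by automatic differentiation, and average $|f'(x)|^q$ instead of integrating explicitly. A minimal PyTorch illustration under those assumptions (higher $k$ would nest `autograd.grad` calls); the network and sampling range are placeholders.

```python
import torch

def kq_variation_penalty(model, q=2.0, n_samples=64, lo=-1.0, hi=1.0):
    """Monte Carlo estimate of E[|f'(x)|^q] over Uniform(lo, hi) (k = 1)."""
    x = torch.empty(n_samples, 1).uniform_(lo, hi).requires_grad_(True)
    y = model(x)
    grad = torch.autograd.grad(y.sum(), x, create_graph=True)[0]
    return grad.abs().pow(q).mean()

model = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(),
                            torch.nn.Linear(32, 1))
penalty = kq_variation_penalty(model, q=2.0)
penalty.backward()   # differentiable w.r.t. the weights, so SGD can use it
# In training, the total loss would be data_loss + lam * penalty.
```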
Frustratingly Easy Model Generalization by Dummy Risk Minimization
results: Across a variety of tasks, including conventional classification, semantic segmentation, out-of-distribution generalization, adversarial training, and long-tailed recognition, DuRM consistently improves performance; it is also compatible with existing generalization techniques, though some limitations may remain.
Abstract
Empirical risk minimization (ERM) is a fundamental machine learning paradigm. However, its generalization ability is limited in various tasks. In this paper, we devise Dummy Risk Minimization (DuRM), a frustratingly easy and general technique to improve the generalization of ERM. DuRM is extremely simple to implement: just enlarging the dimension of the output logits and then optimizing using standard gradient descent. Moreover, we validate the efficacy of DuRM on both theoretical and empirical analysis. Theoretically, we show that DuRM derives greater variance of the gradient, which facilitates model generalization by observing better flat local minima. Empirically, we conduct evaluations of DuRM across different datasets, modalities, and network architectures on diverse tasks, including conventional classification, semantic segmentation, out-of-distribution generalization, adverserial training, and long-tailed recognition. Results demonstrate that DuRM could consistently improve the performance under all tasks with an almost free lunch manner. Furthermore, we show that DuRM is compatible with existing generalization techniques and we discuss possible limitations. We hope that DuRM could trigger new interest in the fundamental research on risk minimization.
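Since DuRM is described as simply enlarging the output logits, a minimal sketch is almost one line: add dummy classes that no label ever selects and train with ordinary cross-entropy. The dummy count below is an arbitrary choice.

```python
import torch
import torch.nn.functional as F

num_classes, num_dummy = 10, 5                         # num_dummy is arbitrary
model = torch.nn.Linear(128, num_classes + num_dummy)  # enlarged output logits

x = torch.randn(32, 128)                    # toy batch of features
y = torch.randint(0, num_classes, (32,))    # labels only use the real classes
loss = F.cross_entropy(model(x), y)         # dummy logits never get a label
loss.backward()
```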
DIVERSIFY: A General Framework for Time Series Out-of-distribution Detection and Generalization
results: Experiments show that DIVERSIFY learns more generalized features for time series OOD detection and generalization, achieving strong results on seven different time series datasets.
Abstract
Time series remains one of the most challenging modalities in machine learning research. The out-of-distribution (OOD) detection and generalization on time series tend to suffer due to its non-stationary property, i.e., the distribution changes over time. The dynamic distributions inside time series pose great challenges to existing algorithms to identify invariant distributions since they mainly focus on the scenario where the domain information is given as prior knowledge. In this paper, we attempt to exploit subdomains within a whole dataset to counteract issues induced by non-stationary for generalized representation learning. We propose DIVERSIFY, a general framework, for OOD detection and generalization on dynamic distributions of time series. DIVERSIFY takes an iterative process: it first obtains the "worst-case" latent distribution scenario via adversarial training, then reduces the gap between these latent distributions. We implement DIVERSIFY via combining existing OOD detection methods according to either extracted features or outputs of models for detection while we also directly utilize outputs for classification. In addition, theoretical insights illustrate that DIVERSIFY is theoretically supported. Extensive experiments are conducted on seven datasets with different OOD settings across gesture recognition, speech commands recognition, wearable stress and affect detection, and sensor-based human activity recognition. Qualitative and quantitative results demonstrate that DIVERSIFY learns more generalized features and significantly outperforms other baselines.
Semantic Channel Equalizer: Modelling Language Mismatch in Multi-User Semantic Communications
results: Compared with traditional approaches, the proposed semantic channel equalizer performs better in terms of operational complexity and transmission accuracy.
Abstract
We consider a multi-user semantic communications system in which agents (transmitters and receivers) interact through the exchange of semantic messages to convey meanings. In this context, languages are instrumental in structuring the construction and consolidation of knowledge, influencing conceptual representation and semantic extraction and interpretation. Yet, the crucial role of languages in semantic communications is often overlooked. When this is not the case, agent languages are assumed compatible and unambiguously interoperable, ignoring practical limitations that may arise due to language mismatching. This is the focus of this work. When agents use distinct languages, message interpretation is prone to semantic noise resulting from critical distortion introduced by semantic channels. To address this problem, this paper proposes a new semantic channel equalizer to counteract and limit the critical ambiguity in message interpretation. Our proposed solution models the mismatch of languages with measurable transformations over semantic representation spaces. We achieve this using optimal transport theory, where we model such transformations as transportation maps. Then, to recover at the receiver the meaning intended by the teacher we operate semantic equalization to compensate for the transformation introduced by the semantic channel, either before transmission and/or after the reception of semantic messages. We implement the proposed approach as an operation over a codebook of transformations specifically designed for successful communication. Numerical results show that the proposed semantic channel equalizer outperforms traditional approaches in terms of operational complexity and transmission accuracy.
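The transportation-map idea can be illustrated with a small entropic-regularized Sinkhorn solver that couples two agents' empirical semantic distributions. This is our toy reading, not the paper's exact construction, and the symbol embeddings are random placeholders.

```python
import numpy as np

def sinkhorn(cost, eps=0.05, n_iters=200):
    """Entropic-regularized optimal transport between uniform marginals."""
    cost = cost / cost.max()                 # scale costs to avoid underflow
    n, m = cost.shape
    K = np.exp(-cost / eps)
    a, b = np.ones(n) / n, np.ones(m) / m    # uniform marginals
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]       # transport plan

src = np.random.randn(8, 4)                  # sender's semantic symbols (toy)
tgt = np.random.randn(8, 4)                  # receiver's semantic symbols (toy)
cost = ((src[:, None, :] - tgt[None, :, :]) ** 2).sum(-1)
plan = sinkhorn(cost)
print(plan.sum())                            # ~1.0: a valid coupling
```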
DTF-Net: Category-Level Pose Estimation and Shape Reconstruction via Deformable Template Field
results: Extensive experiments on the REAL275 and CAMERA25 datasets demonstrate the method's superiority in real scenes; the method also supports grasping tasks with a real robot arm.
Abstract
Estimating 6D poses and reconstructing 3D shapes of objects in open-world scenes from RGB-depth image pairs is challenging. Many existing methods rely on learning geometric features that correspond to specific templates while disregarding shape variations and pose differences among objects in the same category. As a result, these methods underperform when handling unseen object instances in complex environments. In contrast, other approaches aim to achieve category-level estimation and reconstruction by leveraging normalized geometric structure priors, but the static prior-based reconstruction struggles with substantial intra-class variations. To solve these problems, we propose the DTF-Net, a novel framework for pose estimation and shape reconstruction based on implicit neural fields of object categories. In DTF-Net, we design a deformable template field to represent the general category-wise shape latent features and intra-category geometric deformation features. The field establishes continuous shape correspondences, deforming the category template into arbitrary observed instances to accomplish shape reconstruction. We introduce a pose regression module that shares the deformation features and template codes from the fields to estimate the accurate 6D pose of each object in the scene. We integrate a multi-modal representation extraction module to extract object features and semantic masks, enabling end-to-end inference. Moreover, during training, we implement a shape-invariant training strategy and a viewpoint sampling method to further enhance the model's capability to extract object pose features. Extensive experiments on the REAL275 and CAMERA25 datasets demonstrate the superiority of DTF-Net in both synthetic and real scenes. Furthermore, we show that DTF-Net effectively supports grasping tasks with a real robot arm.
results: Through a series of examples, the article illustrates the prevalence and magnitude of sampling bias in web-scraped data, and provides recommendations for anticipating, detecting, and overcoming it.
Abstract
The increasing adoption of econometric and machine-learning approaches by empirical researchers has led to a widespread use of one data collection method: web scraping. Web scraping refers to the use of automated computer programs to access websites and download their content. The key argument of this paper is that naïve web scraping procedures can lead to sampling bias in the collected data. This article describes three sources of sampling bias in web-scraped data. More specifically, sampling bias emerges from web content being volatile (i.e., being subject to change), personalized (i.e., presented in response to request characteristics), and unindexed (i.e., absence of a population register). In a series of examples, I illustrate the prevalence and magnitude of sampling bias. To support researchers and reviewers, this paper provides recommendations on anticipating, detecting, and overcoming sampling bias in web-scraped data.
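A simple probe for the first bias source, volatility, is to fetch the same URL at two points in time and compare content hashes; a real audit would repeat this on a schedule and log the change rate. The target URL and interval below are placeholders.

```python
import hashlib
import time
import urllib.request

def content_hash(url):
    """SHA-256 of the raw response body."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return hashlib.sha256(resp.read()).hexdigest()

url = "https://example.com"   # placeholder target
h1 = content_hash(url)
time.sleep(60)                # placeholder interval between probes
h2 = content_hash(url)
print("volatile" if h1 != h2 else "stable between probes")
```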
Federated Learning: Organizational Opportunities, Challenges, and Adoption Strategies
results: The paper argues that exemplary organizations in different industries may take different approaches to FL, and that FL represents an institutional shift offering ample interdisciplinary research opportunities for the business and information systems engineering community.
Abstract
Restrictive rules for data sharing in many industries have led to the development of federated learning (FL). FL is a machine learning (ML) technique that allows distributed clients to train models collaboratively without the need to share their respective training data with others. In this article, we first explore the technical basics of FL and its potential applications. Second, we present a conceptual framework for the adoption of FL, mapping organizations along the lines of their artificial intelligence (AI) capabilities and environment. We then discuss why exemplary organizations in different industries, including industry consortia, established banks, public authorities, and data-intensive SMEs, might consider different approaches to FL. To conclude, we argue that FL presents an institutional shift with ample interdisciplinary research opportunities for the business and information systems engineering community.
Towards Personalized Prompt-Model Retrieval for Generative Recommendation
for: The paper explores a new recommendation task in which items are created by generative models with personalized prompts, and proposes a two-stage framework, Prompt-Model Retrieval and Generated Item Ranking, to approach this new task formulation.
methods: The paper generates 18K images by pairing 200 publicly available generative models with a diverse set of 90 textual prompts, and examines evaluation metrics for the personalization capability of generative models.
results: The findings show that generative model recommendation is a promising personalization problem, while existing evaluation metrics have limitations; the authors also outline future directions for the RecSys community to advance toward generative recommender systems.
Abstract
Recommender Systems are built to retrieve relevant items to satisfy users' information needs. The candidate corpus usually consists of a finite set of items that are ready to be served, such as videos, products, or articles. With recent advances in Generative AI such as GPT and Diffusion models, a new form of recommendation task is yet to be explored where items are to be created by generative models with personalized prompts. Taking image generation as an example, with a single prompt from the user and access to a generative model, it is possible to generate hundreds of new images in a few minutes. How shall we attain personalization in the presence of "infinite" items? In this preliminary study, we propose a two-stage framework, namely Prompt-Model Retrieval and Generated Item Ranking, to approach this new task formulation. We release GEMRec-18K, a prompt-model interaction dataset with 18K images generated by 200 publicly-available generative models paired with a diverse set of 90 textual prompts. Our findings demonstrate the promise of generative model recommendation as a novel personalization problem and the limitations of existing evaluation metrics. We highlight future directions for the RecSys community to advance towards generative recommender systems. Our code and dataset are available at https://github.com/MAPS-research/GEMRec.
results: The study identifies the best-performing Spanish language models and publicly releases all tested corpora and the best models, so that the results can be reproduced by independent teams or used as a reference when new Spanish clinical language models are created.
Abstract
This survey focuses on encoder language models for solving tasks in the clinical domain in the Spanish language. We review the contributions of 17 corpora focused mainly on clinical tasks, then list the most relevant Spanish language models and Spanish clinical language models. We perform a thorough comparison of these models by benchmarking them over a curated subset of the available corpora, in order to find the best-performing ones; in total more than 3000 models were fine-tuned for this study. All the tested corpora and the best models are made publicly available in an accessible way, so that the results can be reproduced by independent teams or challenged in the future when new Spanish clinical language models are created.
On stable wrapper-based parameter selection method for efficient ANN-based data-driven modeling of turbulent flows
paper_authors: Hyeongeun Yun, Yongcheol Choi, Youngjae Kim, Seongwon Kang
for: This study aims to analyze and develop a reduced modeling approach based on artificial neural networks (ANNs) and wrapper methods for data-driven modeling of complex turbulent flow and heat transfer phenomena.
methods: The approach combines ANNs with wrapper methods, which have an advantage over alternatives such as correlation-based filter methods in removing redundant or irrelevant parameters even under non-linearity; as a downside, the overfitting and randomness of ANN training may produce inconsistent subsets over selection trials, especially in higher physical dimensions.
results: The study analyzes several existing ANN-based wrapper methods and develops a revised one based on gradient-based subset selection indices that minimizes the loss in the total derivative or the directional consistency at each elimination step. Applied to a manufactured subset selection problem, modeling of the bubble size in a turbulent bubbly flow, and modeling of the spatially varying turbulent Prandtl number in a duct flow, the gradient-based subset selection shows improved consistency over trials while successfully removing unnecessary parameters, and the reduced parameter subsets also train slightly faster.
Abstract
To model complex turbulent flow and heat transfer phenomena, this study aims to analyze and develop a reduced modeling approach based on artificial neural network (ANN) and wrapper methods. This approach has an advantage over other methods such as the correlation-based filter method in terms of removing redundant or irrelevant parameters even under non-linearity among them. As a downside, the overfitting and randomness of ANN training may produce inconsistent subsets over selection trials especially in a higher physical dimension. This study analyzes a few existing ANN-based wrapper methods and develops a revised one based on the gradient-based subset selection indices to minimize the loss in the total derivative or the directional consistency at each elimination step. To examine parameter reduction performance and consistency-over-trials, we apply these methods to a manufactured subset selection problem, modeling of the bubble size in a turbulent bubbly flow, and modeling of the spatially varying turbulent Prandtl number in a duct flow. It is found that the gradient-based subset selection to minimize the total derivative loss results in improved consistency-over-trials compared to the other ANN-based wrapper methods, while removing unnecessary parameters successfully. For the reduced turbulent Prandtl number model, the gradient-based subset selection improves the prediction in the validation case over the other methods. Also, the reduced parameter subsets show a slight increase in the training speed compared to the others.
Explaining Relation Classification Models with Semantic Extents
results: Our study shows that models tend to learn shortcut patterns in classification tasks, and that these patterns are hard to detect with current interpretability methods such as input reduction. Our approach can help detect and eliminate such spurious decision patterns, improving the reliability and security of models.
Abstract
In recent years, the development of large pretrained language models, such as BERT and GPT, significantly improved information extraction systems on various tasks, including relation classification. State-of-the-art systems are highly accurate on scientific benchmarks. A lack of explainability is currently a complicating factor in many real-world applications. Comprehensible systems are necessary to prevent biased, counterintuitive, or harmful decisions. We introduce semantic extents, a concept to analyze decision patterns for the relation classification task. Semantic extents are the most influential parts of texts concerning classification decisions. Our definition allows similar procedures to determine semantic extents for humans and models. We provide an annotation tool and a software framework to determine semantic extents for humans and models conveniently and reproducibly. Comparing both reveals that models tend to learn shortcut patterns from data. These patterns are hard to detect with current interpretability methods, such as input reductions. Our approach can help detect and eliminate spurious decision patterns during model development. Semantic extents can increase the reliability and security of natural language processing systems. Semantic extents are an essential step in enabling applications in critical areas like healthcare or finance. Moreover, our work opens new research directions for developing methods to explain deep learning models.
AutoML4ETC: Automated Neural Architecture Search for Real-World Encrypted Traffic Classification
results: The tool classifies traffic with higher accuracy and greater efficiency across multiple datasets, including public benchmarks and real-world TLS and QUIC traffic.
Abstract
Deep learning (DL) has been successfully applied to encrypted network traffic classification in experimental settings. However, in production use, it has been shown that a DL classifier's performance inevitably decays over time. Re-training the model on newer datasets has been shown to only partially improve its performance. Manually re-tuning the model architecture to meet the performance expectations on newer datasets is time-consuming and requires domain expertise. We propose AutoML4ETC, a novel tool to automatically design efficient and high-performing neural architectures for encrypted traffic classification. We define a novel, powerful search space tailored specifically for the near real-time classification of encrypted traffic using packet header bytes. We show that with different search strategies over our search space, AutoML4ETC generates neural architectures that outperform the state-of-the-art encrypted traffic classifiers on several datasets, including public benchmark datasets and real-world TLS and QUIC traffic collected from the Orange mobile network. In addition to being more accurate, AutoML4ETC's architectures are significantly more efficient and lighter in terms of the number of parameters. Finally, we make AutoML4ETC publicly available for future research.
Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization
paper_authors: Weiran Yao, Shelby Heinecke, Juan Carlos Niebles, Zhiwei Liu, Yihao Feng, Le Xue, Rithesh Murthy, Zeyuan Chen, Jianguo Zhang, Devansh Arpit, Ran Xu, Phil Mui, Huan Wang, Caiming Xiong, Silvio Savarese
for: The paper explores how to use policy gradients to optimize large language model (LLM) agents, improving their performance on multi-step tasks.
methods: The paper introduces a principled framework that learns a retrospective model to automatically tune language agent prompts from environment feedback via policy gradient; specifically, the framework learns from rewards across multiple environments and tasks to fine-tune a pre-trained language model that refines the agent prompt by summarizing the root cause of prior failed attempts and proposing action plans.
results: Experiments show that policy-gradient optimization improves language agents over time, and that the approach considerably outperforms baselines across multiple tasks; this suggests that policy-gradient optimization of language agents is a promising direction that can be applied to other models in the agent architecture to enhance performance.
Abstract
Recent months have seen the emergence of a powerful new trend in which large language models (LLMs) are augmented to become autonomous language agents capable of performing objective oriented multi-step tasks on their own, rather than merely responding to queries from human users. Most existing language agents, however, are not optimized using environment-specific rewards. Although some agents enable iterative refinement through verbal feedback, they do not reason and plan in ways that are compatible with gradient-based learning from rewards. This paper introduces a principled framework for reinforcing large language agents by learning a retrospective model, which automatically tunes the language agent prompts from environment feedback through policy gradient. Specifically, our proposed agent architecture learns from rewards across multiple environments and tasks, for fine-tuning a pre-trained language model which refines the language agent prompt by summarizing the root cause of prior failed attempts and proposing action plans. Experimental results on various tasks demonstrate that the language agents improve over time and that our approach considerably outperforms baselines that do not properly leverage gradients from the environment. This demonstrates that using policy gradient optimization to improve language agents, for which we believe our work is one of the first, seems promising and can be applied to optimize other models in the agent architecture to enhance agent performances over time.
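At its core, reinforcing the retrospective model with policy gradient can be pictured as a REINFORCE-style update: reflections whose resulting episodes earn higher reward have their token log-probabilities pushed up. The sketch below is a toy illustration of that idea, not the paper's exact objective, and the token scores are placeholders.

```python
import torch
import torch.nn.functional as F

def policy_gradient_loss(log_probs, reward, baseline=0.0):
    """REINFORCE: scale the reflection's log-likelihood by its advantage."""
    return -(reward - baseline) * log_probs.sum()

token_scores = torch.randn(12, requires_grad=True)  # toy 12-token reflection
log_probs = F.log_softmax(token_scores, dim=0)
loss = policy_gradient_loss(log_probs, reward=1.0, baseline=0.4)
loss.backward()   # higher-reward reflections become more likely
```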
Event-based Dynamic Graph Representation Learning for Patent Application Trend Prediction
results: Experiments on real-world data demonstrate the method's effectiveness under various experimental conditions; the method can also learn the semantics of classification codes and track the technology-developing trajectories of companies.
Abstract
Accurate prediction of the types of patents that companies will apply for in the next period of time can reveal their development strategies and help them discover potential partners or competitors in advance. Although important, this problem has been rarely studied in previous research due to the challenges in modelling companies' continuously evolving preferences and capturing the semantic correlations of classification codes. To fill in this gap, we propose an event-based dynamic graph learning framework for patent application trend prediction. In particular, our method is founded on the memorable representations of both companies and patent classification codes. When a new patent is observed, the representations of the related companies and classification codes are updated according to the historical memories and the currently encoded messages. Moreover, a hierarchical message passing mechanism is provided to capture the semantic proximities of patent classification codes by updating their representations along the hierarchical taxonomy. Finally, the patent application trend is predicted by aggregating the representations of the target company and classification codes from static, dynamic, and hierarchical perspectives. Experiments on real-world data demonstrate the effectiveness of our approach under various experimental conditions, and also reveal the abilities of our method in learning semantics of classification codes and tracking technology developing trajectories of companies.
Analysis and Optimization of Wireless Federated Learning with Data Heterogeneity
results: Experiments on real-world datasets show that the proposed algorithm outperforms other benchmarks in terms of learning accuracy and energy consumption.
Abstract
With the rapid proliferation of smart mobile devices, federated learning (FL) has been widely considered for application in wireless networks for distributed model training. However, data heterogeneity, e.g., non-independently identically distributions and different sizes of training data among clients, poses major challenges to wireless FL. Limited communication resources complicate the implementation of fair scheduling which is required for training on heterogeneous data, and further deteriorate the overall performance. To address this issue, this paper focuses on performance analysis and optimization for wireless FL, considering data heterogeneity, combined with wireless resource allocation. Specifically, we first develop a closed-form expression for an upper bound on the FL loss function, with a particular emphasis on data heterogeneity described by a dataset size vector and a data divergence vector. Then we formulate the loss function minimization problem, under constraints on long-term energy consumption and latency, and jointly optimize client scheduling, resource allocation, and the number of local training epochs (CRE). Next, via the Lyapunov drift technique, we transform the CRE optimization problem into a series of tractable problems. Extensive experiments on real-world datasets demonstrate that the proposed algorithm outperforms other benchmarks in terms of the learning accuracy and energy consumption.
Semantics-guided Transformer-based Sensor Fusion for Improved Waypoint Prediction
results: Extensive experiments in the CARLA simulator, validated on the Town05 Benchmark, show that the multi-task feature fusion built on the TransFuser base network improves the safety and completeness of road navigation for autonomous driving agents by significant margins.
Abstract
Sensor fusion approaches for intelligent self-driving agents remain key to driving scene understanding given visual global contexts acquired from input sensors. Specifically, for the local waypoint prediction task, single-modality networks are still limited by strong dependency on the sensitivity of the input sensor, and thus recent works promote the use of multiple sensors in fusion in feature level. While it is well known that multiple data modalities promote mutual contextual exchange, deployment to practical driving scenarios requires global 3D scene understanding in real-time with minimal computations, thus placing greater significance on training strategies given a limited number of practically usable sensors. In this light, we exploit carefully selected auxiliary tasks that are highly correlated with the target task of interest (e.g., traffic light recognition and semantic segmentation) by fusing auxiliary task features and also using auxiliary heads for waypoint prediction based on imitation learning. Our multi-task feature fusion augments and improves the base network, TransFuser, by significant margins for safer and more complete road navigation in CARLA simulator as validated on the Town05 Benchmark through extensive experiments.
results: The authors evaluate their approach on computer vision and natural language processing tasks using various models, datasets, and scenarios, and the results demonstrate that model DNA can accurately identify model provenance.
Abstract
Understanding the life cycle of the machine learning (ML) model is an intriguing area of research (e.g., understanding where the model comes from, how it is trained, and how it is used). This paper focuses on a novel problem within this field, namely Model Provenance (MP), which concerns the relationship between a target model and its pre-training model and aims to determine whether a source model serves as the provenance for a target model. This is an important problem that has significant implications for ensuring the security and intellectual property of machine learning models but has not received much attention in the literature. To fill in this gap, we introduce a novel concept of Model DNA which represents the unique characteristics of a machine learning model. We utilize a data-driven and model-driven representation learning method to encode the model's training data and input-output information as a compact and comprehensive representation (i.e., DNA) of the model. Using this model DNA, we develop an efficient framework for model provenance identification, which enables us to identify whether a source model is a pre-training model of a target model. We conduct evaluations on both computer vision and natural language processing tasks using various models, datasets, and scenarios to demonstrate the effectiveness of our approach in accurately identifying model provenance.
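One way to picture a model DNA, under heavy simplification, is a fingerprint built from a model's outputs on a fixed probe set, with provenance ranked by fingerprint similarity. The paper's representation is learned from training data and input-output information rather than the fixed probes and cosine similarity assumed here.

```python
import numpy as np

rng = np.random.default_rng(0)
probes = rng.normal(size=(64, 16))              # fixed, shared probe inputs

def model_dna(predict_fn):
    """Concatenate a model's outputs on the probe set into one vector."""
    return np.concatenate([np.atleast_1d(predict_fn(x)) for x in probes])

def provenance_score(dna_a, dna_b):
    return float(dna_a @ dna_b / (np.linalg.norm(dna_a) * np.linalg.norm(dna_b)))

W = rng.normal(size=(16, 4))                    # toy "source" model
W_ft = W + 0.01 * rng.normal(size=W.shape)      # "fine-tuned" from W
W_other = rng.normal(size=(16, 4))              # unrelated model
dna_ft = model_dna(lambda x: x @ W_ft)
print(provenance_score(dna_ft, model_dna(lambda x: x @ W)) >
      provenance_score(dna_ft, model_dna(lambda x: x @ W_other)))  # True
```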
Designing a Deep Learning-Driven Resource-Efficient Diagnostic System for Metastatic Breast Cancer: Reducing Long Delays of Clinical Diagnosis and Improving Patient Survival in Developing Countries
paper_authors: William Gao, Dayong Wang, Yi Huang
for: This study addresses the long delays in diagnosing metastatic breast cancer in developing countries, particularly in sub-Saharan Africa, South Asia, and South America, where such delays lead to low patient survival rates.
methods: The study uses deep learning to develop a breast cancer diagnostic system that achieves both high diagnostic accuracy and computational efficiency; it builds on the MobileNetV2 model and compares models on diagnostic accuracy, generalization, and training efficiency.
results: The results show that the MobileNetV2-based diagnostic model outperforms the more complex VGG16, ResNet50, and ResNet101 models in diagnostic accuracy, generalization, and training efficiency. Visual comparisons show that the MobileNetV2 models can identify very small cancerous nodes that are challenging for manual image analysis, and the lightweight models are computationally efficient enough to run on mobile devices or devices with low computational power.
Abstract
Breast cancer is one of the leading causes of cancer mortality. Breast cancer patients in developing countries, especially sub-Saharan Africa, South Asia, and South America, suffer from the highest mortality rate in the world. One crucial factor contributing to the global disparity in mortality rate is the long delay of diagnosis due to a severe shortage of trained pathologists, which has consequently led to a large proportion of late-stage presentations at diagnosis. The delay between the initial development of symptoms and the receipt of a diagnosis can stretch upwards of 15 months. To tackle this critical healthcare disparity, this research has developed a deep learning-based diagnosis system for metastatic breast cancer that can achieve high diagnostic accuracy as well as computational efficiency. Based on our evaluation, the MobileNetV2-based diagnostic model outperformed the more complex VGG16, ResNet50, and ResNet101 models in diagnostic accuracy, model generalization, and model training efficiency. Visual comparisons between the model predictions and ground truth demonstrate that the MobileNetV2 diagnostic models can identify very small cancerous nodes embedded in a large area of normal cells, which is challenging for manual image analysis. Equally important, the lightweight MobileNetV2 models are computationally efficient and ready for mobile devices or devices of low computational power. These advances empower the development of a resource-efficient and high-performing AI-based metastatic breast cancer diagnostic system that can adapt to under-resourced healthcare facilities in developing countries. This research provides an innovative technological solution to address the long delays in metastatic breast cancer diagnosis and the consequent disparity in patient survival outcomes in developing countries.
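For readers unfamiliar with the transfer-learning setup implied here, a minimal MobileNetV2 fine-tuning sketch in PyTorch/torchvision follows. The frozen encoder and the two-class head are assumptions; the paper's actual training pipeline and data are not reproduced.

```python
import torch.nn as nn
from torchvision import models

# Minimal transfer-learning sketch (illustrative, not the paper's recipe):
# start from ImageNet weights, freeze the feature encoder, and replace the
# classifier with a two-class head (normal vs. metastatic).
model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)
for p in model.features.parameters():
    p.requires_grad = False                              # freeze pretrained encoder
model.classifier[1] = nn.Linear(model.last_channel, 2)   # new lightweight head
```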
VQGraph: Graph Vector-Quantization for Bridging GNNs and MLPs
results: Experiments and analyses show that VQGraph achieves stronger performance, infers much faster than GNNs, and improves accuracy over both GNNs and stand-alone MLPs. Code: https://github.com/YangLing0818/VQGraph.
Abstract
Graph Neural Networks (GNNs) conduct message passing which aggregates local neighbors to update node representations. Such message passing leads to scalability issues in practical latency-constrained applications. To address this issue, recent methods adopt knowledge distillation (KD) to learn computationally-efficient multi-layer perceptron (MLP) by mimicking the output of GNN. However, the existing GNN representation space may not be expressive enough for representing diverse local structures of the underlying graph, which limits the knowledge transfer from GNN to MLP. Here we present a novel framework VQGraph to learn a powerful graph representation space for bridging GNNs and MLPs. We adopt the encoder of a variant of a vector-quantized variational autoencoder (VQ-VAE) as a structure-aware graph tokenizer, which explicitly represents the nodes of diverse local structures as numerous discrete tokens and constitutes a meaningful codebook. Equipped with the learned codebook, we propose a new token-based distillation objective based on soft token assignments to sufficiently transfer the structural knowledge from GNN to MLP. Extensive experiments and analyses demonstrate the strong performance of VQGraph, where we achieve new state-of-the-art performance on GNN-MLP distillation in both transductive and inductive settings across seven graph datasets. We show that VQGraph with better performance infers faster than GNNs by 828x, and also achieves accuracy improvement over GNNs and stand-alone MLPs by 3.90% and 28.05% on average, respectively. Code: https://github.com/YangLing0818/VQGraph.
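A small sketch of the token-based distillation idea, assuming squared-distance soft assignments to a codebook and a KL objective. The codebook size, embedding width, and temperature are placeholders rather than VQGraph's actual tokenizer.

```python
import torch
import torch.nn.functional as F

# Soft assignment of node embeddings to a codebook, and a KL distillation
# loss aligning the student (MLP) assignments to the teacher (GNN) ones.
def soft_assign(z, codebook, tau=0.5):
    d2 = torch.cdist(z, codebook).pow(2)       # squared distance to each token
    return F.softmax(-d2 / tau, dim=-1)        # soft token assignment

codebook = torch.randn(128, 64)                # 128 discrete tokens, 64-d codes
z_gnn, z_mlp = torch.randn(32, 64), torch.randn(32, 64)
p_teacher = soft_assign(z_gnn, codebook)
log_q_student = soft_assign(z_mlp, codebook).clamp_min(1e-9).log()
distill_loss = F.kl_div(log_q_student, p_teacher, reduction="batchmean")
```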
AdvFAS: A robust face anti-spoofing framework against adversarial examples
paper_authors: Jiawei Chen, Xiao Yang, Heng Yin, Mingzhi Ma, Bihui Chen, Jianteng Peng, Yandong Guo, Zhaoxia Yin, Hang Su
for: Ensuring the reliability of face recognition systems against presentation attacks.
methods: Leverages two coupled scores to accurately distinguish between correctly detected and wrongly detected face images.
results: Accurately detects attacks on face images across different attack types, datasets, and backbones, while maintaining high accuracy on clean images; the method is also successfully applied to detect real-world adversarial examples.
Abstract
Ensuring the reliability of face recognition systems against presentation attacks necessitates the deployment of face anti-spoofing techniques. Despite considerable advancements in this domain, the ability of even the most state-of-the-art methods to defend against adversarial examples remains elusive. While several adversarial defense strategies have been proposed, they typically suffer from constrained practicability due to inevitable trade-offs between universality, effectiveness, and efficiency. To overcome these challenges, we thoroughly delve into the coupled relationship between adversarial detection and face anti-spoofing. Based on this, we propose a robust face anti-spoofing framework, namely AdvFAS, that leverages two coupled scores to accurately distinguish between correctly detected and wrongly detected face images. Extensive experiments demonstrate the effectiveness of our framework in a variety of settings, including different attacks, datasets, and backbones, meanwhile enjoying high accuracy on clean examples. Moreover, we successfully apply the proposed method to detect real-world adversarial examples.
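The coupled-score idea can be pictured as a simple decision rule like the hypothetical one below; the thresholds and score semantics are assumptions, not AdvFAS's actual scoring functions.

```python
# Illustrative decision rule with two coupled scores: a spoof score and an
# adversarial-detection score. Thresholds are hypothetical placeholders.
def advfas_decide(s_spoof: float, s_adv: float,
                  t_spoof: float = 0.5, t_adv: float = 0.5) -> str:
    if s_adv > t_adv:
        return "reject: suspected adversarial example"
    return "spoof" if s_spoof > t_spoof else "live"

print(advfas_decide(s_spoof=0.2, s_adv=0.9))  # reject: suspected adversarial example
print(advfas_decide(s_spoof=0.1, s_adv=0.1))  # live
```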
N-gram Boosting: Improving Contextual Biasing with Normalized N-gram Targets
results: Achieves a 26% relative improvement in keyword recognition on a proprietary in-domain dataset and 2% on LibriSpeech.
Abstract
Accurate transcription of proper names and technical terms is particularly important in speech-to-text applications for business conversations. These words, which are essential to understanding the conversation, are often rare and therefore likely to be under-represented in text and audio training data, creating a significant challenge in this domain. We present a two-step keyword boosting mechanism that successfully works on normalized unigrams and n-grams rather than just single tokens, which eliminates missing hits issues with boosting raw targets. In addition, we show how adjusting the boosting weight logic avoids over-boosting multi-token keywords. This improves our keyword recognition rate by 26% relative on our proprietary in-domain dataset and 2% on LibriSpeech. This method is particularly useful on targets that involve non-alphabetic characters or have non-standard pronunciations.
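A toy sketch of the two-step idea, assuming lowercase/punctuation normalization and a per-keyword bonus scaled down by token count to avoid over-boosting multi-token keywords. The 1/len scaling is an illustrative choice, not the paper's exact weight logic.

```python
import re

def normalize(text: str) -> str:
    return re.sub(r"[^a-z0-9 ]+", "", text.lower()).strip()

def boost_score(hypothesis: str, keywords: list[str], base_boost: float = 2.0) -> float:
    """Add a bonus for each matched normalized n-gram; scale the per-keyword
    weight down with token count (an assumed rule) so multi-token keywords
    are not over-boosted."""
    hyp = normalize(hypothesis)
    bonus = 0.0
    for kw in keywords:
        kw_norm = normalize(kw)
        if kw_norm and kw_norm in hyp:
            bonus += base_boost / len(kw_norm.split())
    return bonus

print(boost_score("please call acme dot com support", ["Acme dot com", "support"]))
```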
Efficient Model Adaptation for Continual Learning at the Edge
paper_authors: Zachary A. Daniels, Jun Hu, Michael Lomnitz, Phil Miller, Aswin Raghavan, Joe Zhang, Michael Piacentino, David Zhang
For: This paper is written for those interested in non-stationary automated machine learning (AutoML) models for efficient continual learning under domain shifts.
Methods: The paper presents the Encoder-Adaptor-Reconfigurator (EAR) framework, which uses a fixed deep neural network (DNN) feature encoder and trains shallow networks on top of the encoder to handle novel data. The EAR framework combines DNNs with hyperdimensional computing (HDC) to detect when new data is out-of-distribution (OOD), and uses zero-shot neural architecture search (ZS-NAS) to identify low-parameter neural adaptors that adapt the model to the OOD data.
Results: The paper demonstrates strong performance compared to state-of-the-art algorithms for OOD detection and few-/zero-shot NAS on several benchmark datasets for domain adaptation. The EAR framework minimizes catastrophic forgetting on previous tasks by progressively growing the neural architecture as needed and dynamically routing data through the appropriate adaptors and reconfigurators for domain-incremental and class-incremental continual learning.
Abstract
Most machine learning (ML) systems assume stationary and matching data distributions during training and deployment. This is often a false assumption. When ML models are deployed on real devices, data distributions often shift over time due to changes in environmental factors, sensor characteristics, and task-of-interest. While it is possible to have a human-in-the-loop to monitor for distribution shifts and engineer new architectures in response to these shifts, such a setup is not cost-effective. Instead, non-stationary automated ML (AutoML) models are needed. This paper presents the Encoder-Adaptor-Reconfigurator (EAR) framework for efficient continual learning under domain shifts. The EAR framework uses a fixed deep neural network (DNN) feature encoder and trains shallow networks on top of the encoder to handle novel data. The EAR framework is capable of 1) detecting when new data is out-of-distribution (OOD) by combining DNNs with hyperdimensional computing (HDC), 2) identifying low-parameter neural adaptors to adapt the model to the OOD data using zero-shot neural architecture search (ZS-NAS), and 3) minimizing catastrophic forgetting on previous tasks by progressively growing the neural architecture as needed and dynamically routing data through the appropriate adaptors and reconfigurators for handling domain-incremental and class-incremental continual learning. We systematically evaluate our approach on several benchmark datasets for domain adaptation and demonstrate strong performance compared to state-of-the-art algorithms for OOD detection and few-/zero-shot NAS.
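A minimal sketch of the HDC-based OOD check described above: bundle bipolar hypervectors of training samples into a class prototype and flag inputs whose encodings have low similarity to it. The projection, bundling rule, and threshold are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000                                     # hypervector width (assumed)
proj = rng.normal(size=(D, 32))                # random projection to HD space

def encode(x):
    return np.sign(proj @ x)                   # bipolar hypervector

mu = rng.normal(size=32)                       # toy in-distribution center
train = mu + 0.3 * rng.normal(size=(200, 32))
prototype = np.sign(sum(encode(x) for x in train))  # bundled class prototype

def similarity(x):
    return float(encode(x) @ prototype) / D    # cosine-like similarity in {-1..1}

print(similarity(mu + 0.3 * rng.normal(size=32)))  # high: in-distribution
print(similarity(rng.normal(size=32)))             # near zero: likely OOD
```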
Disease Insight through Digital Biomarkers Developed by Remotely Collected Wearables and Smartphone Data
For: The paper explores the potential of digital biomarkers and remote patient monitoring to improve healthcare, particularly through long-term longitudinal data collection and large-scale remote monitoring studies.
Methods: The paper describes the development of RADAR-base, an open-source platform supporting scalability, extensibility, security, privacy, and data quality for remote data collection and digital phenotyping. The platform is built around Confluent's Apache Kafka and provides study design and set-up, active (e.g., patient-reported data) and passive remote data collection capabilities, and secure data transmission with scalable data storage and management.
Results: RADAR-base has successfully collected longitudinal data for cohorts across disease areas including Multiple Sclerosis, Depression, Epilepsy, ADHD, Alzheimer's disease, Autism, and lung diseases. Digital biomarkers developed from the collected data provide useful insights into different diseases, and clinicians can use them to augment decision-making for disease prevention, personalization, and early intervention.
Abstract
Digital biomarkers and remote patient monitoring can provide valuable and timely insights into how a patient is coping with their condition (disease progression, treatment response, etc.), complementing treatment in traditional healthcare settings. Smartphones with embedded and connected sensors have immense potential for improving healthcare through various apps and mHealth (mobile health) platforms. This capability could enable the development of reliable digital biomarkers from long-term longitudinal data collected remotely from patients. We built an open-source platform, RADAR-base, to support large-scale data collection in remote monitoring studies. RADAR-base is a modern remote data collection platform built around Confluent's Apache Kafka to support scalability, extensibility, security, privacy, and quality of data. It provides support for study design and set-up, and active (e.g., PROMs) and passive (e.g., phone sensors, wearable devices, and IoT) remote data collection capabilities with feature generation (e.g., behavioural, environmental, and physiological markers). The backend enables secure data transmission and scalable solutions for data storage, management, and data access. The platform has successfully collected longitudinal data for various cohorts in a number of disease areas including Multiple Sclerosis, Depression, Epilepsy, ADHD, Alzheimer's disease, Autism, and lung diseases. Digital biomarkers developed through the collected data are providing useful insights into different diseases. RADAR-base offers a modern open-source, community-driven solution for remote monitoring, data collection, and digital phenotyping of physical and mental health diseases. Clinicians can use digital biomarkers to augment their decision making for the prevention, personalisation, and early intervention of disease.
Mitigating Task Interference in Multi-Task Learning via Explicit Task Routing with Non-Learnable Primitives
methods: Uses a synergistic combination of non-learnable primitives (NLPs) and explicit task routing (ETR) to mitigate task interference. Specifically, non-learnable primitives extract a diverse set of task-agnostic features, which are then recombined into a branch shared by all tasks and task-specific branches reserved for each task.
results: ETR-NLP networks achieve state-of-the-art performance across multiple datasets while requiring fewer learnable parameters and similar computation. Code: https://github.com/zhichao-lu/etr-nlp-mtl.
Abstract
Multi-task learning (MTL) seeks to learn a single model to accomplish multiple tasks by leveraging shared information among the tasks. Existing MTL models, however, have been known to suffer from negative interference among tasks. Efforts to mitigate task interference have focused on either loss/gradient balancing or implicit parameter partitioning with partial overlaps among the tasks. In this paper, we propose ETR-NLP to mitigate task interference through a synergistic combination of non-learnable primitives (NLPs) and explicit task routing (ETR). Our key idea is to employ non-learnable primitives to extract a diverse set of task-agnostic features and recombine them into a shared branch common to all tasks and explicit task-specific branches reserved for each task. The non-learnable primitives and the explicit decoupling of learnable parameters into shared and task-specific ones afford the flexibility needed for minimizing task interference. We evaluate the efficacy of ETR-NLP networks for both image-level classification and pixel-level dense prediction MTL problems. Experimental results indicate that ETR-NLP significantly outperforms state-of-the-art baselines with fewer learnable parameters and similar FLOPs across all datasets. Code is available at this \href{https://github.com/zhichao-lu/etr-nlp-mtl}.
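The sketch below illustrates the two ingredients under stated assumptions: a frozen convolution stands in for the non-learnable primitives, and features are routed through one shared branch plus per-task branches. The primitive design and branch widths are invented, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Illustrative ETR-style block: frozen "primitive" filters feed a shared
# learnable branch and explicit task-specific branches.
class ETRBlock(nn.Module):
    def __init__(self, n_tasks=2, c_in=3, c_prim=32):
        super().__init__()
        self.prim = nn.Conv2d(c_in, c_prim, 3, padding=1)
        for p in self.prim.parameters():
            p.requires_grad = False               # primitives stay frozen
        self.shared = nn.Conv2d(c_prim, 16, 1)    # learnable, shared by all tasks
        self.task = nn.ModuleList(nn.Conv2d(c_prim, 16, 1) for _ in range(n_tasks))

    def forward(self, x, task_id):
        f = torch.relu(self.prim(x))
        return torch.cat([self.shared(f), self.task[task_id](f)], dim=1)

y = ETRBlock()(torch.randn(2, 3, 32, 32), task_id=0)   # shape (2, 32, 32, 32)
```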
On the Biometric Capacity of Generative Face Models
paper_authors: Vishnu Naresh Boddeti, Gautam Sreekumar, Arun Ross
for: This paper tackles a key question: given a generative face model, how many unique identities can it generate?
methods: Proposes a statistical approach that estimates the biometric capacity of generated face images in a hyperspherical feature space.
results: Capacity estimates differ markedly across generative models. At a false acceptance rate (FAR) of 0.1%, the estimated capacity upper bounds of StyleGAN3 and DCFace are 1.43x10^6 and 1.190x10^4, respectively. Capacity drops sharply as the desired FAR is lowered, and for some generative models there is an appreciable disparity in capacity with respect to age.
Abstract
There has been tremendous progress in generating realistic faces with high fidelity over the past few years. Despite this progress, a crucial question remains unanswered: "Given a generative face model, how many unique identities can it generate?" In other words, what is the biometric capacity of the generative face model? A scientific basis for answering this question will benefit evaluating and comparing different generative face models and establish an upper bound on their scalability. This paper proposes a statistical approach to estimate the biometric capacity of generated face images in a hyperspherical feature space. We employ our approach on multiple generative models, including unconditional generators like StyleGAN, Latent Diffusion Model, and "Generated Photos," as well as DCFace, a class-conditional generator. We also estimate capacity w.r.t. demographic attributes such as gender and age. Our capacity estimates indicate that (a) under ArcFace representation at a false acceptance rate (FAR) of 0.1%, StyleGAN3 and DCFace have a capacity upper bound of $1.43\times10^6$ and $1.190\times10^4$, respectively; (b) the capacity reduces drastically as we lower the desired FAR with an estimate of $1.796\times10^4$ and $562$ at FAR of 1% and 10%, respectively, for StyleGAN3; (c) there is no discernible disparity in the capacity w.r.t gender; and (d) for some generative models, there is an appreciable disparity in the capacity w.r.t age. Code is available at https://github.com/human-analysis/capacity-generative-face-models.
Accurate Neural Network Pruning Requires Rethinking Sparse Optimization
results: The study achieves state-of-the-art results in the high-sparsity regime and provides a detailed analysis of the difficulty of sparse training.
Abstract
Obtaining versions of deep neural networks that are both highly-accurate and highly-sparse is one of the main challenges in the area of model compression, and several high-performance pruning techniques have been investigated by the community. Yet, much less is known about the interaction between sparsity and the standard stochastic optimization techniques used for training sparse networks, and most existing work uses standard dense schedules and hyperparameters for training sparse networks. In this work, we examine the impact of high sparsity on model training using the standard computer vision and natural language processing sparsity benchmarks. We begin by showing that using standard dense training recipes for sparse training is suboptimal, and results in under-training. We provide new approaches for mitigating this issue for both sparse pre-training of vision models (e.g. ResNet50/ImageNet) and sparse fine-tuning of language models (e.g. BERT/GLUE), achieving state-of-the-art results in both settings in the high-sparsity regime, and providing detailed analyses for the difficulty of sparse training in both scenarios. Our work sets a new threshold in terms of the accuracies that can be achieved under high sparsity, and should inspire further research into improving sparse model training, to reach higher accuracies under high sparsity, but also to do so efficiently.
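For concreteness, a minimal magnitude-pruning sketch follows; it shows the kind of fixed-sparsity mask the benchmarks above rely on, not the paper's specific training recipes.

```python
import torch

# Minimal magnitude-pruning sketch: keep the top (1 - sparsity) fraction of
# weights by magnitude; the mask is reapplied after every optimizer step so
# the network trains under a fixed sparsity budget.
def magnitude_mask(w: torch.Tensor, sparsity: float) -> torch.Tensor:
    k = int(w.numel() * sparsity)
    if k == 0:
        return torch.ones_like(w)
    threshold = w.abs().flatten().kthvalue(k).values
    return (w.abs() > threshold).float()

w = torch.randn(256, 256)
mask = magnitude_mask(w, sparsity=0.95)
w_sparse = w * mask
print(f"actual sparsity: {1 - mask.mean().item():.3f}")
```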
Incorporating Recklessness to Collaborative Filtering based Recommender Systems
results: Experimental results show that recklessness not only enables control of the risk level but also improves the quantity and quality of the system's predictions.
Abstract
Recommender systems that include some reliability measure of their predictions tend to be more conservative in forecasting, due to their constraint to preserve reliability. This leads to a significant drop in the coverage and novelty that these systems can provide. In this paper, we propose the inclusion of a new term in the learning process of matrix factorization-based recommender systems, called recklessness, which enables the control of the risk level desired when making decisions about the reliability of a prediction. Experimental results demonstrate that recklessness not only allows for risk regulation but also improves the quantity and quality of predictions provided by the recommender system.
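The paper's exact formulation is not given here, so the sketch below only illustrates the trade-off at the level of a reliability threshold: raising a "recklessness" parameter lowers the reliability bar, increasing coverage at the cost of risk. All quantities are toys.

```python
import numpy as np

# Illustrative sketch (not the paper's formulation): a reliability-aware
# recommender abstains when its confidence is low; a recklessness parameter
# lowers the reliability bar, trading risk for coverage and novelty.
rng = np.random.default_rng(0)
pred = rng.uniform(1, 5, size=1000)            # toy predicted ratings
reliability = rng.uniform(0, 1, size=1000)     # toy per-prediction reliability

def recommend(threshold=0.8, recklessness=0.0):
    keep = reliability >= threshold * (1.0 - recklessness)
    return pred[keep]

for r in (0.0, 0.3, 0.6):
    print(f"recklessness={r}: coverage={recommend(recklessness=r).size / pred.size:.2f}")
```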
The Unequal Opportunities of Large Language Models: Revealing Demographic Bias through Job Recommendations
paper_authors: Abel Salinas, Parth Vipul Shah, Yuzhong Huang, Robert McCormack, Fred Morstatter
for: This paper aims to analyze and compare demographic biases in two cutting-edge large language models (LLMs), ChatGPT and LLaMA, through the lens of job recommendations.
methods: The authors propose a simple method for measuring intersectional biases in LLMs, which can be extended to examine biases associated with any intersection of demographic identities.
results: The study finds distinct biases in both models toward various demographic identities, such as consistently suggesting low-paying jobs for Mexican workers or preferring to recommend secretarial roles to women. These results highlight the importance of measuring the bias of LLMs in downstream applications to understand the potential for harm and inequitable outcomes.
Abstract
Large Language Models (LLMs) have seen widespread deployment in various real-world applications. Understanding these biases is crucial to comprehend the potential downstream consequences when using LLMs to make decisions, particularly for historically disadvantaged groups. In this work, we propose a simple method for analyzing and comparing demographic bias in LLMs, through the lens of job recommendations. We demonstrate the effectiveness of our method by measuring intersectional biases within ChatGPT and LLaMA, two cutting-edge LLMs. Our experiments primarily focus on uncovering gender identity and nationality bias; however, our method can be extended to examine biases associated with any intersection of demographic identities. We identify distinct biases in both models toward various demographic identities, such as both models consistently suggesting low-paying jobs for Mexican workers or preferring to recommend secretarial roles to women. Our study highlights the importance of measuring the bias of LLMs in downstream applications to understand the potential for harm and inequitable outcomes.
SMARLA: A Safety Monitoring Approach for Deep Reinforcement Learning Agents
paper_authors: Amirhossein Zolfagharian, Manel Abdellatif, Lionel C. Briand, Ramesh S
for: Addresses the safety of deep reinforcement learning algorithms deployed in safety-critical systems.
methods: Proposes SMARLA, a machine learning-based safety monitoring approach for detecting safety violations by DRL agents. SMARLA is designed as a black box (it requires no access to the agent's internals) and uses state abstraction to reduce the state space, making it easier to learn safety-violation prediction models from the agent's states.
results: Empirical analysis shows that SMARLA predicts safety violations accurately with a low false positive rate, roughly halfway through an agent's execution and well before violations occur.
Abstract
Deep reinforcement learning algorithms (DRL) are increasingly being used in safety-critical systems. Ensuring the safety of DRL agents is a critical concern in such contexts. However, relying solely on testing is not sufficient to ensure safety as it does not offer guarantees. Building safety monitors is one solution to alleviate this challenge. This paper proposes SMARLA, a machine learning-based safety monitoring approach designed for DRL agents. For practical reasons, SMARLA is designed to be black-box (as it does not require access to the internals of the agent) and leverages state abstraction to reduce the state space and thus facilitate the learning of safety violation prediction models from agent's states. We validated SMARLA on two well-known RL case studies. Empirical analysis reveals that SMARLA achieves accurate violation prediction with a low false positive rate, and can predict safety violations at an early stage, approximately halfway through the agent's execution before violations occur.
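A minimal sketch of the monitoring idea under stated assumptions: abstract raw states by coarse binning, then fit a classifier that predicts, from mid-episode state snapshots, whether a violation will occur. The features, labels, and bin width are toys, not SMARLA's actual abstraction or learner.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def abstract_state(s, bin_width=0.5):
    return np.floor(s / bin_width)              # state abstraction via binning

episodes = rng.normal(size=(500, 8))            # toy mid-episode state snapshots
labels = (episodes.sum(axis=1) > 2).astype(int) # toy "violation occurred" outcome
monitor = LogisticRegression(max_iter=1000).fit(abstract_state(episodes), labels)

risk = monitor.predict_proba(abstract_state(rng.normal(size=(1, 8))))[0, 1]
print(f"predicted violation probability: {risk:.2f}")
```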
Evaluation of STT-MRAM as a Scratchpad for Training in ML Accelerators
results: The study shows that using STT-MRAM as a scratchpad in DNN training accelerators provides 15-22x system-level energy reduction without compromising training accuracy. It also shows that write voltage and duration can be reduced during training to lower the energy of MRAM write operations.
Abstract
Progress in artificial intelligence and machine learning over the past decade has been driven by the ability to train larger deep neural networks (DNNs), leading to a compute demand that far exceeds the growth in hardware performance afforded by Moore's law. Training DNNs is an extremely memory-intensive process, requiring not just the model weights but also activations and gradients for an entire minibatch to be stored. The need to provide high-density and low-leakage on-chip memory motivates the exploration of emerging non-volatile memory for training accelerators. Spin-Transfer-Torque MRAM (STT-MRAM) offers several desirable properties for training accelerators, including 3-4x higher density than SRAM, significantly reduced leakage power, high endurance, and reasonable access time. However, MRAM write operations require high write energy and latency due to the need to ensure reliable switching. In this study, we perform a comprehensive device-to-system evaluation and co-optimization of STT-MRAM for efficient ML training accelerator design. We devised a cross-layer simulation framework to evaluate the effectiveness of STT-MRAM as a scratchpad replacing SRAM in a systolic-array-based DNN accelerator. To address the inefficiency of writes in STT-MRAM, we propose to reduce write voltage and duration. To evaluate the ensuing accuracy-efficiency trade-off, we conduct a thorough analysis of the error tolerance of input activations, weights, and errors during the training. We propose heterogeneous memory configurations that enable training convergence with good accuracy. We show that MRAM provides up to 15-22x improvement in system-level energy across a suite of DNN benchmarks under iso-capacity and iso-area scenarios. Further optimizing STT-MRAM write operations can provide over 2x improvement in write energy for minimal degradation in application-level training accuracy.
On the Transition from Neural Representation to Symbolic Knowledge
results: Experiments demonstrate that the learned representation enables interpretable decomposition of visual input and adapts smoothly to downstream tasks such as segmentation and reasoning. In addition, RL is used to further tune the learned prototypes to capture subjective factors.
Abstract
Bridging the huge disparity between neural and symbolic representations can potentially enable the incorporation of symbolic thinking into neural networks in essence. Motivated by how humans gradually build complex symbolic representations from prototype symbols learned through perception and environmental interaction, we propose a Neural-Symbolic Transitional Dictionary Learning (TDL) framework that employs an EM algorithm to learn a transitional representation of data that compresses the high-dimensional information of the visual parts of an input into a set of tensors as neural variables and discovers the implicit predicate structure in a self-supervised way. We implement the framework with a diffusion model by regarding the decomposition of the input as a cooperative game, then learn predicates by prototype clustering. We additionally use RL, enabled by the Markovian property of diffusion models, to further tune the learned prototypes by incorporating subjective factors. Extensive experiments on three abstract compositional visual object datasets, which require the model to segment parts without any visual features like texture, color, or shadows apart from shape, and three neural/symbolic downstream tasks demonstrate that the learned representation enables interpretable decomposition of visual input and smooth adaptation to downstream tasks that are not attainable by existing methods.
Domain specificity and data efficiency in typo tolerant spell checkers: the case of search in online marketplaces
paper_authors: Dayananda Ubrangala, Juhi Sharma, Ravi Prasad Kondapalli, Kiran R, Amit Agarwala, Laurent Boué
for: This work aims to improve search in online marketplaces, helping users find products despite misspelled queries.
methods: Uses data augmentation to compensate for the lack of annotated typo data and trains a recurrent neural network to learn context-limited, domain-specific embeddings.
results: The method improves search accuracy and runs in real time, and shows that a small amount of high-quality synthetic data can stand in for large annotated datasets.
Abstract
Typographical errors are a major source of frustration for visitors of online marketplaces. Because of the domain-specific nature of these marketplaces and the very short queries users tend to search for, traditional spell checking solutions do not perform well in correcting typos. We present a data augmentation method to address the lack of annotated typo data and train a recurrent neural network to learn context-limited domain-specific embeddings. Those embeddings are deployed in a real-time inferencing API for the Microsoft AppSource marketplace to find the closest match between a misspelled user query and the available product names. Our data efficient solution shows that controlled high quality synthetic data may be a powerful tool especially considering the current climate of large language models which rely on prohibitively huge and often uncontrolled datasets.
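A toy sketch of the kind of synthetic-typo augmentation the paper relies on; the specific edit operations and their uniform selection are assumptions for illustration.

```python
import random

random.seed(0)
KEYS = "abcdefghijklmnopqrstuvwxyz"

def make_typo(word: str) -> str:
    """Generate one synthetic typo via deletion, adjacent swap, or
    substitution. The operations and uniform choice are illustrative."""
    if len(word) < 2:
        return word
    i = random.randrange(len(word) - 1)
    op = random.choice(["delete", "swap", "substitute"])
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "swap":
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    return word[:i] + random.choice(KEYS) + word[i + 1:]

# Each (typo, product name) pair becomes a training example for the model.
print([(make_typo("dynamics"), "dynamics") for _ in range(3)])
```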
SpaDen : Sparse and Dense Keypoint Estimation for Real-World Chart Understanding
results: Learns keypoint embeddings through deep metric learning and a self-attention-based feature-fusion layer, uses the embeddings to segment the chart plot area into distinct objects, and obtains data series names by matching chart components to the legend.
Abstract
We introduce a novel bottom-up approach for the extraction of chart data. Our model utilizes images of charts as inputs and learns to detect keypoints (KP), which are used to reconstruct the components within the plot area. Our novelty lies in detecting a fusion of continuous and discrete KP as predicted heatmaps. A combination of sparse and dense per-pixel objectives coupled with a uni-modal self-attention-based feature-fusion layer is applied to learn KP embeddings. Further leveraging deep metric learning for unsupervised clustering, allows us to segment the chart plot area into various objects. By further matching the chart components to the legend, we are able to obtain the data series names. A post-processing threshold is applied to the KP embeddings to refine the object reconstructions and improve accuracy. Our extensive experiments include an evaluation of different modules for KP estimation and the combination of deep layer aggregation and corner pooling approaches. The results of our experiments provide extensive evaluation for the task of real-world chart data extraction.
Reasoning in Large Language Models Through Symbolic Math Word Problems
results: The study finds that self-prompting leads the LLM to produce more concise and verifiable reasoning and improves symbolic accuracy.
Abstract
Large language models (LLMs) have revolutionized NLP by solving downstream tasks with little to no labeled data. Despite their versatile abilities, the larger question of their ability to reason remains ill-understood. This paper addresses reasoning in math word problems (MWPs) by studying symbolic versions of the numeric problems, since a symbolic expression is a "concise explanation" of the numeric answer. We create and use a symbolic version of the SVAMP dataset and find that GPT-3's davinci-002 model also has good zero-shot accuracy on symbolic MWPs. To evaluate the faithfulness of the model's reasoning, we go beyond accuracy and additionally evaluate the alignment between the final answer and the outputted reasoning, which correspond to numeric and symbolic answers respectively for MWPs. We explore a self-prompting approach to encourage the symbolic reasoning to align with the numeric answer, thus equipping the LLM with the ability to provide a concise and verifiable reasoning and making it more interpretable. Surprisingly, self-prompting also improves the symbolic accuracy to be higher than both the numeric and symbolic accuracies, thus providing an ensembling effect. The SVAMP_Sym dataset will be released for future research on symbolic math problems.
Revisiting Deformable Convolution for Depth Completion
paper_authors: Xinglong Sun, Jean Ponce, Yu-Xiong Wang
for: Generating high-quality dense depth maps from sparse depth maps, a problem that has attracted increasing attention in recent years.
methods: Proposes an effective architecture that uses deformable kernel convolution as a single-pass refinement module, and empirically demonstrates its superiority.
results: Evaluated on the large-scale KITTI dataset, the model achieves state-of-the-art accuracy and inference speed.
Abstract
Depth completion, which aims to generate high-quality dense depth maps from sparse depth maps, has attracted increasing attention in recent years. Previous work usually employs RGB images as guidance, and introduces iterative spatial propagation to refine estimated coarse depth maps. However, most of the propagation refinement methods require several iterations and suffer from a fixed receptive field, which may contain irrelevant and useless information with very sparse input. In this paper, we address these two challenges simultaneously by revisiting the idea of deformable convolution. We propose an effective architecture that leverages deformable kernel convolution as a single-pass refinement module, and empirically demonstrate its superiority. To better understand the function of deformable convolution and exploit it for depth completion, we further systematically investigate a variety of representative strategies. Our study reveals that, different from prior work, deformable convolution needs to be applied on an estimated depth map with a relatively high density for better performance. We evaluate our model on the large-scale KITTI dataset and achieve state-of-the-art level performance in both accuracy and inference speed. Our code is available at https://github.com/AlexSunNik/ReDC.
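A minimal single-pass refinement sketch using torchvision's DeformConv2d, where the sampling offsets are predicted from the coarse depth map itself. The channel counts and the offset predictor are assumptions rather than the paper's architecture; see the authors' repository for the real model.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

# Illustrative single-pass refinement: a small conv predicts per-position
# sampling offsets (2 * k * k channels), and a deformable convolution then
# refines the coarse depth map in one shot.
class DeformRefine(nn.Module):
    def __init__(self, k=3):
        super().__init__()
        self.offset = nn.Conv2d(1, 2 * k * k, k, padding=k // 2)
        self.deform = DeformConv2d(1, 1, k, padding=k // 2)

    def forward(self, coarse_depth):
        return self.deform(coarse_depth, self.offset(coarse_depth))

refined = DeformRefine()(torch.rand(1, 1, 64, 64))   # same spatial size out
```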
How many preprints have actually been printed and why: a case study of computer science preprints on arXiv
methods: The paper uses traditional fuzzy matching together with a Bidirectional Encoder Representations from Transformers (BERT) semantics-based mapping method to link preprints with their final published versions.
results: 66% of the sampled preprints were eventually published under unchanged titles, while 11% were published under different titles and with other modifications. Further analysis shows that published preprints feature adequate revisions, multiple authors, detailed abstracts and introductions, extensive and authoritative references, and available source code.
Abstract
Preprints play an increasingly critical role in academic communities. There are many reasons driving researchers to post their manuscripts to preprint servers before formal submission to journals or conferences, but the use of preprints has also sparked considerable controversy, especially surrounding the claim of priority. In this paper, a case study of computer science preprints submitted to arXiv from 2008 to 2017 is conducted to quantify how many preprints have eventually been printed in peer-reviewed venues. Among those published manuscripts, some are published under different titles and without an update to their preprints on arXiv. In the case of these manuscripts, the traditional fuzzy matching method is incapable of mapping the preprint to the final published version. In view of this issue, we introduce a semantics-based mapping method with the employment of Bidirectional Encoder Representations from Transformers (BERT). With this new mapping method and a plurality of data sources, we find that 66% of all sampled preprints are published under unchanged titles and 11% are published under different titles and with other modifications. A further analysis was then performed to investigate why these preprints but not others were accepted for publication. Our comparison reveals that in the field of computer science, published preprints feature adequate revisions, multiple authorship, detailed abstract and introduction, extensive and authoritative references and available source code.
Improving Replay Sample Selection and Storage for Less Forgetting in Continual Learning
paper_authors: Daniel Brignac, Niels Lobo, Abhijit Mahalanobis
for: Addresses two open issues in continual learning, namely selectively storing the most informative samples and determining the optimal number of stored samples, so that deep learners can train on a series of tasks of unknown length without catastrophic forgetting.
methods: Provides a novel comparison of the commonly used reservoir sampling against various alternative population strategies, together with a detailed analysis of how to find the optimal number of stored samples.
results: The proposed approach to replay sample selection and storage sizing is shown experimentally to improve performance across varied task sequences.
Abstract
Continual learning seeks to enable deep learners to train on a series of tasks of unknown length without suffering from the catastrophic forgetting of previous tasks. One effective solution is replay, which involves storing few previous experiences in memory and replaying them when learning the current task. However, there is still room for improvement when it comes to selecting the most informative samples for storage and determining the optimal number of samples to be stored. This study aims to address these issues with a novel comparison of the commonly used reservoir sampling to various alternative population strategies and providing a novel detailed analysis of how to find the optimal number of stored samples.
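For reference, the reservoir-sampling baseline that the paper compares against can be written in a few lines; the capacity below is arbitrary.

```python
import random

class ReservoirBuffer:
    """Classic reservoir sampling: every sample seen so far has equal
    probability of being retained, regardless of stream length."""
    def __init__(self, capacity: int):
        self.capacity, self.seen, self.data = capacity, 0, []

    def add(self, sample):
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append(sample)
        else:
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.data[j] = sample        # replace a stored sample

buf = ReservoirBuffer(capacity=100)
for x in range(10_000):
    buf.add(x)                               # buffer stays a uniform sample
```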
Thespian: Multi-Character Text Role-Playing Game Agents
results: The paper shows that, compared to the state-of-the-art agent framework, the thespian agent performs better in multi-character learning and few-shot learning.
Abstract
Text-adventure games and text role-playing games are grand challenges for reinforcement learning game playing agents. Text role-playing games are open-ended environments where an agent must faithfully play a particular character. We consider the distinction between characters and actors, where an actor agent has the ability to play multiple characters. We present a framework we call a thespian agent that can learn to emulate multiple characters along with a soft prompt that can be used to direct it as to which character to play at any time. We further describe an attention mechanism that allows the agent to learn new characters that are based on previously learned characters in a few-shot fashion. We show that our agent outperforms the state of the art agent framework in multi-character learning and few-shot learning.
ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation
results: All existing LLMs perform much worse on class-level code generation than on method-level benchmarks such as HumanEval, and method-level ability does not equivalently reflect class-level ability. GPT-4 and GPT-3.5 retain a clear advantage, while the other models trail behind. The holistic generation strategy works well only for GPT-4 and GPT-3.5; method-by-method generation is the better strategy for the remaining models.
Abstract
In this work, we make the first attempt to evaluate LLMs in a more challenging code generation scenario, i.e., class-level code generation. We first manually construct ClassEval, the first class-level code generation benchmark, consisting of 100 class-level Python code generation tasks built with approximately 500 person-hours. Based on it, we then perform the first study of 11 state-of-the-art LLMs on class-level code generation. From our results, we draw the following main findings. First, all existing LLMs show much worse performance on class-level code generation than on standalone method-level code generation benchmarks like HumanEval, and method-level coding ability does not equivalently reflect class-level coding ability among LLMs. Second, GPT-4 and GPT-3.5 still exhibit dominant superiority over the other LLMs on class-level code generation, and the second tier includes Instruct-Starcoder, Instruct-Codegen, and Wizardcoder, with very similar performance. Third, generating the entire class all at once (i.e., the holistic generation strategy) is the best generation strategy only for GPT-4 and GPT-3.5, while method-by-method generation (i.e., incremental and compositional) is the better strategy for the other models, which have limited ability to understand long instructions and utilize intermediate information. Lastly, we find that models have limited ability to generate method-dependent code, and we discuss the frequent error types in generated classes. Our benchmark is available at https://github.com/FudanSELab/ClassEval.
Synthesizing Long-Term Human Motions with Diffusion Models via Coherent Sampling
results: Generates compositional and coherent long-term 3D human motions controlled by a user-instructed long text stream.
Abstract
Text-to-motion generation has gained increasing attention, but most existing methods are limited to generating short-term motions that correspond to a single sentence describing a single action. However, when a text stream describes a sequence of continuous motions, the generated motions corresponding to each sentence may not be coherently linked. Existing long-term motion generation methods face two main issues. Firstly, they cannot directly generate coherent motions and require additional operations such as interpolation to process the generated actions. Secondly, they generate subsequent actions in an autoregressive manner without considering the influence of future actions on previous ones. To address these issues, we propose a novel approach that utilizes a past-conditioned diffusion model with two optional coherent sampling methods: Past Inpainting Sampling and Compositional Transition Sampling. Past Inpainting Sampling completes subsequent motions by treating previous motions as conditions, while Compositional Transition Sampling models the distribution of the transition as the composition of two adjacent motions guided by different text prompts. Our experimental results demonstrate that our proposed method is capable of generating compositional and coherent long-term 3D human motions controlled by a user-instructed long text stream. The code is available at \href{https://github.com/yangzhao1230/PCMDM}{https://github.com/yangzhao1230/PCMDM}.
results: The study generates and evaluates adversarial inputs across a range of machine learning tasks and input representations, and demonstrates the importance of generating adversarial examples.
Abstract
Machine learning models are known to be vulnerable to adversarial evasion attacks as illustrated by image classification models. Thoroughly understanding such attacks is critical in order to ensure the safety and robustness of critical AI tasks. However, most evasion attacks are difficult to deploy against a majority of AI systems because they have focused on image domain with only few constraints. An image is composed of homogeneous, numerical, continuous, and independent features, unlike many other input types to AI systems used in practice. Furthermore, some input types include additional semantic and functional constraints that must be observed to generate realistic adversarial inputs. In this work, we propose a new framework to enable the generation of adversarial inputs irrespective of the input type and task domain. Given an input and a set of pre-defined input transformations, our framework discovers a sequence of transformations that result in a semantically correct and functional adversarial input. We demonstrate the generality of our approach on several diverse machine learning tasks with various input representations. We also show the importance of generating adversarial examples as they enable the deployment of mitigation techniques.
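A skeleton of the transformation-search loop implied above. The predict function, score function, and transformation set are user-supplied, and the greedy strategy is an illustrative assumption rather than the paper's exact algorithm.

```python
def find_adversarial(x, predict, score, transforms, target, max_steps=5):
    """Greedily search for a sequence of semantics-preserving transformations
    that flips the model's prediction to `target`. Illustrative skeleton only."""
    applied = []
    for _ in range(max_steps):
        if predict(x) == target:
            return x, applied
        # pick the transformation that moves the score most toward the target
        x, t = max(((t(x), t) for t in transforms),
                   key=lambda cand: score(cand[0], target))
        applied.append(t.__name__)
    return None, applied

# Toy demo: integers classified by sign; transformations preserve integerness.
inc = lambda v: v + 1; dec = lambda v: v - 1
inc.__name__, dec.__name__ = "inc", "dec"
x_adv, seq = find_adversarial(-2, predict=lambda v: v > 0,
                              score=lambda v, t: v, transforms=[inc, dec],
                              target=True)
print(x_adv, seq)   # 1 ['inc', 'inc', 'inc']
```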
results: Augmenting the translation training data with generated dialogues improves the accuracy of translating user requests into dataflow expressions.
Abstract
We demonstrate task-oriented dialogue generation within the dataflow dialogue paradigm. We show an example of agenda driven dialogue generation for the MultiWOZ domain, and an example of generation without an agenda for the SMCalFlow domain, where we show an improvement in the accuracy of the translation of user requests to dataflow expressions when the generated dialogues are used to augment the translation training dataset.
Redundancy Aware Multi-Reference Based Gainwise Evaluation of Extractive Summarization
paper_authors: Mousumi Akter, Shubhra Kanti Karmaker Santu
for: This paper aims to address the limitations of the existing automated evaluation metric ROUGE, which lacks semantic awareness and does not consider the ranking quality of the summarizer.
methods: The authors propose a new metric called Sem-nCG, which is both rank and semantic aware, and demonstrate how it can be used to evaluate model summaries against multiple references. They also explore different ways of incorporating redundancy into the original metric through extensive experiments.
results: The authors show that the new redundancy-aware metric exhibits a higher correlation with human judgments than the original Sem-nCG metric for both single and multiple reference scenarios.
Abstract
While very popular for evaluating extractive summarization task, the ROUGE metric has long been criticized for its lack of semantic awareness and its ignorance about the ranking quality of the summarizer. Thanks to previous research that has addressed these issues by proposing a gain-based automated metric called Sem-nCG, which is both rank and semantic aware. However, Sem-nCG does not consider the amount of redundancy present in a model-generated summary and currently does not support evaluation with multiple reference summaries. Unfortunately, addressing both these limitations simultaneously is not trivial. Therefore, in this paper, we propose a redundancy-aware Sem-nCG metric and demonstrate how this new metric can be used to evaluate model summaries against multiple references. We also explore different ways of incorporating redundancy into the original metric through extensive experiments. Experimental results demonstrate that the new redundancy-aware metric exhibits a higher correlation with human judgments than the original Sem-nCG metric for both single and multiple reference scenarios.
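To make the gain-based idea concrete, a minimal nCG-style computation follows. It is illustrative only: it omits Sem-nCG's semantic gain model and the new redundancy term, and the gain values are toys.

```python
# Minimal normalized-cumulative-gain sketch: the gain of the k sentences the
# summarizer selected, normalized by the ideal gain of the k best sentences.
def ncg(selected_gains: list[float], all_gains: list[float]) -> float:
    k = len(selected_gains)
    ideal = sum(sorted(all_gains, reverse=True)[:k])
    return sum(selected_gains) / ideal if ideal > 0 else 0.0

gains = [0.9, 0.1, 0.7, 0.4, 0.2]   # toy per-sentence gains vs. the reference
print(ncg([0.9, 0.4], gains))        # model picked sentences 0 and 3 -> 0.8125
```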
Efficient Monaural Speech Enhancement using Spectrum Attention Fusion
results: The proposed model achieves comparable or better results than SOTA models on the Voice Bank + DEMAND dataset, while using significantly fewer parameters (0.58M).
Abstract
Speech enhancement is a demanding task in automated speech processing pipelines, focusing on separating clean speech from noisy channels. Transformer-based models have recently bested RNN and CNN models in speech enhancement, but they are much more computationally expensive and require much more high-quality training data, which is always hard to come by. In this paper, we present an improvement for speech enhancement models that maintains the expressiveness of self-attention while significantly reducing model complexity, which we have termed Spectrum Attention Fusion. We carefully construct a convolutional module to replace several self-attention layers in a speech Transformer, allowing the model to more efficiently fuse spectral features. Our proposed model is able to achieve comparable or better results against SOTA models but with significantly smaller parameters (0.58M) on the Voice Bank + DEMAND dataset.
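The abstract does not specify the convolutional module's exact design; the following sketch only illustrates the general pattern of replacing a self-attention layer with a cheap convolutional block that mixes time steps and fuses spectral channels. Layer sizes and kernel width are hypothetical:

```python
import torch
import torch.nn as nn

class SpectralConvFusion(nn.Module):
    """Illustrative stand-in for a self-attention layer: a depthwise
    convolution mixes time steps, a pointwise convolution fuses the
    spectral channels, with a residual connection and LayerNorm."""
    def __init__(self, channels, kernel_size=7):
        super().__init__()
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, 1)
        self.norm = nn.LayerNorm(channels)
        self.act = nn.GELU()

    def forward(self, x):            # x: (batch, time, channels)
        residual = x
        x = x.transpose(1, 2)        # Conv1d expects (batch, channels, time)
        x = self.pointwise(self.act(self.depthwise(x)))
        x = x.transpose(1, 2)
        return self.norm(x + residual)
```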
results: Experimental results show that all three datasets are of high quality and reliability, and can be used for multilingual processing tasks between English and Sinhala.
Abstract
Parallel datasets are vital for performing and evaluating any kind of multilingual task. However, in the cases where one of the considered language pairs is a low-resource language, the existing top-down parallel data such as corpora are lacking in both tally and quality due to the dearth of human annotation. Therefore, for low-resource languages, it is more feasible to move in the bottom-up direction where finer granular pairs such as dictionary datasets are developed first. They may then be used for mid-level tasks such as supervised multilingual word embedding alignment. These in turn can later guide higher-level tasks in the order of aligning sentence or paragraph text corpora used for Machine Translation (MT). Even though more approachable than generating and aligning a massive corpus for a low-resource language, for the same reason of apathy from larger research entities, even these finer granular data sets are lacking for some low-resource languages. We have observed that there is no free and open dictionary data set for the low-resource language, Sinhala. Thus, in this work, we introduce three parallel English-Sinhala word dictionaries (En-Si-dict-large, En-Si-dict-filtered, En-Si-dict-FastText) which help in multilingual Natural Language Processing (NLP) tasks related to English and Sinhala languages. In this paper, we explain the dataset creation pipeline as well as the experimental results of the tests we have carried out to verify the quality of the data sets. The data sets and the related scripts are available at https://github.com/kasunw22/sinhala-para-dict.
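One of the mid-level tasks the authors mention, supervised multilingual word-embedding alignment, has a standard closed-form solution that a dictionary like these enables. This is a generic orthogonal-Procrustes sketch, not code from the released repository; the FastText-style lookup matrices in the usage comment are assumptions:

```python
import numpy as np

def procrustes_align(X_src, Y_tgt):
    """Orthogonal Procrustes: find orthogonal W minimizing
    ||X_src @ W - Y_tgt||_F, where corresponding rows of X_src/Y_tgt
    are embeddings of the dictionary's English/Sinhala word pairs."""
    U, _, Vt = np.linalg.svd(X_src.T @ Y_tgt)
    return U @ Vt

# Hypothetical usage with FastText-style embedding matrices:
# W = procrustes_align(en_vecs[pair_idx_en], si_vecs[pair_idx_si])
# mapped_en = en_vecs @ W   # English vectors mapped into the Sinhala space
```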
Learning to Paraphrase Sentences to Different Complexity Levels
results: Training on these three datasets, we achieve state-of-the-art performance on the ASSET simplification benchmark and surpass previous work on sentence-level targeting tasks. We also establish how several large language models perform on these tasks in a zero-shot setting.
Abstract
While sentence simplification is an active research topic in NLP, its adjacent tasks of sentence complexification and same-level paraphrasing are not. To train models on all three tasks, we present two new unsupervised datasets. We compare these datasets, one labeled by a weak classifier and the other by a rule-based approach, with a single supervised dataset. Using these three datasets for training, we perform extensive experiments on both multitasking and prompting strategies. Compared to other systems trained on unsupervised parallel data, models trained on our weak classifier labeled dataset achieve state-of-the-art performance on the ASSET simplification benchmark. Our models also outperform previous work on sentence level targeting. Finally, we establish how a handful of Large Language Models perform on these tasks under a zero-shot setting.
ESRL: Efficient Sampling-based Reinforcement Learning for Sequence Generation
results: Experiments show that the two sampling strategies improve efficiency and reduce memory consumption when training sequence generation models with RL. The authors also evaluate ESRL in RL from human feedback (RLHF) by training a large language model with a reward model; ESRL outperforms all baselines in both training efficiency and memory consumption.
Abstract
Applying Reinforcement Learning (RL) to sequence generation models enables the direct optimization of long-term rewards (\textit{e.g.,} BLEU and human feedback), but typically requires large-scale sampling over a space of action sequences. This is a computational challenge as presented by the practice of sequence generation problems, such as machine translation, where we often deal with a large action space (\textit{e.g.,} a vocabulary) and a long action sequence (\textit{e.g.,} a translation). In this work, we introduce two-stage sampling and dynamic sampling approaches to improve the sampling efficiency during training sequence generation models via RL. We experiment with our approaches on the traditional sequence generation tasks, including machine translation and abstractive summarization. Furthermore, we evaluate our approaches in RL from human feedback (RLHF) through training a large language model using the reward model. Experimental results show that the efficient sampling-based RL, referred to as ESRL, can outperform all baselines in terms of both training efficiency and memory consumption. Notably, ESRL yields consistent performance gains over the strong REINFORCE, minimum risk training, and proximal policy optimization methods.
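ESRL's contribution lies in how candidate sequences are sampled (two-stage and dynamic sampling); what those samples feed is a standard policy-gradient update. The sketch below shows only that generic REINFORCE-with-baseline step under the assumption of sequence-level rewards such as BLEU, not the paper's sampling schedule:

```python
import torch

def reinforce_loss(seq_log_probs, rewards):
    """Generic policy-gradient loss over a batch of sampled sequences.
    seq_log_probs: (num_samples,) summed token log-probabilities of
    each sampled sequence; rewards: (num_samples,) sequence-level
    rewards such as sentence BLEU. The mean reward serves as a simple
    variance-reducing baseline."""
    rewards = rewards.detach()
    advantage = rewards - rewards.mean()
    return -(advantage * seq_log_probs).mean()
```

Fewer, better-chosen samples directly shrink both the reward evaluations and the memory held for `seq_log_probs`, which is where the claimed efficiency gains come from.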
Emo-DNA: Emotion Decoupling and Alignment Learning for Cross-Corpus Speech Emotion Recognition
methods: The paper proposes a novel Emotion Decoupling aNd Alignment learning framework (EMO-DNA) for cross-corpus SER, with two novelties: contrastive emotion decoupling and dual-level emotion alignment.
results: Experiments show that EMO-DNA outperforms prior methods in several cross-corpus scenarios. Source code is available at https://github.com/Jiaxin-Ye/Emo-DNA.
Abstract
Cross-corpus speech emotion recognition (SER) seeks to generalize the ability of inferring speech emotion from a well-labeled corpus to an unlabeled one, which is a rather challenging task due to the significant discrepancy between two corpora. Existing methods, typically based on unsupervised domain adaptation (UDA), struggle to learn corpus-invariant features by global distribution alignment, but unfortunately, the resulting features are mixed with corpus-specific features or not class-discriminative. To tackle these challenges, we propose a novel Emotion Decoupling aNd Alignment learning framework (EMO-DNA) for cross-corpus SER, a novel UDA method to learn emotion-relevant corpus-invariant features. The novelties of EMO-DNA are two-fold: contrastive emotion decoupling and dual-level emotion alignment. On one hand, our contrastive emotion decoupling achieves decoupling learning via a contrastive decoupling loss to strengthen the separability of emotion-relevant features from corpus-specific ones. On the other hand, our dual-level emotion alignment introduces an adaptive threshold pseudo-labeling to select confident target samples for class-level alignment, and performs corpus-level alignment to jointly guide model for learning class-discriminative corpus-invariant features across corpora. Extensive experimental results demonstrate the superior performance of EMO-DNA over the state-of-the-art methods in several cross-corpus scenarios. Source code is available at https://github.com/Jiaxin-Ye/Emo-DNA.
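As one concrete piece, the adaptive-threshold pseudo-labeling used for class-level alignment might look like the following. The per-class scaling rule here is a simple stand-in for the paper's actual scheme:

```python
import torch

def select_pseudo_labels(probs, base_threshold=0.9):
    """Illustrative adaptive-threshold pseudo-labeling: keep a target
    sample only if its max class probability exceeds a per-class
    threshold scaled by the model's current confidence on that class."""
    conf, labels = probs.max(dim=1)                # (n,), (n,)
    class_mean = probs.new_zeros(probs.size(1))
    for c in range(probs.size(1)):
        mask = labels == c
        class_mean[c] = conf[mask].mean() if mask.any() else 1.0
    thresholds = base_threshold * class_mean       # per-class thresholds
    keep = conf > thresholds[labels]
    return keep, labels                            # boolean mask, pseudo-labels
```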
From Fake to Hyperpartisan News Detection Using Domain Adaptation
results: Experiments show that these techniques improve performance, and combining clustering and topic modeling algorithms with UDA yields better results than the initial UDA setup.
Abstract
Unsupervised Domain Adaptation (UDA) is a popular technique that aims to reduce the domain shift between two data distributions. It was successfully applied in computer vision and natural language processing. In the current work, we explore the effects of various unsupervised domain adaptation techniques between two text classification tasks: fake and hyperpartisan news detection. We investigate the knowledge transfer from fake to hyperpartisan news detection without involving target labels during training. Thus, we evaluate UDA, cluster alignment with a teacher, and cross-domain contrastive learning. Extensive experiments show that these techniques improve performance, while including data augmentation further enhances the results. In addition, we combine clustering and topic modeling algorithms with UDA, resulting in improved performances compared to the initial UDA setup.
Scaling Clinical Trial Matching Using Large Language Models: A Case Study in Oncology
results: Initial findings show that out-of-the-box LLMs can already structure eligibility criteria and, while still imperfect, substantially outperform prior baselines, so they may serve as a preliminary triage step with humans in the loop. The study also identifies significant growth areas for applying LLMs to end-to-end clinical trial matching, such as context limitation and accuracy, especially in structuring patient information from longitudinal medical records.
Abstract
Clinical trial matching is a key process in health delivery and discovery. In practice, it is plagued by overwhelming unstructured data and unscalable manual processing. In this paper, we conduct a systematic study on scaling clinical trial matching using large language models (LLMs), with oncology as the focus area. Our study is grounded in a clinical trial matching system currently in test deployment at a large U.S. health network. Initial findings are promising: out of box, cutting-edge LLMs, such as GPT-4, can already structure elaborate eligibility criteria of clinical trials and extract complex matching logic (e.g., nested AND/OR/NOT). While still far from perfect, LLMs substantially outperform prior strong baselines and may serve as a preliminary solution to help triage patient-trial candidates with humans in the loop. Our study also reveals a few significant growth areas for applying LLMs to end-to-end clinical trial matching, such as context limitation and accuracy, especially in structuring patient information from longitudinal medical records.
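A hedged sketch of how eligibility criteria could be structured into nested AND/OR/NOT logic with an LLM: the prompt wording and the JSON schema below are hypothetical illustrations, not the prompts or output format used in the study:

```python
# Hypothetical prompt template; fill with PROMPT.format(criteria=...).
PROMPT = """Convert the clinical trial eligibility criteria below into a
JSON logic tree using only the operators AND, OR, NOT, with leaf nodes
of the form {{"attribute": ..., "op": ..., "value": ...}}.

Criteria:
{criteria}
"""

# Example of the kind of structured output the prompt targets
# (invented for illustration):
example_output = {
    "AND": [
        {"attribute": "diagnosis", "op": "==", "value": "NSCLC"},
        {"NOT": {"attribute": "prior_therapy", "op": "contains",
                 "value": "EGFR inhibitor"}},
        {"OR": [
            {"attribute": "ECOG", "op": "<=", "value": 1},
            {"attribute": "age", "op": "<", "value": 65},
        ]},
    ]
}
```

Once criteria live in a tree like this, matching a structured patient record against a trial reduces to evaluating the tree, and human reviewers only need to audit the extraction.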
You talk what you read: Understanding News Comment Behavior by Dispositional and Situational Attribution
results: The resulting dispositional and situational attribution helps to understand user focus and opinions, and is validated in applications such as reader-aware news summarization and news aspect-opinion forecasting.
Abstract
Many news comment mining studies are based on the assumption that a comment is explicitly linked to the corresponding news. In this paper, we observe that users' comments are also heavily influenced by their individual characteristics embodied in their interaction history. Therefore, we propose to understand news comment behavior by considering both the dispositional factors from the news interaction history and the situational factors from the corresponding news. A three-part encoder-decoder framework is proposed to model the generative process of news comments. The resulting dispositional and situational attribution contributes to understanding user focus and opinions, which is validated in applications of reader-aware news summarization and news aspect-opinion forecasting.
Speaker Diarization of Scripted Audiovisual Content
results: Evaluated on a 66-show test set, the proposed method achieves a 51.7% relative improvement on our metrics over two unsupervised baseline models.
Abstract
The media localization industry usually requires a verbatim script of the final film or TV production in order to create subtitles or dubbing scripts in a foreign language. In particular, the verbatim script (i.e. as-broadcast script) must be structured into a sequence of dialogue lines each including time codes, speaker name and transcript. Current speech recognition technology alleviates the transcription step. However, state-of-the-art speaker diarization models still fall short on TV shows for two main reasons: (i) their inability to track a large number of speakers, (ii) their low accuracy in detecting frequent speaker changes. To mitigate this problem, we present a novel approach to leverage production scripts used during the shooting process, to extract pseudo-labeled data for the speaker diarization task. We propose a novel semi-supervised approach and demonstrate improvements of 51.7% relative to two unsupervised baseline models on our metrics on a 66 show test set.
Capturing Spectral and Long-term Contextual Information for Speech Emotion Recognition Using Deep Learning Techniques
results: The results indicate that the ensemble model can overcome the limitations of traditional methods and achieve higher accuracy in recognizing emotions from speech.
Abstract
Traditional approaches in speech emotion recognition, such as LSTM, CNN, RNN, SVM, and MLP, have limitations such as difficulty capturing long-term dependencies in sequential data, capturing the temporal dynamics, and struggling to capture complex patterns and relationships in multimodal data. This research addresses these shortcomings by proposing an ensemble model that combines Graph Convolutional Networks (GCN) for processing textual data and the HuBERT transformer for analyzing audio signals. We found that GCNs excel at capturing Long-term contextual dependencies and relationships within textual data by leveraging graph-based representations of text and thus detecting the contextual meaning and semantic relationships between words. On the other hand, HuBERT utilizes self-attention mechanisms to capture long-range dependencies, enabling the modeling of temporal dynamics present in speech and capturing subtle nuances and variations that contribute to emotion recognition. By combining GCN and HuBERT, our ensemble model can leverage the strengths of both approaches. This allows for the simultaneous analysis of multimodal data, and the fusion of these modalities enables the extraction of complementary information, enhancing the discriminative power of the emotion recognition system. The results indicate that the combined model can overcome the limitations of traditional methods, leading to enhanced accuracy in recognizing emotions from speech.
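The abstract leaves the fusion mechanism open; a minimal late-fusion baseline for combining the two branches would look like this, with `alpha` a hypothetical mixing weight tuned on validation data:

```python
import torch

def late_fusion(text_logits, audio_logits, alpha=0.5):
    """Mix class probabilities from the GCN (text) and HuBERT (audio)
    branches; alpha is a hypothetical weight, tuned in practice."""
    p_text = torch.softmax(text_logits, dim=-1)
    p_audio = torch.softmax(audio_logits, dim=-1)
    return alpha * p_text + (1 - alpha) * p_audio
```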
Tweet Insights: A Visualization Platform to Extract Temporal Insights from Twitter
results: The dataset enables exploring and characterizing temporal shifts in meaning, including complementary information such as sentiment and topic association over time.
Abstract
This paper introduces a large collection of time series data derived from Twitter, postprocessed using word embedding techniques, as well as specialized fine-tuned language models. This data comprises the past five years and captures changes in n-gram frequency, similarity, sentiment and topic distribution. The interface built on top of this data enables temporal analysis for detecting and characterizing shifts in meaning, including complementary information to trending metrics, such as sentiment and topic association over time. We release an online demo for easy experimentation, and we share code and the underlying aggregated data for future work. In this paper, we also discuss three case studies unlocked thanks to our platform, showcasing its potential for temporal linguistic analysis.
ParaFuzz: An Interpretability-Driven Technique for Detecting Poisoned Samples in NLP
results: Experiments on 4 types of backdoor attacks, including subtle style backdoors, and 4 distinct datasets show that the proposed method surpasses baseline methods (STRIP, RAP, ONION) in precision and recall.
Abstract
Backdoor attacks have emerged as a prominent threat to natural language processing (NLP) models, where the presence of specific triggers in the input can lead poisoned models to misclassify these inputs to predetermined target classes. Current detection mechanisms are limited by their inability to address more covert backdoor strategies, such as style-based attacks. In this work, we propose an innovative test-time poisoned sample detection framework that hinges on the interpretability of model predictions, grounded in the semantic meaning of inputs. We contend that triggers (e.g., infrequent words) are not supposed to fundamentally alter the underlying semantic meanings of poisoned samples as they want to stay stealthy. Based on this observation, we hypothesize that while the model's predictions for paraphrased clean samples should remain stable, predictions for poisoned samples should revert to their true labels upon the mutations applied to triggers during the paraphrasing process. We employ ChatGPT, a state-of-the-art large language model, as our paraphraser and formulate the trigger-removal task as a prompt engineering problem. We adopt fuzzing, a technique commonly used for unearthing software vulnerabilities, to discover optimal paraphrase prompts that can effectively eliminate triggers while concurrently maintaining input semantics. Experiments on 4 types of backdoor attacks, including the subtle style backdoors, and 4 distinct datasets demonstrate that our approach surpasses baseline methods, including STRIP, RAP, and ONION, in precision and recall.
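The core test-time check reduces to a paraphrase-consistency loop. In this sketch, `classify` and `paraphrase` are placeholders for the deployed model under test and a ChatGPT-style paraphraser driven by a fuzzing-selected prompt:

```python
def flag_poisoned(texts, classify, paraphrase):
    """Flag inputs whose predicted label flips after paraphrasing.
    A stealthy trigger is not supposed to change the input's semantics,
    so semantic-preserving paraphrasing should destroy the trigger;
    a flipped prediction is treated as evidence of poisoning."""
    return [t for t in texts if classify(t) != classify(paraphrase(t))]
```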
Chinese Financial Text Emotion Mining: GCGTS – A Character Relationship-based Approach for Simultaneous Aspect-Opinion Pair Extraction
results: Compared with the previous SDRN and GTS models, the proposed GCGTS model demonstrates significant performance improvements on Chinese financial texts, offering a new and effective approach to AOPE.
Abstract
Aspect-Opinion Pair Extraction (AOPE) from Chinese financial texts is a specialized task in fine-grained text sentiment analysis. The main objective is to extract aspect terms and opinion terms simultaneously from a diverse range of financial texts. Previous studies have mainly focused on developing grid annotation schemes within grid-based models to facilitate this extraction process. However, these methods often rely on character-level (token-level) feature encoding, which may overlook the logical relationships between Chinese characters within words. To address this limitation, we propose a novel method called Graph-based Character-level Grid Tagging Scheme (GCGTS). The GCGTS method explicitly incorporates syntactic structure using Graph Convolutional Networks (GCN) and unifies the encoding of characters within the same syntactic semantic unit (Chinese word level). Additionally, we introduce an image convolutional structure into the grid model to better capture the local relationships between characters within evaluation units. This innovative structure reduces the excessive reliance on pre-trained language models and emphasizes the modeling of structure and local relationships, thereby improving the performance of the model on Chinese financial texts. Through comparative experiments with advanced models such as Synchronous Double-channel Recurrent Network (SDRN) and Grid Tagging Scheme (GTS), the proposed GCGTS model demonstrates significant improvements in performance.
Prompt2Gaussia: Uncertain Prompt-learning for Script Event Prediction
methods: Public pre-trained language models are used as knowledge bases, script-related knowledge is mined automatically via prompt-learning, and Gaussian distributions are used to represent the uncertainty of prompt and label tokens.
results: The method outperforms prior baselines by 1.46% and 1.05% on two benchmarks.
Abstract
Script Event Prediction (SEP) aims to predict the subsequent event for a given event chain from a candidate list. Prior research has achieved great success by integrating external knowledge to enhance the semantics, but it is laborious to acquire the appropriate knowledge resources and retrieve the script-related knowledge. In this paper, we regard public pre-trained language models as knowledge bases and automatically mine the script-related knowledge via prompt-learning. Still, the scenario-diversity and label-ambiguity in scripts make it uncertain to construct the most functional prompt and label token in prompt learning, i.e., prompt-uncertainty and verbalizer-uncertainty. Considering the innate ability of Gaussian distribution to express uncertainty, we deploy the prompt tokens and label tokens as random variables following Gaussian distributions, where a prompt estimator and a verbalizer estimator are proposed to estimate their probabilistic representations instead of deterministic representations. We take the lead to explore prompt-learning in SEP and provide a fresh perspective to enrich the script semantics. Our method is evaluated on the most widely used benchmark and a newly proposed large-scale one. Experiments show that our method, which benefits from knowledge evoked from pre-trained language models, outperforms prior baselines by 1.46\% and 1.05\% on two benchmarks, respectively.
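Prompt tokens as Gaussian random variables can be sketched with the reparameterization trick; treating inference as using the mean is an illustrative reading, not necessarily the paper's exact procedure:

```python
import torch
import torch.nn as nn

class GaussianPrompt(nn.Module):
    """Prompt tokens as random variables: each token has a learned
    mean and log-variance. Training samples embeddings via the
    reparameterization trick; inference uses the mean."""
    def __init__(self, num_tokens, dim):
        super().__init__()
        self.mu = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.logvar = nn.Parameter(torch.zeros(num_tokens, dim))

    def forward(self):
        if self.training:
            std = torch.exp(0.5 * self.logvar)
            return self.mu + std * torch.randn_like(std)
        return self.mu
```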
Causality Guided Disentanglement for Cross-Platform Hate Speech Detection
results: Experimental results across four platforms show that the model detects generalized hate speech more effectively than existing state-of-the-art methods.
Abstract
Social media platforms, despite their value in promoting open discourse, are often exploited to spread harmful content. Current deep learning and natural language processing models used for detecting this harmful content overly rely on domain-specific terms affecting their capabilities to adapt to generalizable hate speech detection. This is because they tend to focus too narrowly on particular linguistic signals or the use of certain categories of words. Another significant challenge arises when platforms lack high-quality annotated data for training, leading to a need for cross-platform models that can adapt to different distribution shifts. Our research introduces a cross-platform hate speech detection model capable of being trained on one platform's data and generalizing to multiple unseen platforms. To achieve good generalizability across platforms, one way is to disentangle the input representations into invariant and platform-dependent features. We also argue that learning causal relationships, which remain constant across diverse environments, can significantly aid in understanding invariant representations in hate speech. By disentangling input into platform-dependent features (useful for predicting hate targets) and platform-independent features (used to predict the presence of hate), we learn invariant representations resistant to distribution shifts. These features are then used to predict hate speech across unseen platforms. Our extensive experiments across four platforms highlight our model's enhanced efficacy compared to existing state-of-the-art methods in detecting generalized hate speech.
Seasonality Based Reranking of E-commerce Autocomplete Using Natural Language Queries
results: The study finds that incorporating seasonality signals into the autocomplete ranking model can improve autocomplete relevance and business metrics.
Abstract
Query autocomplete (QAC), also known as typeahead, suggests a list of complete queries as the user types a prefix in the search box. It is one of the key features of modern search engines, especially in e-commerce. One of the goals of typeahead is to suggest relevant queries to users that are seasonally important. In this paper we propose a neural network based natural language processing (NLP) algorithm to incorporate seasonality as a signal, and present an end-to-end evaluation of the QAC ranking model. Incorporating seasonality into the autocomplete ranking model can improve autocomplete relevance and business metrics.
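A toy version of seasonality-aware reranking: blend a base relevance score with a per-month seasonal score. The weights, score sources, and candidate format are hypothetical:

```python
def rerank(candidates, month, w=0.3):
    """Blend base relevance with a per-month seasonal score.
    candidates: list of (query, base_score, seasonal_scores) where
    seasonal_scores maps month -> popularity; all values invented."""
    scored = [(q, (1 - w) * base + w * season.get(month, 0.0))
              for q, base, season in candidates]
    return sorted(scored, key=lambda s: s[1], reverse=True)

# Example: in December, "snow boots" outranks a higher base-score query.
print(rerank([("snow boots", 0.4, {12: 0.9}),
              ("sandals", 0.5, {6: 0.9})], month=12))
```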
Efficient Sentiment Analysis: A Resource-Aware Evaluation of Feature Extraction Techniques, Ensembling, and Deep Learning Models
paper_authors: Mahammed Kamruzzaman, Gene Louis Kim
for: This paper evaluates document-level sentiment analysis models with attention to the resource costs of model deployment.
methods: The paper compares different feature extraction techniques, ensembling, task-specific deep learning models, and domain-independent large language models.
results: A fine-tuned large language model achieves the best accuracy, but some configurations provide huge (up to 24,283*) resource savings for a marginal (<1%) loss in accuracy. For smaller datasets, the differences in accuracy shrink while the differences in resource consumption grow.
Abstract
While reaching for NLP systems that maximize accuracy, other important metrics of system performance are often overlooked. Prior models are easily forgotten despite their possible suitability in settings where large computing resources are unavailable or relatively more costly. In this paper, we perform a broad comparative evaluation of document-level sentiment analysis models with a focus on resource costs that are important for the feasibility of model deployment and general climate consciousness. Our experiments consider different feature extraction techniques, the effect of ensembling, task-specific deep learning modeling, and domain-independent large language models (LLMs). We find that while a fine-tuned LLM achieves the best accuracy, some alternate configurations provide huge (up to 24,283*) resource savings for a marginal (<1%) loss in accuracy. Furthermore, we find that for smaller datasets, the differences in accuracy shrink while the difference in resource consumption grows further.
Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty
for: The paper aims to improve the sample efficiency of language models.
methods: The paper uses an ensemble of a GPT-2 and small LLaMA models, as well as distillation techniques.
results: The distilled LLaMA model exceeds the performance of both of its teachers and a similar model trained without distillation.
Abstract
We present our proposed solution to the BabyLM challenge [arXiv:2301.11796], whose goal was to improve the sample efficiency of language models. We trained an ensemble consisting of a GPT-2 and small LLaMA models on the developmentally-plausible, 10M-word BabyLM dataset, then distilled it into a small, 58M-parameter LLaMA model, which exceeds in performance both of its teachers as well as a similar model trained without distillation. This suggests that distillation can not only retain the full performance of the teacher model when the latter is trained on a sufficiently small dataset; it can exceed it, and lead to significantly better performance than direct training.
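Distilling from an ensemble of teachers commonly uses an averaged, temperature-softened target distribution; the following is that standard loss, with the temperature `T` a hypothetical choice rather than the paper's setting:

```python
import torch
import torch.nn.functional as F

def ensemble_distill_loss(student_logits, teacher_logits_list, T=2.0):
    """KL distillation against the average of the teachers'
    temperature-softened distributions. The T*T factor keeps gradient
    magnitudes comparable across temperatures."""
    teacher_probs = torch.stack(
        [F.softmax(t / T, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)
    student_logp = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(student_logp, teacher_probs,
                    reduction="batchmean") * (T * T)
```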
Federated Representation Learning for Automatic Speech Recognition
for: This paper aims to learn Automatic Speech Recognition (ASR) representations while preserving data privacy, using Federated Learning (FL) and Self-supervised Learning (SSL).
methods: The paper uses the Contrastive Predictive Coding framework with FedSGD to pre-train an LSTM encoder on unlabeled speech data from Libri-Light, simulating non-IID speaker-siloed data distributions.
results: The pre-trained ASR encoder in FL performs as well as a centrally pre-trained model and produces an improvement of 12-15% (WER) compared to no pre-training. The federated pre-trained models are also adapted to a new language, French, and show a 20% (WER) improvement over no pre-training.
Abstract
Federated Learning (FL) is a privacy-preserving paradigm, allowing edge devices to learn collaboratively without sharing data. Edge devices like Alexa and Siri are prospective sources of unlabeled audio data that can be tapped to learn robust audio representations. In this work, we bring Self-supervised Learning (SSL) and FL together to learn representations for Automatic Speech Recognition respecting data privacy constraints. We use the speaker and chapter information in the unlabeled speech dataset, Libri-Light, to simulate non-IID speaker-siloed data distributions and pre-train an LSTM encoder with the Contrastive Predictive Coding framework with FedSGD. We show that the pre-trained ASR encoder in FL performs as well as a centrally pre-trained model and produces an improvement of 12-15% (WER) compared to no pre-training. We further adapt the federated pre-trained models to a new language, French, and show a 20% (WER) improvement over no pre-training.
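FedSGD itself is simple to sketch: clients compute gradients locally and only gradients are aggregated. The client and gradient bookkeeping below is illustrative, not the paper's training code:

```python
import torch

def fedsgd_round(global_model, client_grads, lr=0.01):
    """One FedSGD round: client_grads is a list (one entry per client)
    of per-parameter gradient tensors computed locally on
    speaker-siloed data; only gradients, never raw audio, are shared."""
    with torch.no_grad():
        for i, p in enumerate(global_model.parameters()):
            avg_grad = torch.stack([g[i] for g in client_grads]).mean(dim=0)
            p -= lr * avg_grad
```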
Bengali Fake Reviews: A Benchmark Dataset and Detection System
paper_authors: G. M. Shahariar, Md. Tanvir Rouf Shawon, Faisal Muhammad Shah, Mohammad Shafiul Alam, Md. Shahriar Mahbub
for: This paper aims to identify fake reviews in the Bengali language, which is an under-explored research area in the field of fake review detection.
methods: The authors propose a unique pipeline to translate English words to their corresponding Bengali meaning and back transliterate Romanized Bengali to Bengali. They also use multiple deep learning and pre-trained transformer language models to develop a reliable detection system.
results: The proposed ensemble model achieved a weighted F1-score of 0.9843 on 13390 reviews, including 1339 actual fake reviews and 5356 augmented fake reviews generated with the nlpaug library. The model achieved a 0.9558 weighted F1-score when the fake reviews were augmented using the bnaug library.
Abstract
The proliferation of fake reviews on various online platforms has created a major concern for both consumers and businesses. Such reviews can deceive customers and cause damage to the reputation of products or services, making it crucial to identify them. Although the detection of fake reviews has been extensively studied in English language, detecting fake reviews in non-English languages such as Bengali is still a relatively unexplored research area. This paper introduces the Bengali Fake Review Detection (BFRD) dataset, the first publicly available dataset for identifying fake reviews in Bengali. The dataset consists of 7710 non-fake and 1339 fake food-related reviews collected from social media posts. To convert non-Bengali words in a review, a unique pipeline has been proposed that translates English words to their corresponding Bengali meaning and also back transliterates Romanized Bengali to Bengali. We have conducted rigorous experimentation using multiple deep learning and pre-trained transformer language models to develop a reliable detection system. Finally, we propose a weighted ensemble model that combines four pre-trained transformers: BanglaBERT, BanglaBERT Base, BanglaBERT Large, and BanglaBERT Generator . According to the experiment results, the proposed ensemble model obtained a weighted F1-score of 0.9843 on 13390 reviews, including 1339 actual fake reviews and 5356 augmented fake reviews generated with the nlpaug library. The remaining 6695 reviews were randomly selected from the 7710 non-fake instances. The model achieved a 0.9558 weighted F1-score when the fake reviews were augmented using the bnaug library.
Athena 2.0: Discourse and User Modeling in Open Domain Dialogue
results: The paper shows that Athena 2.0 can engage users in coherent conversations on a range of popular topics and can personalize topic selection and other aspects of the conversation to individual users.
Abstract
Conversational agents are consistently growing in popularity and many people interact with them every day. While many conversational agents act as personal assistants, they can have many different goals. Some are task-oriented, such as providing customer support for a bank or making a reservation. Others are designed to be empathetic and to form emotional connections with the user. The Alexa Prize Challenge aims to create a socialbot, which allows the user to engage in coherent conversations, on a range of popular topics that will interest the user. Here we describe Athena 2.0, UCSC's conversational agent for Amazon's Socialbot Grand Challenge 4. Athena 2.0 utilizes a novel knowledge-grounded discourse model that tracks the entity links that Athena introduces into the dialogue, and uses them to constrain named-entity recognition and linking, and coreference resolution. Athena 2.0 also relies on a user model to personalize topic selection and other aspects of the conversation to individual users.
Tag Prediction of Competitive Programming Problems using Deep Learning Techniques
results: Experimental results show that the MLP model achieves the highest accuracy of 78.0%.
Abstract
In the past decade, the amount of research being done in the fields of machine learning and deep learning, predominantly in the area of natural language processing (NLP), has risen dramatically. A well-liked method for developing programming abilities like logic building and problem solving is competitive programming. It can be tough for novices and even veteran programmers to traverse the wide collection of questions due to the massive number of accessible questions and the variety of themes, levels of difficulty, and questions offered. In order to help programmers find questions that are appropriate for their knowledge and interests, there is a need for an automated method. This can be done using automated tagging of the questions using Text Classification. Text classification is one of the important tasks widely researched in the field of Natural Language Processing. In this paper, we present a way to use text classification techniques to determine the domain of a competitive programming problem. A variety of models are implemented, including LSTM, GRU, and MLP. The dataset has been scraped from Codeforces, a major competitive programming website. A total of 2400 problems were scraped and preprocessed, which we used as a dataset for our training and testing of models. The maximum accuracy reached using our models is 78.0%, by the MLP (Multi Layer Perceptron).
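A compact baseline for the same task, tag prediction as multi-label text classification with an MLP, can be set up in a few lines; the miniature dataset here is invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical miniature dataset of problem statements and tags.
statements = ["find shortest path between nodes",
              "count subsets summing to k",
              "maximum flow in a network"]
tags = [["graphs"], ["dp"], ["graphs", "flows"]]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(tags)                    # multi-hot tag matrix
X = TfidfVectorizer().fit_transform(statements)

clf = OneVsRestClassifier(MLPClassifier(hidden_layer_sizes=(64,),
                                        max_iter=500))
clf.fit(X, y)
print(mlb.inverse_transform(clf.predict(X)))   # predicted tag sets
```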
Wider and Deeper LLM Networks are Fairer LLM Evaluators
results: The study finds that wider and deeper networks lead to fairer evaluations. Leveraging WideDeep also accelerates evaluation by 4.6 times and reaches a 93% agreement level with humans.
Abstract
Measuring the quality of responses generated by LLMs is a challenging task, particularly when it comes to evaluating whether the response is aligned with human preference. A novel approach involves using the LLM itself to make evaluation and stabilizing the results through multiple independent evaluations, similar to a single-layer narrow LLM network. This network consists of a fixed number of neurons, with each neuron being the same LLM. In this paper, we draw upon the extensive research on deep neural networks to explore whether deeper and wider networks can lead to fairer evaluations. Specifically, inspired by the observation that different neurons in a neural network are responsible for detecting different concepts, we first adaptively generate as many neuron roles as possible for each evaluation sample. Each perspective corresponds to the role of a specific LLM neuron in the first layer. In subsequent layers, we follow the idea that higher layers in deep networks are responsible for more comprehensive features, each layer receives representations from all neurons in the previous layer, integrating the locally learned evaluation information to obtain a more comprehensive evaluation result. Interestingly, this network design resembles the process of academic paper reviewing. To validate the effectiveness of our method, we construct the largest and most diverse English evaluation benchmark LLMEval$^2$ for LLM evaluators, comprising 15 tasks, 8 abilities, and 2,553 samples. Experimental results demonstrate that a wider network (involving many reviewers) with 2 layers (one round of discussion) performs the best, improving kappa correlation coefficient from 0.28 to 0.34. We also leverage WideDeep to aid in the assessment of Chinese LLMs, which has accelerated the evaluation time by 4.6 times, resulting in a 60% cost saving. WideDeep achieves a remarkable 93% agreement level among humans.
Curricular Transfer Learning for Sentence Encoded Tasks
results: In our experiments, we obtain a considerable improvement over other known pre-training approaches on the MultiWoZ task.
Abstract
Fine-tuning language models in a downstream task is the standard approach for many state-of-the-art methodologies in the field of NLP. However, when the distribution between the source task and target task drifts, \textit{e.g.}, conversational environments, these gains tend to be diminished. This article proposes a sequence of pre-training steps (a curriculum) guided by "data hacking" and grammar analysis that allows further gradual adaptation between pre-training distributions. In our experiments, we acquire a considerable improvement from our method compared to other known pre-training approaches for the MultiWoZ task.
XNLP: An Interactive Demonstration System for Universal Structured NLP
results: The system advances the XNLP demonstration platform in multiple aspects, including universal XNLP modeling, high performance, interpretability, scalability, and interactivity, providing researchers a unified platform for exploring diverse XNLP tasks.
Abstract
Structured Natural Language Processing (XNLP) is an important subset of NLP that entails understanding the underlying semantic or syntactic structure of texts, which serves as a foundational component for many downstream applications. Despite certain recent efforts to explore universal solutions for specific categories of XNLP tasks, a comprehensive and effective approach for unifying all XNLP tasks long remains underdeveloped. In the meanwhile, while XNLP demonstration systems are vital for researchers exploring various XNLP tasks, existing platforms can be limited to, e.g., supporting few XNLP tasks, lacking interactivity and universalness. To this end, we propose an advanced XNLP demonstration platform, where we propose leveraging LLM to achieve universal XNLP, with one model for all with high generalizability. Overall, our system advances in multiple aspects, including universal XNLP modeling, high performance, interpretability, scalability, and interactivity, providing a unified platform for exploring diverse XNLP tasks in the community. XNLP is online: https://xnlp.haofei.vip
results: Validated through multiple biologically motivated strategies and comparisons against state-of-the-art cell tracking methods, the method effectively tracks cell motion and can handle large numbers of video frames with heavy artifacts.
Abstract
The accurate tracking of live cells using video microscopy recordings remains a challenging task for popular state-of-the-art image processing based object tracking methods. In recent years, several existing and new applications have attempted to integrate deep-learning based frameworks for this task, but most of them still heavily rely on consecutive frame based tracking embedded in their architecture or other premises that hinder generalized learning. To address this issue, we aimed to develop a new deep-learning based tracking method that relies solely on the assumption that cells can be tracked based on their spatio-temporal neighborhood, without restricting it to consecutive frames. The proposed method has the additional benefit that the motion patterns of the cells can be learned completely by the predictor without any prior assumptions, and it has the potential to handle a large number of video frames with heavy artifacts. The efficacy of the proposed method is demonstrated through multiple biologically motivated validation strategies and compared against several state-of-the-art cell tracking methods.
Learning Optimal Admission Control in Partially Observable Queueing Networks
results: The results show that the algorithm learns the optimal average holding/rejection cost in the partially observable setting, with a regret that depends only sub-linearly on the maximal number of jobs S and, unlike prior analyses, does not depend on the diameter of the underlying MDP.
Abstract
We present an efficient reinforcement learning algorithm that learns the optimal admission control policy in a partially observable queueing network. Specifically, only the arrival and departure times from the network are observable, and optimality refers to the average holding/rejection cost in infinite horizon. While reinforcement learning in Partially Observable Markov Decision Processes (POMDP) is prohibitively expensive in general, we show that our algorithm has a regret that only depends sub-linearly on the maximal number of jobs in the network, $S$. In particular, in contrast with existing regret analyses, our regret bound does not depend on the diameter of the underlying Markov Decision Process (MDP), which in most queueing systems is at least exponential in $S$. The novelty of our approach is to leverage Norton's equivalent theorem for closed product-form queueing networks and an efficient reinforcement learning algorithm for MDPs with the structure of birth-and-death processes.
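For context, the fully observed version of this problem is a birth-death MDP that classical dynamic programming solves; the paper's contribution is learning the policy when only arrivals and departures are observable. A toy value iteration for the observable baseline, with all parameters (rates, costs, discount) illustrative:

```python
import numpy as np

def admission_policy(S=20, lam=0.7, mu=1.0, h=1.0, c=5.0,
                     gamma=0.99, iters=5000):
    """Discounted value iteration for admission control in a fully
    observed M/M/1 queue with at most S jobs (uniformized chain).
    h is the holding-cost rate, c the per-job rejection cost."""
    Lam = lam + mu                       # uniformization constant
    V = np.zeros(S + 1)
    for _ in range(iters):
        Vn = np.empty_like(V)
        for s in range(S + 1):
            admit = V[s + 1] if s < S else np.inf   # cannot admit at cap
            reject = c + V[s]
            arrival = min(admit, reject)
            departure = V[max(s - 1, 0)]
            Vn[s] = (h * s + gamma * (lam * arrival + mu * departure)) / Lam
        V = Vn
    # Admit in state s iff doing so is no worse than paying the rejection cost.
    return ["admit" if V[s + 1] <= c + V[s] else "reject" for s in range(S)]
```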
Scaling Survival Analysis in Healthcare with Federated Survival Forests: A Comparative Study on Heart Failure and Breast Cancer Genomics
results: Experiments show that FedSurF++ achieves performance comparable to existing methods while requiring only a single communication round, with gains in efficiency, robustness, and privacy preservation. Additionally, the paper demonstrates the success of FedSurF++ on two real-world datasets, highlighting its potential for improving the scalability and effectiveness of survival analysis in distributed settings while preserving user privacy.
Abstract
Survival analysis is a fundamental tool in medicine, modeling the time until an event of interest occurs in a population. However, in real-world applications, survival data are often incomplete, censored, distributed, and confidential, especially in healthcare settings where privacy is critical. The scarcity of data can severely limit the scalability of survival models to distributed applications that rely on large data pools. Federated learning is a promising technique that enables machine learning models to be trained on multiple datasets without compromising user privacy, making it particularly well-suited for addressing the challenges of survival data and large-scale survival applications. Despite significant developments in federated learning for classification and regression, many directions remain unexplored in the context of survival analysis. In this work, we propose an extension of the Federated Survival Forest algorithm, called FedSurF++. This federated ensemble method constructs random survival forests in heterogeneous federations. Specifically, we investigate several new tree sampling methods from client forests and compare the results with state-of-the-art survival models based on neural networks. The key advantage of FedSurF++ is its ability to achieve comparable performance to existing methods while requiring only a single communication round to complete. The extensive empirical investigation results in a significant improvement from the algorithmic and privacy preservation perspectives, making the original FedSurF algorithm more efficient, robust, and private. We also present results on two real-world datasets demonstrating the success of FedSurF++ in real-world healthcare studies. Our results underscore the potential of FedSurF++ to improve the scalability and effectiveness of survival analysis in distributed settings while preserving user privacy.
Harnessing the Web and Knowledge Graphs for Automated Impact Investing Scoring
paper_authors: Qingzhi Hu, Daniel Daza, Laurens Swinkels, Kristina Ūsaitė, Robbert-Jan ‘t Hoen, Paul Groth
for: This paper aims to automate the process of creating an SDG framework for companies.
methods: The proposed system uses a data-driven approach, collecting and filtering a dataset of texts from various web sources and a knowledge graph, and then training classifiers to predict SDG scores for a given company.
results: The best performing model achieved a micro average F1 score of 0.89, demonstrating the effectiveness of the proposed solution. Additionally, the system provides explanations in the form of data relevant to the predicted score, facilitating its use by humans.
Abstract
The Sustainable Development Goals (SDGs) were introduced by the United Nations in order to encourage policies and activities that help guarantee human prosperity and sustainability. SDG frameworks produced in the finance industry are designed to provide scores that indicate how well a company aligns with each of the 17 SDGs. This scoring enables a consistent assessment of investments that have the potential of building an inclusive and sustainable economy. As a result of the high quality and reliability required by such frameworks, the process of creating and maintaining them is time-consuming and requires extensive domain expertise. In this work, we describe a data-driven system that seeks to automate the process of creating an SDG framework. First, we propose a novel method for collecting and filtering a dataset of texts from different web sources and a knowledge graph relevant to a set of companies. We then implement and deploy classifiers trained with this data for predicting scores of alignment with SDGs for a given company. Our results indicate that our best performing model can accurately predict SDG scores with a micro average F1 score of 0.89, demonstrating the effectiveness of the proposed solution. We further describe how the integration of the models for its use by humans can be facilitated by providing explanations in the form of data relevant to a predicted score. We find that our proposed solution enables access to a large amount of information that analysts would normally not be able to process, resulting in an accurate prediction of SDG scores at a fraction of the cost.
A Machine Learning Method for Predicting Traffic Signal Timing from Probe Vehicle Data
paper_authors: Juliette Ugirumurera, Joseph Severino, Erik A. Bensen, Qichao Wang, Jane Macfarlane
for: This paper aims to estimate traffic signal timing information from vehicle probe data using machine learning techniques.
methods: The authors use Extreme Gradient Boosting (XGBoost) model to estimate signal cycle lengths and a neural network model to determine the corresponding red times per phase from probe data.
results: The authors achieve an error of less than 0.56 seconds for cycle length predictions and red times predictions within 7.2 seconds on average.
Abstract
Traffic signals play an important role in transportation by enabling traffic flow management, and ensuring safety at intersections. In addition, knowing the traffic signal phase and timing data can allow optimal vehicle routing for time and energy efficiency, eco-driving, and the accurate simulation of signalized road networks. In this paper, we present a machine learning (ML) method for estimating traffic signal timing information from vehicle probe data. To the authors' best knowledge, very few works have presented ML techniques for determining traffic signal timing parameters from vehicle probe data. In this work, we develop an Extreme Gradient Boosting (XGBoost) model to estimate signal cycle lengths and a neural network model to determine the corresponding red times per phase from probe data. The green times are then derived from the cycle length and red times. Our results show an error of less than 0.56 sec for cycle length, and red times predictions within 7.2 sec error on average.
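A hedged sketch of the cycle-length regression step with XGBoost: in real use the feature columns would be probe-derived statistics (stop durations, headways, arrival times), whereas the arrays here are synthetic placeholders:

```python
import numpy as np
from xgboost import XGBRegressor

# Synthetic placeholders: rows are (intersection, time-window) samples,
# columns are probe-derived features; y is the observed cycle length.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 60 + 30 * rng.random(200)          # cycle lengths in seconds

model = XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X, y)
print(model.predict(X[:3]))            # estimated cycle lengths
```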
Color Image Recovery Using Generalized Matrix Completion over Higher-Order Finite Dimensional Algebra
results: Extensive experiments were conducted on various algorithms using simulated data and publicly available images, with comparisons against conventional matrix and tensor completion algorithms. The results show that the generalized matrix completion model and the corresponding algorithms compare favorably in accuracy and efficiency with their lower-order tensor and conventional matrix counterparts.
Abstract
To improve the accuracy of color image completion with missing entries, we present a recovery method based on generalized higher-order scalars. We extend the traditional second-order matrix model to a more comprehensive higher-order matrix equivalent, called the "t-matrix" model, which incorporates a pixel neighborhood expansion strategy to characterize the local pixel constraints. This "t-matrix" model is then used to extend some commonly used matrix and tensor completion algorithms to their higher-order versions. We perform extensive experiments on various algorithms using simulated data and publicly available images and compare their performance. The results show that our generalized matrix completion model and the corresponding algorithm compare favorably with their lower-order tensor and conventional matrix counterparts.
Intensity-free Integral-based Learning of Marked Temporal Point Processes
results: The authors evaluate IFIB through experiments on real-world and synthetic datasets; the results show that IFIB performs better in practical applications and captures the properties and trends of events more faithfully. The code is also available on GitHub.
Abstract
In the marked temporal point processes (MTPP), a core problem is to parameterize the conditional joint PDF (probability distribution function) $p^*(m,t)$ for inter-event time $t$ and mark $m$, conditioned on the history. The majority of existing studies predefine intensity functions. Their utility is challenged by specifying the intensity function's proper form, which is critical to balance expressiveness and processing efficiency. Recently, some studies have moved away from predefining the intensity function -- one models $p^*(t)$ and $p^*(m)$ separately, while the other focuses on temporal point processes (TPPs), which do not consider marks. This study aims to develop high-fidelity $p^*(m,t)$ for discrete events where the event marks are either categorical or numeric in a multi-dimensional continuous space. We propose a solution framework IFIB (\underline{I}ntensity-\underline{f}ree \underline{I}ntegral-\underline{b}ased process) that models the conditional joint PDF $p^*(m,t)$ directly without intensity functions. This remarkably simplifies the process of enforcing the essential mathematical restrictions. We show the desired properties of IFIB and the superior experimental results of IFIB on real-world and synthetic datasets. The code is available at \url{https://github.com/StepinSilence/IFIB}.
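For orientation, the textbook link between an intensity-based parameterization and the density IFIB targets directly is the following identity (a standard MTPP fact for categorical marks, not the paper's own derivation); intensity-based methods must evaluate the integral inside the exponential, which is exactly what an intensity-free, integral-based model sidesteps:

```latex
p^*(m, t) = \lambda^*(m, t)\,
  \exp\!\left( - \int_{t_{i-1}}^{t} \sum_{m'} \lambda^*(m', s)\, \mathrm{d}s \right),
```

where $\lambda^*(m,t)$ is the conditional intensity and $t_{i-1}$ is the last observed event time; for continuous marks the sum over $m'$ becomes an integral.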
paper_authors: Saipraneeth Devunuri, Shirin Qiam, Lewis Lehe
for: This research aims to determine if current large language models (LLMs) can retrieve information from the General Transit Feed Specification (GTFS) using natural language instructions.
methods: The research uses ChatGPT (GPT-3.5) to test its understanding of the GTFS specification and to perform information extraction from a filtered GTFS feed with 4 routes. The research compares zero-shot and program synthesis for information retrieval.
results: GPT-3.5 answers 77% of multiple-choice questions correctly, and program synthesis achieves ~90% accuracy on simple questions and ~40% accuracy on complex questions for information retrieval.
Abstract
The General Transit Feed Specification (GTFS) standard for publishing transit data is ubiquitous. GTFS being tabular data, with information spread across different files, necessitates specialized tools or packages to retrieve information. Concurrently, the use of Large Language Models for text and information retrieval is growing. The idea of this research is to see if the current widely adopted LLMs (ChatGPT) are able to retrieve information from GTFS using natural language instructions. We first test whether ChatGPT (GPT-3.5) understands the GTFS specification. GPT-3.5 answers 77% of our multiple-choice questions (MCQ) correctly. Next, we task the LLM with information extraction from a filtered GTFS feed with 4 routes. For information retrieval, we compare zero-shot prompting and program synthesis. Program synthesis works better, achieving ~90% accuracy on simple questions and ~40% accuracy on complex questions.
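To make the program-synthesis condition concrete, the snippet below is the kind of program an LLM would be asked to produce for a retrieval question such as "how many trips does each route run?". File and column names follow the public GTFS specification; the snippet is illustrative, not taken from the paper.

```python
# Example of a program-synthesis style answer over a GTFS feed:
# count the trips per route using the spec's routes.txt and trips.txt.
import pandas as pd

routes = pd.read_csv("routes.txt")   # route_id, route_short_name, ...
trips = pd.read_csv("trips.txt")     # trip_id, route_id, service_id, ...

trips_per_route = (
    trips.merge(routes[["route_id", "route_short_name"]], on="route_id")
         .groupby("route_short_name")["trip_id"]
         .count()
)
print(trips_per_route)
```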
Multi-attacks: Many images $+$ the same adversarial attack $\to$ many target labels
results: The study finds that a single adversarial perturbation, scaled appropriately, can change the classes of multiple images at once, across different images and target classes; it also finds that ensembling reduces susceptibility to such multi-attacks.
Abstract
We show that we can easily design a single adversarial perturbation $P$ that changes the class of $n$ images $X_1,X_2,\dots,X_n$ from their original, unperturbed classes $c_1, c_2,\dots,c_n$ to desired (not necessarily all the same) classes $c^*_1,c^*_2,\dots,c^*_n$ for up to hundreds of images and target classes at once. We call these \textit{multi-attacks}. Characterizing the maximum $n$ we can achieve under different conditions such as image resolution, we estimate the number of regions of high class confidence around a particular image in the space of pixels to be around $10^{\mathcal{O}(100)}$, posing a significant problem for exhaustive defense strategies. We show several immediate consequences of this: adversarial attacks that change the resulting class based on their intensity, and scale-independent adversarial examples. To demonstrate the redundancy and richness of class decision boundaries in the pixel space, we look for its two-dimensional sections that trace images and spell words using particular classes. We also show that ensembling reduces susceptibility to multi-attacks, and that classifiers trained on random labels are more susceptible. Our code is available on GitHub.
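A multi-attack of this kind reduces to optimizing one shared perturbation against the sum of per-image target losses. The sketch below is our assumed reading of that setup; the $\ell_\infty$ clamp and the optimizer are illustrative choices, not the paper's exact procedure.

```python
# Hedged sketch: one perturbation P drives n images to n chosen target labels.
import torch
import torch.nn.functional as F

def multi_attack(model, images, targets, steps=200, lr=0.01, eps=8 / 255):
    P = torch.zeros_like(images[0], requires_grad=True)   # single shared P
    opt = torch.optim.Adam([P], lr=lr)
    for _ in range(steps):
        logits = model((images + P).clamp(0, 1))          # P broadcasts over batch
        loss = F.cross_entropy(logits, targets)           # push each image to its target
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            P.clamp_(-eps, eps)                           # keep the perturbation small
    return P.detach()
```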
Adapting to Change: Robust Counterfactual Explanations in Dynamic Data Landscapes
paper_authors: Bardh Prenkaj, Mario Villaizan-Vallelado, Tobias Leemann, Gjergji Kasneci
for: This paper proposes a novel semi-supervised method for counterfactual explanation, called Dynamic GRAph Counterfactual Explainer (DyGRACE), which can be used to identify counterfactuals in graph-structured data.
methods: DyGRACE uses two graph autoencoders (GAEs) to learn the representation of each class in a binary classification scenario, and optimizes a parametric density function (implemented as a logistic regression function) to identify counterfactuals by maximizing the factual autoencoder’s reconstruction error.
results: The paper shows that DyGRACE is effective in identifying counterfactuals and can act as a drift detector, identifying distributional drift based on differences in reconstruction errors between iterations. It also avoids reliance on the oracle's predictions in successive iterations, increasing the efficiency of counterfactual discovery.
Abstract
We introduce a novel semi-supervised Graph Counterfactual Explainer (GCE) methodology, Dynamic GRAph Counterfactual Explainer (DyGRACE). It leverages initial knowledge about the data distribution to search for valid counterfactuals while avoiding using information from potentially outdated decision functions in subsequent time steps. Employing two graph autoencoders (GAEs), DyGRACE learns the representation of each class in a binary classification scenario. The GAEs minimise the reconstruction error between the original graph and its learned representation during training. The method involves (i) optimising a parametric density function (implemented as a logistic regression function) to identify counterfactuals by maximising the factual autoencoder's reconstruction error, (ii) minimising the counterfactual autoencoder's error, and (iii) maximising the similarity between the factual and counterfactual graphs. This semi-supervised approach is independent of an underlying black-box oracle. A logistic regression model is trained on a set of graph pairs to learn weights that aid in finding counterfactuals. At inference, for each unseen graph, the logistic regressor identifies the best counterfactual candidate using these learned weights, while the GAEs can be iteratively updated to represent the continual adaptation of the learned graph representation over iterations. DyGRACE is quite effective and can act as a drift detector, identifying distributional drift based on differences in reconstruction errors between iterations. It avoids reliance on the oracle's predictions in successive iterations, thereby increasing the efficiency of counterfactual discovery. DyGRACE, with its capacity for contrastive learning and drift detection, will offer new avenues for semi-supervised learning and explanation generation.
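Reading the three ingredients above as one objective, a counterfactual candidate $g'$ for a factual graph $g$ would be scored roughly as follows. The nonnegative weights $\alpha,\beta,\gamma$ and the additive form are our assumption; the paper implements the trade-off through a learned logistic regression rather than fixed weights:

```latex
\max_{g'} \;\; \alpha \, \mathcal{L}_{\mathrm{rec}}^{\mathrm{fact}}(g')
  \;-\; \beta \, \mathcal{L}_{\mathrm{rec}}^{\mathrm{cf}}(g')
  \;+\; \gamma \, \mathrm{sim}(g, g'),
```

where $\mathcal{L}_{\mathrm{rec}}^{\mathrm{fact}}$ and $\mathcal{L}_{\mathrm{rec}}^{\mathrm{cf}}$ are the reconstruction errors of the factual-class and counterfactual-class autoencoders.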
RobustMQ: Benchmarking Robustness of Quantized Models
results: The study finds that quantized models exhibit different resistance to different types of noise: they are more robust to adversarial attacks than their floating-point counterparts, but more vulnerable to natural corruptions and systematic noises; and as the quantization bit-width increases, adversarial robustness decreases while natural and systematic robustness increase.
Abstract
Quantization has emerged as an essential technique for deploying deep neural networks (DNNs) on devices with limited resources. However, quantized models exhibit vulnerabilities when exposed to various noises in real-world applications. Despite the importance of evaluating the impact of quantization on robustness, existing research on this topic is limited and often disregards established principles of robustness evaluation, resulting in incomplete and inconclusive findings. To address this gap, we thoroughly evaluated the robustness of quantized models against various noises (adversarial attacks, natural corruptions, and systematic noises) on ImageNet. The comprehensive evaluation results empirically provide valuable insights into the robustness of quantized models in various scenarios, for example: (1) quantized models exhibit higher adversarial robustness than their floating-point counterparts, but are more vulnerable to natural corruptions and systematic noises; (2) in general, increasing the quantization bit-width results in a decrease in adversarial robustness, an increase in natural robustness, and an increase in systematic robustness; (3) among corruption methods, \textit{impulse noise} and \textit{glass blur} are the most harmful to quantized models, while \textit{brightness} has the least impact; (4) among systematic noises, the \textit{nearest neighbor interpolation} has the highest impact, while bilinear interpolation, cubic interpolation, and area interpolation are the three least harmful. Our research contributes to advancing the robust quantization of models and their deployment in real-world scenarios.
Vehicles Control: Collision Avoidance using Federated Deep Reinforcement Learning
paper_authors: Badr Ben Elallid, Amine Abouaomar, Nabil Benamar, Abdellatif Kobbane
for: Transportation management and safety issues are becoming increasingly serious in urbanizing societies, making the development of intelligent control systems an urgent need.
methods: This study uses Federated Deep Reinforcement Learning (FDRL) to optimize vehicle control.
results: The FDDPG algorithm controls vehicles more effectively than DDPG, preventing collisions and improving average speed.
Abstract
In the face of growing urban populations and the escalating number of vehicles on the roads, managing transportation efficiently and ensuring safety have become critical challenges. To tackle these issues, the development of intelligent control systems for vehicles is paramount. This paper presents a comprehensive study on vehicle control for collision avoidance, leveraging the power of Federated Deep Reinforcement Learning (FDRL) techniques. Our main goal is to minimize travel delays and enhance the average speed of vehicles while prioritizing safety and preserving data privacy. To accomplish this, we conducted a comparative analysis between the local model, Deep Deterministic Policy Gradient (DDPG), and the global model, Federated Deep Deterministic Policy Gradient (FDDPG), to determine their effectiveness in optimizing vehicle control for collision avoidance. The results obtained indicate that the FDDPG algorithm outperforms DDPG in terms of effectively controlling vehicles and preventing collisions. Significantly, the FDDPG-based algorithm demonstrates substantial reductions in travel delays and notable improvements in average speed compared to the DDPG algorithm.
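The federated step that distinguishes FDDPG from plain DDPG is, in its standard FedAvg form, a parameter average over the vehicles' local actors. A minimal sketch, assuming synchronous rounds and equally weighted clients (the paper may weight or schedule updates differently):

```python
# Minimal FedAvg-style aggregation of local DDPG actor networks.
import copy
import torch

def federated_average(local_actors):
    """Average the parameters (and buffers) of the vehicles' local actors."""
    global_state = copy.deepcopy(local_actors[0].state_dict())
    for key in global_state:
        global_state[key] = torch.stack(
            [actor.state_dict()[key].float() for actor in local_actors]
        ).mean(dim=0)
    return global_state  # broadcast back to every vehicle's local actor
```

Each round, vehicles run local DDPG updates on their own trajectories, send weights to the aggregator, and load the returned state, so raw driving data never leaves the vehicle, which is what preserves data privacy.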
Recurrent Neural Networks with more flexible memory: better predictions than rough volatility
results: Compared with vanilla LSTMs, extended LSTMs need about half as many training epochs, and the variation of validation and test losses among models with the same hyperparameters is much smaller. Moreover, the model with the smallest validation loss systematically outperforms rough volatility predictions by about 20%.
Abstract
We extend recurrent neural networks to include several flexible timescales for each dimension of their output, which mechanically improves their abilities to account for processes with long memory or with highly disparate time scales. We compare the ability of vanilla and extended long short term memory networks (LSTMs) to predict asset price volatility, known to have a long memory. Generally, the number of epochs needed to train extended LSTMs is halved, while the variation of validation and test losses among models with the same hyperparameters is much smaller. We also show that the model with the smallest validation loss systematically outperforms rough volatility predictions by about 20% when trained and tested on a dataset with multiple time series.
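One plausible realization of "several flexible timescales per output dimension" is a leaky cell state with a learnable decay per hidden unit, so different units can forget at very different rates. This is an assumed reading for illustration, not the authors' exact architecture:

```python
# Sketch: an LSTM cell whose memory mixes old and new content with a
# learnable per-unit timescale tau (larger tau -> longer memory).
import torch
import torch.nn as nn

class MultiTimescaleCell(nn.Module):
    def __init__(self, n_in, n_hidden):
        super().__init__()
        self.cell = nn.LSTMCell(n_in, n_hidden)
        # log-spaced initial timescales, one per hidden unit, learned jointly
        self.log_tau = nn.Parameter(torch.linspace(0.0, 4.0, n_hidden))

    def forward(self, x, state):
        h_prev, c_prev = state
        h, c = self.cell(x, (h_prev, c_prev))
        alpha = torch.exp(-1.0 / torch.exp(self.log_tau))  # decay in (0, 1)
        c = alpha * c_prev + (1.0 - alpha) * c             # slow/fast memory mix
        return h, c
```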
Stability and Generalization of Hypergraph Collaborative Networks
results: Experimental results on real-world data indicate that hypergraph collaborative networks have high stability and generalization ability, and can achieve better performance on semi-supervised learning tasks.
Abstract
Graph neural networks have been shown to be very effective in utilizing pairwise relationships across samples. Recently, there have been several successful proposals to generalize graph neural networks to hypergraph neural networks to exploit more complex relationships. In particular, the hypergraph collaborative networks yield superior results compared to other hypergraph neural networks for various semi-supervised learning tasks. The collaborative network can provide high quality vertex embeddings and hyperedge embeddings together by formulating them as a joint optimization problem and by using their consistency in reconstructing the given hypergraph. In this paper, we aim to establish the algorithmic stability of the core layer of the collaborative network and provide generalization guarantees. The analysis sheds light on the design of hypergraph filters in collaborative networks, for instance, how the data and hypergraph filters should be scaled to achieve uniform stability of the learning process. Some experimental results on real-world datasets are presented to illustrate the theory.
Learning Networks from Gaussian Graphical Models and Gaussian Free Fields
for: This paper aims to estimate the structure of a weighted network from repeated measurements of a Gaussian Graphical Model (GGM) on the network.
methods: The paper proposes a novel estimator for the weighted network (equivalently, its Laplacian), based on the Fourier analytic properties of the Gaussian Free Field (GFF).
results: The paper provides concrete recovery guarantees and bounds on the required sample complexity, showing that the estimator achieves the parametric rate of estimation for fixed network size. For Erdos-Renyi random graphs $G(d,p)$ above the connectivity threshold, network recovery takes place with high probability as soon as the sample size $n$ satisfies $n \gg d^4 \log d \cdot p^{-2}$.
Abstract
We investigate the problem of estimating the structure of a weighted network from repeated measurements of a Gaussian Graphical Model (GGM) on the network. In this vein, we consider GGMs whose covariance structures align with the geometry of the weighted network on which they are based. Such GGMs have been of longstanding interest in statistical physics, and are referred to as the Gaussian Free Field (GFF). In recent years, they have attracted considerable interest in the machine learning and theoretical computer science. In this work, we propose a novel estimator for the weighted network (equivalently, its Laplacian) from repeated measurements of a GFF on the network, based on the Fourier analytic properties of the Gaussian distribution. In this pursuit, our approach exploits complex-valued statistics constructed from observed data, that are of interest in their own right. We demonstrate the effectiveness of our estimator with concrete recovery guarantees and bounds on the required sample complexity. In particular, we show that the proposed statistic achieves the parametric rate of estimation for fixed network size. In the setting of networks growing with sample size, our results show that for Erdos-Renyi random graphs $G(d,p)$ above the connectivity threshold, we demonstrate that network recovery takes place with high probability as soon as the sample size $n$ satisfies $n \gg d^4 \log d \cdot p^{-2}$.
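For context, the Fourier-analytic property in question is, in its standard form, the Gaussian characteristic function (the paper's precise complex-valued statistic may refine this):

```latex
\mathbb{E}\!\left[ e^{i \langle \lambda, X \rangle} \right]
  = \exp\!\left( -\tfrac{1}{2}\, \lambda^{\top} \Sigma\, \lambda \right),
\qquad X \sim \mathcal{N}(0, \Sigma),
```

and for a GFF on a weighted graph the covariance $\Sigma$ is the (pseudo-)inverse of the graph Laplacian, so empirical averages of $e^{i\langle\lambda, X\rangle}$ over repeated measurements carry direct information about the Laplacian.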
RAHNet: Retrieval Augmented Hybrid Network for Long-tailed Graph Classification
results: Experiments on several popular benchmarks demonstrate that the proposed method has significant advantages over state-of-the-art approaches.
Abstract
Graph classification is a crucial task in many real-world multimedia applications, where graphs can represent various multimedia data types such as images, videos, and social networks. Previous efforts have applied graph neural networks (GNNs) in settings where the class distribution is balanced. However, real-world data typically exhibit long-tailed class distributions, resulting in a bias towards the head classes when using GNNs and limited generalization ability over the tail classes. Recent approaches mainly focus on re-balancing different classes during model training, which fails to explicitly introduce new knowledge and sacrifices the performance of the head classes. To address these drawbacks, we propose a novel framework called Retrieval Augmented Hybrid Network (RAHNet) to jointly learn a robust feature extractor and an unbiased classifier in a decoupled manner. In the feature extractor training stage, we develop a graph retrieval module to search for relevant graphs that directly enrich the intra-class diversity for the tail classes. Moreover, we innovatively optimize a category-centered supervised contrastive loss to obtain discriminative representations, which is more suitable for long-tailed scenarios. In the classifier fine-tuning stage, we balance the classifier weights with two weight regularization techniques, i.e., Max-norm and weight decay. Experiments on various popular benchmarks verify the superiority of the proposed method against state-of-the-art approaches.
Interoperable synthetic health data with SyntHIR to enable the development of CDSS tools
paper_authors: Pavitra Chauhan, Mohsen Gamal Saad Askar, Bjørn Fjukstad, Lars Ailo Bongo, Edvard Pedersen
for: This paper proposes an architecture, Synthetic Health Information Resources (SyntHIR), for developing Clinical Decision Support System (CDSS) tools, enabling CDSS tools to be developed and tested before implementation in a clinical workflow.
methods: The paper uses the Fast Healthcare Interoperability Resources (FHIR) standards for data interoperability, the Gretel framework for generating synthetic data, the Microsoft Azure FHIR server as the FHIR-based electronic health record (EHR) system, and the SMART on FHIR framework for tool transportability.
results: The authors developed a machine learning-based CDSS tool using data from the Norwegian Patient Register (NPR) and Norwegian Patient Prescriptions (NorPD), demonstrated it on the SyntHIR system, and then lifted it to the Open DIPS environment. They conclude that SyntHIR provides a generic architecture for CDSS tool development using synthetic FHIR data together with a testing environment, although there is scope for improving the quality of the generated synthetic data. The code is available at https://github.com/potter-coder89/SyntHIR.git.
Abstract
There is a great opportunity to use high-quality patient journals and health registers to develop machine learning-based Clinical Decision Support Systems (CDSS). To implement a CDSS tool in a clinical workflow, there is a need to integrate, validate and test this tool on the Electronic Health Record (EHR) systems used to store and manage patient data. However, it is often not possible to get the necessary access to an EHR system due to legal compliance. We propose an architecture for generating and using synthetic EHR data for CDSS tool development. The architecture is implemented in a system called SyntHIR. The SyntHIR system uses the Fast Healthcare Interoperability Resources (FHIR) standards for data interoperability, the Gretel framework for generating synthetic data, the Microsoft Azure FHIR server as the FHIR-based EHR system and SMART on FHIR framework for tool transportability. We demonstrate the usefulness of SyntHIR by developing a machine learning-based CDSS tool using data from the Norwegian Patient Register (NPR) and Norwegian Patient Prescriptions (NorPD). We demonstrate the development of the tool on the SyntHIR system and then lift it to the Open DIPS environment. In conclusion, SyntHIR provides a generic architecture for CDSS tool development using synthetic FHIR data and a testing environment before implementing it in a clinical setting. However, there is scope for improvement in terms of the quality of the synthetic data generated. The code is open source and available at https://github.com/potter-coder89/SyntHIR.git.
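As a flavor of what FHIR-based interoperability buys, any tool built this way can read synthetic patients through the standard FHIR REST API. The base URL below is a placeholder, not the SyntHIR deployment:

```python
# Hedged sketch: reading synthetic Patient resources from a FHIR server.
import requests

base = "https://example.org/fhir"            # hypothetical FHIR base URL
bundle = requests.get(f"{base}/Patient", params={"_count": 10}).json()
for entry in bundle.get("entry", []):
    patient = entry["resource"]              # a FHIR Patient resource
    print(patient["id"], patient.get("gender"), patient.get("birthDate"))
```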
Deep learning for spike detection in deep brain stimulation surgery
paper_authors: Arkadiusz Nowacki, Ewelina Kołpa, Mateusz Szychiewicz, Konrad Ciecierski
for: This paper describes a deep learning-based method for analyzing recordings of neuronal activity acquired during deep brain stimulation (DBS) neurosurgery, a procedure used to treat Parkinson's disease.
methods: The method uses a convolutional neural network (CNN) to assess, within a time window, whether neuronal activity (a spike) is present.
results: Experiments show a maximum classifier accuracy of 98.98% and an area under the receiver operating characteristic curve (AUC) of 0.9898.
Abstract
Deep brain stimulation (DBS) is a neurosurgical procedure successfully used to treat conditions such as Parkinson's disease. Electrostimulation, carried out by implanting electrodes into an identified focus in the brain, makes it possible to reduce the symptoms of the disease significantly. In this paper, a method for analyzing recordings of neuronal activity acquired during DBS neurosurgery using deep learning is presented. We tested using a convolutional neural network (CNN) for this purpose. Based on the time window, the classifier assesses whether neuronal activity (spike) is present. The maximum accuracy value for the classifier was 98.98%, and the area under the receiver operating characteristic curve (AUC) was 0.9898. The method made it possible to obtain a classification without using data preprocessing.
A stochastic optimization approach to train non-linear neural networks with a higher-order variation regularization
for: This study aims to address the issue of overfitting in highly expressive parametric models, such as deep neural networks, by introducing a $(k,q)$th order variation regularization ($(k,q)$-VR).
results: Numerical experiments demonstrate that neural networks trained with the $(k,q)$-VR are more resilient than those trained with conventional parameter regularization. The approach can also be applied to the physics-informed training of neural networks (PINNs).
Abstract
While highly expressive parametric models including deep neural networks have an advantage to model complicated concepts, training such highly non-linear models is known to yield a high risk of notorious overfitting. To address this issue, this study considers a $(k,q)$th order variation regularization ($(k,q)$-VR), which is defined as the $q$th-powered integral of the absolute $k$th order derivative of the parametric models to be trained; penalizing the $(k,q)$-VR is expected to yield a smoother function, which helps avoid overfitting. Particularly, $(k,q)$-VR encompasses the conventional (general-order) total variation with $q=1$. While the $(k,q)$-VR terms applied to general parametric models are computationally intractable due to the integration, this study provides a stochastic optimization algorithm that can efficiently train general models with the $(k,q)$-VR without conducting explicit numerical integration. The proposed approach can be applied to the training of even deep neural networks whose structure is arbitrary, as it can be implemented by only a simple stochastic gradient descent algorithm and automatic differentiation. Our numerical experiments demonstrate that the neural networks trained with the $(k,q)$-VR terms are more ``resilient'' than those with the conventional parameter regularization. The proposed algorithm also can be extended to the physics-informed training of neural networks (PINNs).
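Written out from the definition in the abstract, and with a Monte-Carlo form that matches "stochastic optimization without explicit numerical integration" (the uniform sampling distribution is our assumption):

```latex
R_{k,q}(f_\theta)
  = \int_{\Omega} \bigl\| \partial_x^{(k)} f_\theta(x) \bigr\|^{q} \,\mathrm{d}x
  \;\approx\; \frac{|\Omega|}{B} \sum_{b=1}^{B}
     \bigl\| \partial_x^{(k)} f_\theta(x_b) \bigr\|^{q},
\qquad x_b \sim \mathrm{Unif}(\Omega),
```

where the $k$th derivatives at the sampled points are obtained by automatic differentiation, so each training step pays only for $B$ extra derivative evaluations rather than a full numerical quadrature.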
Frustratingly Easy Model Generalization by Dummy Risk Minimization
results: Across a variety of tasks, including conventional classification, semantic segmentation, out-of-distribution generalization, adversarial training, and long-tailed recognition, DuRM consistently improves model performance and is compatible with existing generalization techniques.
Abstract
Empirical risk minimization (ERM) is a fundamental machine learning paradigm. However, its generalization ability is limited in various tasks. In this paper, we devise Dummy Risk Minimization (DuRM), a frustratingly easy and general technique to improve the generalization of ERM. DuRM is extremely simple to implement: just enlarging the dimension of the output logits and then optimizing using standard gradient descent. Moreover, we validate the efficacy of DuRM on both theoretical and empirical analysis. Theoretically, we show that DuRM derives greater variance of the gradient, which facilitates model generalization by observing better flat local minima. Empirically, we conduct evaluations of DuRM across different datasets, modalities, and network architectures on diverse tasks, including conventional classification, semantic segmentation, out-of-distribution generalization, adversarial training, and long-tailed recognition. Results demonstrate that DuRM could consistently improve the performance under all tasks with an almost free lunch manner. Furthermore, we show that DuRM is compatible with existing generalization techniques and we discuss possible limitations. We hope that DuRM could trigger new interest in the fundamental research on risk minimization.
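Since the recipe is literally "enlarge the output logits and train as usual", a sketch is almost trivial; the input size and the number of dummy logits below are arbitrary illustrative choices:

```python
# Minimal DuRM-style setup: a classifier head with extra dummy logits.
import torch.nn as nn

num_classes, num_dummy = 10, 5
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, num_classes + num_dummy),  # enlarged logit dimension
)
# Train with standard cross-entropy against the true labels (0..num_classes-1);
# the dummy logits never appear as targets, they only reshape the gradients.
criterion = nn.CrossEntropyLoss()
```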
DIVERSIFY: A General Framework for Time Series Out-of-distribution Detection and Generalization
results: Experimental results show that DIVERSIFY achieves better out-of-distribution detection and generalization on time series data, outperforming other baselines.
Abstract
Time series remains one of the most challenging modalities in machine learning research. Out-of-distribution (OOD) detection and generalization on time series tend to suffer due to its non-stationary property, i.e., the distribution changes over time. The dynamic distributions inside time series pose great challenges for existing algorithms to identify invariant distributions, since these algorithms mainly focus on the scenario where the domain information is given as prior knowledge. In this paper, we attempt to exploit subdomains within a whole dataset to counteract issues induced by non-stationarity for generalized representation learning. We propose DIVERSIFY, a general framework, for OOD detection and generalization on dynamic distributions of time series. DIVERSIFY takes an iterative process: it first obtains the "worst-case" latent distribution scenario via adversarial training, then reduces the gap between these latent distributions. We implement DIVERSIFY via combining existing OOD detection methods according to either extracted features or outputs of models for detection while we also directly utilize outputs for classification. In addition, theoretical insights illustrate that DIVERSIFY is theoretically supported. Extensive experiments are conducted on seven datasets with different OOD settings across gesture recognition, speech commands recognition, wearable stress and affect detection, and sensor-based human activity recognition. Qualitative and quantitative results demonstrate that DIVERSIFY learns more generalized features and significantly outperforms other baselines.
Adaptive Proximal Gradient Method for Convex Optimization
results: The proposed methods allow for even larger stepsizes than those initially suggested in [MM20].
Abstract
In this paper, we explore two fundamental first-order algorithms in convex optimization, namely, gradient descent (GD) and proximal gradient method (ProxGD). Our focus is on making these algorithms entirely adaptive by leveraging local curvature information of smooth functions. We propose adaptive versions of GD and ProxGD that are based on observed gradient differences and, thus, have no added computational costs. Moreover, we prove convergence of our methods assuming only local Lipschitzness of the gradient. In addition, the proposed versions allow for even larger stepsizes than those initially suggested in [MM20].
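The local-curvature idea can be sketched with the [MM20]-style stepsize built from observed gradient differences; the exact constants and safeguards of the proposed variants differ, so treat this as a baseline illustration rather than the paper's algorithm:

```python
# Adaptive gradient descent with a stepsize estimated from gradient differences.
import numpy as np

def adaptive_gd(grad, x, steps=100, lam=1e-6):
    theta = np.inf                      # ratio of consecutive stepsizes
    x_prev, g_prev = x.copy(), grad(x)
    x = x - lam * g_prev
    for _ in range(steps):
        g = grad(x)
        # local Lipschitz estimate from the last two iterates
        L_loc = np.linalg.norm(g - g_prev) / (np.linalg.norm(x - x_prev) + 1e-12)
        lam_new = min(np.sqrt(1.0 + theta) * lam, 1.0 / (2.0 * L_loc + 1e-12))
        theta = lam_new / lam
        lam, x_prev, g_prev = lam_new, x.copy(), g
        x = x - lam * g
    return x
```

Because the stepsize adapts to the local Lipschitz constant observed along the trajectory, no global smoothness constant needs to be known in advance.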
Finding Tori: Self-supervised Learning for Analyzing Korean Folk Song
results: Experimental results show that the approach captures the characteristics of tori in Korean folk songs better than traditional pitch histograms.
Abstract
In this paper, we introduce a computational analysis of a field recording dataset of approximately 700 hours of Korean folk songs, recorded around the 1980s-90s. Because most of the songs were sung by non-expert musicians without accompaniment, the dataset poses several challenges. To address these challenges, we utilized self-supervised learning with a convolutional neural network based on pitch contour, then analyzed how the musical concept of tori, a classification system defined by a specific scale, ornamental notes, and an idiomatic melodic contour, is captured by the model. The experimental results show that our approach can better capture the characteristics of tori compared to traditional pitch histograms. Using our approach, we have examined how musical discussions proposed in existing academia manifest in the actual field recordings of Korean folk songs.
Deep neural networks from the perspective of ergodic theory
results: The paper shows that, from this perspective, some rules of thumb that might otherwise appear mysterious can be given heuristic justification.
Abstract
The design of deep neural networks remains somewhat of an art rather than precise science. By tentatively adopting ergodic theory considerations on top of viewing the network as the time evolution of a dynamical system, with each layer corresponding to a temporal instance, we show that some rules of thumb, which might otherwise appear mysterious, can be given heuristic justification.
Self-Normalizing Neural Network, Enabling One Shot Transfer Learning for Modeling EDFA Wavelength Dependent Gain
paper_authors: Agastya Raj, Zehao Wang, Frank Slyne, Tingjun Chen, Dan Kilper, Marco Ruffini
for: This paper presents an ML framework for modeling the wavelength-dependent gain of multiple EDFAs.
methods: The framework uses semi-supervised, self-normalizing neural networks, enabling one-shot transfer learning.
results: Experiments show high-accuracy transfer learning even across different amplifier types.
Abstract
We present a novel ML framework for modeling the wavelength-dependent gain of multiple EDFAs, based on semi-supervised, self-normalizing neural networks, enabling one-shot transfer learning. Our experiments on 22 EDFAs in Open Ireland and COSMOS testbeds show high-accuracy transfer-learning even when operated across different amplifier types.
Likelihood-ratio-based confidence intervals for neural networks
results: The paper notes that although likelihood-ratio-based uncertainty estimates are currently expensive to compute, the cost may already be justified in fields such as medical prediction and astrophysics, where they may offer advantages over other methods; this establishes a promising avenue for future research.
Abstract
This paper introduces a first implementation of a novel likelihood-ratio-based approach for constructing confidence intervals for neural networks. Our method, called DeepLR, offers several qualitative advantages: most notably, the ability to construct asymmetric intervals that expand in regions with a limited amount of data, and the inherent incorporation of factors such as the amount of training time, network architecture, and regularization techniques. While acknowledging that the current implementation of the method is prohibitively expensive for many deep-learning applications, the high cost may already be justified in specific fields like medical predictions or astrophysics, where a reliable uncertainty estimate for a single prediction is essential. This work highlights the significant potential of a likelihood-ratio-based uncertainty estimate and establishes a promising avenue for future research.
Knowledge-Driven Multi-Agent Reinforcement Learning for Computation Offloading in Cybertwin-Enabled Internet of Vehicles
results: Compared with other methods, the KMARL approach obtains higher rewards and better scalability, benefiting from the integration of domain knowledge.
Abstract
By offloading computation-intensive tasks of vehicles to roadside units (RSUs), mobile edge computing (MEC) in the Internet of Vehicles (IoV) can relieve the onboard computation burden. However, existing model-based task offloading methods suffer from heavy computational complexity as the number of vehicles increases, and data-driven methods lack interpretability. To address these challenges, in this paper, we propose a knowledge-driven multi-agent reinforcement learning (KMARL) approach to reduce the latency of task offloading in cybertwin-enabled IoV. Specifically, in the considered scenario, the cybertwin serves as a communication agent for each vehicle to exchange information and make offloading decisions in the virtual space. To reduce the latency of task offloading, a KMARL approach is proposed to select the optimal offloading option for each vehicle, where graph neural networks are employed to embed domain knowledge concerning the graph-structured communication topology and permutation invariance into the neural networks. Numerical results show that our proposed KMARL yields higher rewards and demonstrates improved scalability compared with other methods, benefitting from the integration of domain knowledge.
paper_authors: Guillem García Subies, Álvaro Barbero Jiménez, Paloma Martínez Fernández
for: This survey focuses on using language models to solve tasks in the clinical domain in the Spanish language.
methods: The survey reviews 17 corpora, focused mainly on clinical tasks, and lists the most relevant Spanish language models and Spanish clinical language models.
results: The authors perform a thorough comparison of these models by benchmarking them over a curated subset of the available corpora, in order to find the best-performing ones.
Abstract
This survey focuses on encoder language models for solving tasks in the clinical domain in the Spanish language. We review the contributions of 17 corpora focused mainly on clinical tasks, then list the most relevant Spanish language models and Spanish clinical language models. We perform a thorough comparison of these models by benchmarking them over a curated subset of the available corpora, in order to find the best-performing ones; in total more than 3000 models were fine-tuned for this study. All the tested corpora and the best models are made publicly available in an accessible way, so that the results can be reproduced by independent teams or challenged in the future when new Spanish clinical language models are created.
AutoML4ETC: Automated Neural Architecture Search for Real-World Encrypted Traffic Classification
results: Experiments show that AutoML4ETC generates high-performing encrypted traffic classifiers that are more accurate than state-of-the-art classifiers, while also being simpler, with significantly fewer parameters.
Abstract
Deep learning (DL) has been successfully applied to encrypted network traffic classification in experimental settings. However, in production use, it has been shown that a DL classifier's performance inevitably decays over time. Re-training the model on newer datasets has been shown to only partially improve its performance. Manually re-tuning the model architecture to meet the performance expectations on newer datasets is time-consuming and requires domain expertise. We propose AutoML4ETC, a novel tool to automatically design efficient and high-performing neural architectures for encrypted traffic classification. We define a novel, powerful search space tailored specifically for the near real-time classification of encrypted traffic using packet header bytes. We show that with different search strategies over our search space, AutoML4ETC generates neural architectures that outperform the state-of-the-art encrypted traffic classifiers on several datasets, including public benchmark datasets and real-world TLS and QUIC traffic collected from the Orange mobile network. In addition to being more accurate, AutoML4ETC's architectures are significantly more efficient and lighter in terms of the number of parameters. Finally, we make AutoML4ETC publicly available for future research.
Scaling Clinical Trial Matching Using Large Language Models: A Case Study in Oncology
results: The study shows that out-of-the-box LLMs can substantially improve the efficiency of clinical trial matching, already outperforming prior strong baselines, though they still require humans in the loop to ensure matching accuracy; it also identifies growth areas for applying LLMs to end-to-end clinical trial matching, such as context limitation and accuracy, especially in structuring patient information from longitudinal medical records.
Abstract
Clinical trial matching is a key process in health delivery and discovery. In practice, it is plagued by overwhelming unstructured data and unscalable manual processing. In this paper, we conduct a systematic study on scaling clinical trial matching using large language models (LLMs), with oncology as the focus area. Our study is grounded in a clinical trial matching system currently in test deployment at a large U.S. health network. Initial findings are promising: out of the box, cutting-edge LLMs, such as GPT-4, can already structure elaborate eligibility criteria of clinical trials and extract complex matching logic (e.g., nested AND/OR/NOT). While still far from perfect, LLMs substantially outperform prior strong baselines and may serve as a preliminary solution to help triage patient-trial candidates with humans in the loop. Our study also reveals a few significant growth areas for applying LLMs to end-to-end clinical trial matching, such as context limitation and accuracy, especially in structuring patient information from longitudinal medical records.
High-Accuracy Prediction of Metal-Insulator-Metal Metasurface with Deep Learning
paper_authors: Kaizhu Liu, Hsiang-Chen Chui, Changsen Sun, Xue Han
for: Predicting the results of electromagnetic software calculations.
methods: A ResNets-10 model is used to predict the S11 parameters of plasmonic metasurfaces.
results: The prediction losses for aluminum, gold, and silver metal-insulator-metal metasurfaces are -48.45, -46.47, and -35.54, respectively, indicating that the proposed network can replace traditional electromagnetic computation methods within a certain structural range; moreover, training completes in fewer than 1,100 epochs.
Abstract
Deep learning prediction of electromagnetic software calculation results has been a widely discussed issue in recent years, but prediction accuracy remains one of the challenges to be solved. In this work, we propose using a ResNets-10 model to predict plasmonic metasurface S11 parameters. Two-stage training was performed with k-fold cross-validation and a small learning rate. After training was completed, the prediction losses for aluminum, gold, and silver metal-insulator-metal metasurfaces were -48.45, -46.47, and -35.54, respectively. Due to the ultralow error value, the proposed network can replace the traditional electromagnetic computing method for calculation within a certain structural range. Besides, the network can finish the training process in fewer than 1,100 epochs, which means that the training process can effectively lower the design process time. The ResNets-10 model we propose can also be used to design meta-diffractive devices and biosensors, thereby reducing the time required for the calculation process. The ultralow error of the network indicates that this work contributes to the development of future artificial intelligence electromagnetic computing software.
results: The DP-CDVAE model can reconstruct and generate crystal structures whose quality is statistically comparable to that of the original CDVAE. More importantly, when comparing the generated carbon structures with relaxed ground-state structures obtained from density functional theory calculations, the DP-CDVAE structures are remarkably closer to their respective ground states, with energy differences that are on average 68.1 meV/atom lower than those of the original CDVAE, indicating that DP-CDVAE better generates structures representing their ground-state configurations.
Abstract
The crystal diffusion variational autoencoder (CDVAE) is a machine learning model that leverages score matching to generate realistic crystal structures that preserve crystal symmetry. In this study, we leverage novel diffusion probabilistic (DP) models to denoise atomic coordinates rather than adopting the standard score matching approach in CDVAE. Our proposed DP-CDVAE model can reconstruct and generate crystal structures whose qualities are statistically comparable to those of the original CDVAE. Furthermore, notably, when comparing the carbon structures generated by the DP-CDVAE model with relaxed structures obtained from density functional theory calculations, we find that the DP-CDVAE generated structures are remarkably closer to their respective ground states. The energy differences between these structures and the true ground states are, on average, 68.1 meV/atom lower than those generated by the original CDVAE. This significant improvement in the energy accuracy highlights the effectiveness of the DP-CDVAE model in generating crystal structures that better represent their ground-state configurations.
Speaker Diarization of Scripted Audiovisual Content
results: The proposed approach achieves a 51.7% relative improvement over two unsupervised baseline models.
Abstract
The media localization industry usually requires a verbatim script of the final film or TV production in order to create subtitles or dubbing scripts in a foreign language. In particular, the verbatim script (i.e. as-broadcast script) must be structured into a sequence of dialogue lines each including time codes, speaker name and transcript. Current speech recognition technology alleviates the transcription step. However, state-of-the-art speaker diarization models still fall short on TV shows for two main reasons: (i) their inability to track a large number of speakers, (ii) their low accuracy in detecting frequent speaker changes. To mitigate this problem, we present a novel approach to leverage production scripts used during the shooting process, to extract pseudo-labeled data for the speaker diarization task. We propose a novel semi-supervised approach and demonstrate improvements of 51.7% relative to two unsupervised baseline models on our metrics on a 66 show test set.
Improved Order Analysis and Design of Exponential Integrator for Diffusion Models Sampling
results: Compared with existing high-order exponential-integrator samplers, the proposed RES solver improves sampling quality and stability and avoids the pitfalls caused by seemingly innocuous design choices such as timestep schedules. In practice it delivers higher sampling efficiency and quality; for example, on a pre-trained ImageNet diffusion model, switching from single-step DPM-Solver++ to the order-satisfied RES solver at NFE = 9 reduces numerical defects by 25.2% and improves FID by 25.4% (16.77 vs 12.51).
Abstract
Efficient differential equation solvers have significantly reduced the sampling time of diffusion models (DMs) while retaining high sampling quality. Among these solvers, exponential integrators (EI) have gained prominence by demonstrating state-of-the-art performance. However, existing high-order EI-based sampling algorithms rely on degenerate EI solvers, resulting in inferior error bounds and reduced accuracy in contrast to the theoretically anticipated results under optimal settings. This situation makes the sampling quality extremely vulnerable to seemingly innocuous design choices such as timestep schedules. For example, an inefficient timestep scheduler might necessitate twice the number of steps to achieve a quality comparable to that obtained through carefully optimized timesteps. To address this issue, we reevaluate the design of high-order differential solvers for DMs. Through a thorough order analysis, we reveal that the degeneration of existing high-order EI solvers can be attributed to the absence of essential order conditions. By reformulating the differential equations in DMs and capitalizing on the theory of exponential integrators, we propose refined EI solvers that fulfill all the order conditions, which we designate as Refined Exponential Solver (RES). Utilizing these improved solvers, RES exhibits more favorable error bounds theoretically and achieves superior sampling efficiency and stability in practical applications. For instance, a simple switch from the single-step DPM-Solver++ to our order-satisfied RES solver when Number of Function Evaluations (NFE) $=9$, results in a reduction of numerical defects by $25.2\%$ and FID improvement of $25.4\%$ (16.77 vs 12.51) on a pre-trained ImageNet diffusion model.
Optimization on Pareto sets: On a theory of multi-objective optimization
methods: The paper develops local methods for solving this constrained optimization problem, whose constraint set is implicitly defined and generally non-convex and non-smooth, even when the objectives themselves are convex and smooth.
results: The paper provides an algorithm with a last-iterate convergence rate of $O(K^{-1/2})$ to stationarity when the objectives are strongly convex and Lipschitz smooth.
Abstract
In multi-objective optimization, a single decision vector must balance the trade-offs between many objectives. Solutions achieving an optimal trade-off are said to be Pareto optimal: these are decision vectors for which improving any one objective must come at a cost to another. But as the set of Pareto optimal vectors can be very large, we further consider a more practically significant Pareto-constrained optimization problem, where the goal is to optimize a preference function constrained to the Pareto set. We investigate local methods for solving this constrained optimization problem, which poses significant challenges because the constraint set is (i) implicitly defined, and (ii) generally non-convex and non-smooth, even when the objectives are. We define notions of optimality and stationarity, and provide an algorithm with a last-iterate convergence rate of $O(K^{-1/2})$ to stationarity when the objectives are strongly convex and Lipschitz smooth.
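In symbols, the Pareto-constrained problem studied here is (a standard statement of the setup, with $h$ the preference function and $f_1,\dots,f_m$ the objectives):

```latex
\min_{x \in \mathcal{P}} \; h(x),
\qquad
\mathcal{P} = \bigl\{ x : \nexists\, x' \text{ s.t. }
  f_i(x') \le f_i(x)\ \forall i
  \text{ and } f_j(x') < f_j(x) \text{ for some } j \bigr\},
```

which makes explicit why the feasible set is only implicitly defined: membership in $\mathcal{P}$ is a statement about all competing points $x'$, not a closed-form constraint on $x$.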
Event-based Dynamic Graph Representation Learning for Patent Application Trend Prediction
paper_authors: Tao Zou, Le Yu, Leilei Sun, Bowen Du, Deqing Wang, Fuzhen Zhuang
for: Predicting which patents a company will apply for in future periods reveals its development strategy and helps discover prospective partners or competitors in advance.
methods: Patent application trend prediction is achieved through memorable representations of companies and patent classification codes together with a hierarchical message passing mechanism.
results: Under various experimental conditions, the method effectively predicts patent application trends, while also learning the semantics of classification codes and tracking companies' technology development trajectories.
Abstract:
Accurate prediction of what types of patents that companies will apply for in the next period of time can figure out their development strategies and help them discover potential partners or competitors in advance. Although important, this problem has been rarely studied in previous research due to the challenges in modelling companies' continuously evolving preferences and capturing the semantic correlations of classification codes. To fill in this gap, we propose an event-based dynamic graph learning framework for patent application trend prediction. In particular, our method is founded on the memorable representations of both companies and patent classification codes. When a new patent is observed, the representations of the related companies and classification codes are updated according to the historical memories and the currently encoded messages. Moreover, a hierarchical message passing mechanism is provided to capture the semantic proximities of patent classification codes by updating their representations along the hierarchical taxonomy. Finally, the patent application trend is predicted by aggregating the representations of the target company and classification codes from static, dynamic, and hierarchical perspectives. Experiments on real-world data demonstrate the effectiveness of our approach under various experimental conditions, and also reveal the abilities of our method in learning semantics of classification codes and tracking technology developing trajectories of companies.
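A hedged sketch of the event-driven memory update the abstract describes: when a new patent event arrives, the memories of the related company and classification code are refreshed from the encoded message. The GRU cell, dimensions, and scoring are illustrative assumptions, not the paper's released model.

```python
import torch
import torch.nn as nn

MEM_DIM, MSG_DIM = 64, 32
update_cell = nn.GRUCell(input_size=MSG_DIM, hidden_size=MEM_DIM)

company_memory = torch.zeros(1, MEM_DIM)  # historical memory of one company
code_memory = torch.zeros(1, MEM_DIM)     # historical memory of one classification code

event_message = torch.randn(1, MSG_DIM)   # encoding of the newly observed patent event
company_memory = update_cell(event_message, company_memory)
code_memory = update_cell(event_message, code_memory)

# A trend score could then combine the updated representations, e.g. an inner product.
score = (company_memory * code_memory).sum(dim=-1)
print(score.shape)  # torch.Size([1])
```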
Learning the solution operator of two-dimensional incompressible Navier-Stokes equations using physics-aware convolutional neural networks
results: The method is compared against a state-of-the-art data-based approach, and its performance when combined with the data-based approach is also evaluated.
Abstract:
In recent years, the concept of introducing physics to machine learning has become widely popular. Most physics-inclusive ML-techniques however are still limited to a single geometry or a set of parametrizable geometries. Thus, there remains the need to train a new model for a new geometry, even if it is only slightly modified. With this work we introduce a technique with which it is possible to learn approximate solutions to the steady-state Navier--Stokes equations in varying geometries without the need of parametrization. This technique is based on a combination of a U-Net-like CNN and well established discretization methods from the field of the finite difference method. The results of our physics-aware CNN are compared to a state-of-the-art data-based approach. Additionally, it is also shown how our approach performs when combined with the data-based approach.
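To make the finite-difference idea concrete, here is a minimal sketch (not the paper's discretization) of a physics residual for incompressible flow: mass conservation requires div(u) = du/dx + dv/dy = 0, so the discrete divergence of a predicted velocity field can serve as a physics-aware loss term.

```python
import numpy as np

def divergence_residual(u, v, dx, dy):
    # central differences on the interior of the grid
    dudx = (u[1:-1, 2:] - u[1:-1, :-2]) / (2 * dx)
    dvdy = (v[2:, 1:-1] - v[:-2, 1:-1]) / (2 * dy)
    return dudx + dvdy

# Example: an analytically divergence-free field u = y, v = x.
ny, nx = 32, 32
y, x = np.meshgrid(np.linspace(0, 1, ny), np.linspace(0, 1, nx), indexing="ij")
res = divergence_residual(u=y, v=x, dx=1 / (nx - 1), dy=1 / (ny - 1))
print(np.abs(res).max())  # ~0 up to floating-point error
```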
Can Attention Be Used to Explain EHR-Based Mortality Prediction Tasks: A Case Study on Hemorrhagic Stroke
results: The study finds that the model achieves higher accuracy and interpretability than earlier methods while providing clear feature importance.
Abstract:
Stroke is a significant cause of mortality and morbidity, necessitating early predictive strategies to minimize risks. Traditional methods for evaluating patients, such as Acute Physiology and Chronic Health Evaluation (APACHE II, IV) and Simplified Acute Physiology Score III (SAPS III), have limited accuracy and interpretability. This paper proposes a novel approach: an interpretable, attention-based transformer model for early stroke mortality prediction. This model seeks to address the limitations of previous predictive models, providing both interpretability (providing clear, understandable explanations of the model) and fidelity (giving a truthful explanation of the model's dynamics from input to output). Furthermore, the study explores and compares fidelity and interpretability scores using Shapley values and attention-based scores to improve model explainability. The research objectives include designing an interpretable attention-based transformer model, evaluating its performance compared to existing models, and providing feature importance derived from the model.
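A minimal sketch of the attention-based attribution the paper compares against Shapley values: average the attention weights a transformer layer assigns to each input feature, yielding one importance score per feature. The layer sizes and token layout are assumptions, not the paper's model.

```python
import torch
import torch.nn as nn

EMB, HEADS, TOKENS = 32, 4, 10
attn = nn.MultiheadAttention(embed_dim=EMB, num_heads=HEADS, batch_first=True)

x = torch.randn(1, TOKENS, EMB)  # e.g., embedded EHR measurements as tokens
out, weights = attn(x, x, x, need_weights=True, average_attn_weights=True)

# weights: (batch, query, key); averaging over queries gives one importance
# score per input token, which can then be compared against Shapley values.
importance = weights.mean(dim=1).squeeze(0)
print(importance.shape, importance.sum())  # torch.Size([10]), ~1.0
```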
Analysis and Optimization of Wireless Federated Learning with Data Heterogeneity
results: Experimental results on real-world datasets show that the proposed algorithm outperforms other benchmark schemes in both learning accuracy and energy consumption.
Abstract:
With the rapid proliferation of smart mobile devices, federated learning (FL) has been widely considered for application in wireless networks for distributed model training. However, data heterogeneity, e.g., non-independently identically distributions and different sizes of training data among clients, poses major challenges to wireless FL. Limited communication resources complicate the implementation of fair scheduling which is required for training on heterogeneous data, and further deteriorate the overall performance. To address this issue, this paper focuses on performance analysis and optimization for wireless FL, considering data heterogeneity, combined with wireless resource allocation. Specifically, we first develop a closed-form expression for an upper bound on the FL loss function, with a particular emphasis on data heterogeneity described by a dataset size vector and a data divergence vector. Then we formulate the loss function minimization problem, under constraints on long-term energy consumption and latency, and jointly optimize client scheduling, resource allocation, and the number of local training epochs (CRE). Next, via the Lyapunov drift technique, we transform the CRE optimization problem into a series of tractable problems. Extensive experiments on real-world datasets demonstrate that the proposed algorithm outperforms other benchmarks in terms of the learning accuracy and energy consumption.
results: Experiments show that BLNOs achieve excellent generalization with small training datasets and short training times, with a generalization error that is independent of the discretization adopted at testing. The partially-connected structure also reduces the number of tunable parameters. In a cardiac model with biophysically detailed electrophysiology and a heart-torso geometry, BLNOs were trained on 150 in silico generated 12-lead electrocardiograms spanning 7 model parameters, with automatic hyperparameter tuning. The best BLNO trained in under 3 hours on a single CPU with just 7 hidden layers and 19 neurons per layer, and its mean square error on an independent test set is on the order of $10^{-4}$.
Abstract:
We introduce Branched Latent Neural Operators (BLNOs) to learn input-output maps encoding complex physical processes. A BLNO is defined by a simple and compact feedforward partially-connected neural network that structurally disentangles inputs with different intrinsic roles, such as the time variable from model parameters of a differential equation, while transferring them into a generic field of interest. BLNOs leverage interpretable latent outputs to enhance the learned dynamics and break the curse of dimensionality by showing excellent generalization properties with small training datasets and short training times on a single processor. Indeed, their generalization error remains comparable regardless of the adopted discretization during the testing phase. Moreover, the partial connections, in place of a fully-connected structure, significantly reduce the number of tunable parameters. We show the capabilities of BLNOs in a challenging test case involving biophysically detailed electrophysiology simulations in a biventricular cardiac model of a pediatric patient with hypoplastic left heart syndrome. The model includes a purkinje network for fast conduction and a heart-torso geometry. Specifically, we trained BLNOs on 150 in silico generated 12-lead electrocardiograms (ECGs) while spanning 7 model parameters, covering cell-scale, organ-level and electrical dyssynchrony. Although the 12-lead ECGs manifest very fast dynamics with sharp gradients, after automatic hyperparameter tuning the optimal BLNO, trained in less than 3 hours on a single CPU, retains just 7 hidden layers and 19 neurons per layer. The mean square error is on the order of $10^{-4}$ on an independent test dataset comprised of 50 additional electrophysiology simulations. This paper provides a novel computational tool to build reliable and efficient reduced-order models for digital twinning in engineering applications.
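The sketch below illustrates the "branched, partially-connected" idea in the abstract: inputs with different intrinsic roles (the time variable vs. model parameters) are encoded by separate branches before being merged, rather than fed through one fully-connected trunk. All layer sizes and names are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class BranchedNet(nn.Module):
    def __init__(self, n_params=7, latent=16, width=19):
        super().__init__()
        self.time_branch = nn.Sequential(nn.Linear(1, width), nn.Tanh())
        self.param_branch = nn.Sequential(nn.Linear(n_params, width), nn.Tanh())
        self.trunk = nn.Sequential(
            nn.Linear(2 * width, width), nn.Tanh(),
            nn.Linear(width, latent),   # interpretable latent outputs
            nn.Linear(latent, 12),      # e.g., 12-lead ECG values at time t
        )

    def forward(self, t, params):
        # time and parameters are disentangled structurally, then merged
        z = torch.cat([self.time_branch(t), self.param_branch(params)], dim=-1)
        return self.trunk(z)

model = BranchedNet()
y = model(torch.rand(8, 1), torch.rand(8, 7))
print(y.shape)  # torch.Size([8, 12])
```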
Eva: A General Vectorized Approximation Framework for Second-order Optimization
results: Extensive experiments on different models and datasets show that Eva reduces end-to-end training time and improves training efficiency compared with first-order SGD and the second-order algorithms K-FAC and Shampoo.
Abstract:
Second-order optimization algorithms exhibit excellent convergence properties for training deep learning models, but often incur significant computation and memory overheads. This can result in lower training efficiency than the first-order counterparts such as stochastic gradient descent (SGD). In this work, we present a memory- and time-efficient second-order algorithm named Eva with two novel techniques: 1) we construct the second-order information with the Kronecker factorization of small stochastic vectors over a mini-batch of training data to reduce memory consumption, and 2) we derive an efficient update formula without explicitly computing the inverse of matrices using the Sherman-Morrison formula. We further extend Eva to a general vectorized approximation framework to improve the compute and memory efficiency of two existing second-order algorithms (FOOF and Shampoo) without affecting their convergence performance. Extensive experimental results on different models and datasets show that Eva reduces the end-to-end training time up to 2.05x and 2.42x compared to first-order SGD and second-order algorithms (K-FAC and Shampoo), respectively.
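The Sherman-Morrison identity the abstract relies on lets a solver refresh an inverse after a rank-one change without re-inverting the matrix. A quick numerical check of the identity (not Eva's full update rule):

```python
import numpy as np

# (A + u v^T)^{-1} = A^{-1} - (A^{-1} u v^T A^{-1}) / (1 + v^T A^{-1} u)
rng = np.random.default_rng(0)
n = 5
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))
u, v = rng.standard_normal(n), rng.standard_normal(n)

A_inv = np.linalg.inv(A)
denom = 1.0 + v @ A_inv @ u
updated = A_inv - np.outer(A_inv @ u, v @ A_inv) / denom

# Check against the direct inverse of the rank-one-updated matrix.
direct = np.linalg.inv(A + np.outer(u, v))
print(np.allclose(updated, direct))  # True
```

The update costs O(n^2) instead of the O(n^3) of a fresh inversion, which is where the time savings of avoiding explicit matrix inverses come from.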
results: We evaluate our approach on computer vision and natural language processing tasks using various models, datasets, and scenarios to demonstrate its effectiveness in accurately identifying model provenance.
Abstract:
Understanding the life cycle of the machine learning (ML) model is an intriguing area of research (e.g., understanding where the model comes from, how it is trained, and how it is used). This paper focuses on a novel problem within this field, namely Model Provenance (MP), which concerns the relationship between a target model and its pre-training model and aims to determine whether a source model serves as the provenance for a target model. This is an important problem that has significant implications for ensuring the security and intellectual property of machine learning models but has not received much attention in the literature. To fill in this gap, we introduce a novel concept of Model DNA which represents the unique characteristics of a machine learning model. We utilize a data-driven and model-driven representation learning method to encode the model's training data and input-output information as a compact and comprehensive representation (i.e., DNA) of the model. Using this model DNA, we develop an efficient framework for model provenance identification, which enables us to identify whether a source model is a pre-training model of a target model. We conduct evaluations on both computer vision and natural language processing tasks using various models, datasets, and scenarios to demonstrate the effectiveness of our approach in accurately identifying model provenance.
Designing a Deep Learning-Driven Resource-Efficient Diagnostic System for Metastatic Breast Cancer: Reducing Long Delays of Clinical Diagnosis and Improving Patient Survival in Developing Countries
results: According to the evaluation, the MobileNetV2-based diagnostic model outperforms the more complex VGG16, ResNet50, and ResNet101 models in diagnostic accuracy, model generalization, and training efficiency. Visual comparisons further show that the MobileNetV2 model can identify very small cancerous nodes embedded in large areas of normal cells.
Abstract:
Breast cancer is one of the leading causes of cancer mortality. Breast cancer patients in developing countries, especially sub-Saharan Africa, South Asia, and South America, suffer from the highest mortality rate in the world. One crucial factor contributing to the global disparity in mortality rate is long delay of diagnosis due to a severe shortage of trained pathologists, which consequently has led to a large proportion of late-stage presentation at diagnosis. The delay between the initial development of symptoms and the receipt of a diagnosis could stretch upwards 15 months. To tackle this critical healthcare disparity, this research has developed a deep learning-based diagnosis system for metastatic breast cancer that can achieve high diagnostic accuracy as well as computational efficiency. Based on our evaluation, the MobileNetV2-based diagnostic model outperformed the more complex VGG16, ResNet50 and ResNet101 models in diagnostic accuracy, model generalization, and model training efficiency. The visual comparisons between the model prediction and ground truth have demonstrated that the MobileNetV2 diagnostic models can identify very small cancerous nodes embedded in a large area of normal cells which is challenging for manual image analysis. Equally Important, the light weighted MobleNetV2 models were computationally efficient and ready for mobile devices or devices of low computational power. These advances empower the development of a resource-efficient and high performing AI-based metastatic breast cancer diagnostic system that can adapt to under-resourced healthcare facilities in developing countries. This research provides an innovative technological solution to address the long delays in metastatic breast cancer diagnosis and the consequent disparity in patient survival outcome in developing countries.
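A minimal sketch of the kind of MobileNetV2-based transfer setup the abstract describes (normal vs. metastatic tissue); the data pipeline, training loop, and all hyperparameters are assumptions, and this is not the paper's released code.

```python
import torch
import torch.nn as nn
from torchvision import models

weights = models.MobileNet_V2_Weights.DEFAULT
model = models.mobilenet_v2(weights=weights)

# Replace the ImageNet head with a 2-class head for histopathology patches.
model.classifier[1] = nn.Linear(model.last_channel, 2)

# Optionally freeze the lightweight feature extractor, keeping training
# feasible on devices with low computational power.
for p in model.features.parameters():
    p.requires_grad = False

x = torch.randn(4, 3, 224, 224)  # a batch of image patches
logits = model(x)
print(logits.shape)              # torch.Size([4, 2])
```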
VQGraph: Graph Vector-Quantization for Bridging GNNs and MLPs
results: Extensive experiments and analyses demonstrate the strong performance of VQGraph, which achieves new state-of-the-art results on seven graph datasets. VQGraph infers 828x faster than GNNs while also improving accuracy over both GNNs and stand-alone MLPs.
Abstract:
Graph Neural Networks (GNNs) conduct message passing which aggregates local neighbors to update node representations. Such message passing leads to scalability issues in practical latency-constrained applications. To address this issue, recent methods adopt knowledge distillation (KD) to learn computationally-efficient multi-layer perceptron (MLP) by mimicking the output of GNN. However, the existing GNN representation space may not be expressive enough for representing diverse local structures of the underlying graph, which limits the knowledge transfer from GNN to MLP. Here we present a novel framework VQGraph to learn a powerful graph representation space for bridging GNNs and MLPs. We adopt the encoder of a variant of a vector-quantized variational autoencoder (VQ-VAE) as a structure-aware graph tokenizer, which explicitly represents the nodes of diverse local structures as numerous discrete tokens and constitutes a meaningful codebook. Equipped with the learned codebook, we propose a new token-based distillation objective based on soft token assignments to sufficiently transfer the structural knowledge from GNN to MLP. Extensive experiments and analyses demonstrate the strong performance of VQGraph, where we achieve new state-of-the-art performance on GNN-MLP distillation in both transductive and inductive settings across seven graph datasets. We show that VQGraph with better performance infers faster than GNNs by 828x, and also achieves accuracy improvement over GNNs and stand-alone MLPs by 3.90% and 28.05% on average, respectively. Code: https://github.com/YangLing0818/VQGraph.
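A hedged sketch of token-based distillation with a learned codebook, in the spirit of the abstract (this is not the released VQGraph code): node embeddings are softly assigned to discrete codebook tokens, and the MLP student is trained to match the GNN teacher's assignment distribution. Sizes and the temperature are illustrative.

```python
import torch
import torch.nn.functional as F

codebook = torch.randn(256, 64)  # 256 structure tokens of dimension 64

def soft_assignments(node_emb, tau=1.0):
    d = torch.cdist(node_emb, codebook)  # distance to every codebook entry
    return F.softmax(-d / tau, dim=-1)   # closer token -> higher weight

teacher_emb = torch.randn(32, 64)                      # GNN node representations
student_emb = torch.randn(32, 64, requires_grad=True)  # MLP representations

p_teacher = soft_assignments(teacher_emb).detach()
p_student = soft_assignments(student_emb)

# KL divergence between token distributions as the distillation objective.
loss = F.kl_div(p_student.log(), p_teacher, reduction="batchmean")
loss.backward()
print(float(loss))
```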
Breast Ultrasound Tumor Classification Using a Hybrid Multitask CNN-Transformer Network
results: Compared with nine BUS classification methods, Hybrid-MT-ESTAN achieved the highest accuracy, sensitivity, and F1 score of 82.7%, 86.4%, and 86.0%, respectively.
Abstract:
Capturing global contextual information plays a critical role in breast ultrasound (BUS) image classification. Although convolutional neural networks (CNNs) have demonstrated reliable performance in tumor classification, they have inherent limitations for modeling global and long-range dependencies due to the localized nature of convolution operations. Vision Transformers have an improved capability of capturing global contextual information but may distort the local image patterns due to the tokenization operations. In this study, we proposed a hybrid multitask deep neural network called Hybrid-MT-ESTAN, designed to perform BUS tumor classification and segmentation using a hybrid architecture composed of CNNs and Swin Transformer components. The proposed approach was compared to nine BUS classification methods and evaluated using seven quantitative metrics on a dataset of 3,320 BUS images. The results indicate that Hybrid-MT-ESTAN achieved the highest accuracy, sensitivity, and F1 score of 82.7%, 86.4%, and 86.0%, respectively.
Efficient Model Adaptation for Continual Learning at the Edge
results: The method is systematically evaluated on multiple benchmark datasets and compared with existing OOD detection and few-/zero-shot neural architecture search algorithms. The results show strong OOD detection and adaptation capabilities, enabling efficient continual learning across different environments.
Abstract:
Most machine learning (ML) systems assume stationary and matching data distributions during training and deployment. This is often a false assumption. When ML models are deployed on real devices, data distributions often shift over time due to changes in environmental factors, sensor characteristics, and task-of-interest. While it is possible to have a human-in-the-loop to monitor for distribution shifts and engineer new architectures in response to these shifts, such a setup is not cost-effective. Instead, non-stationary automated ML (AutoML) models are needed. This paper presents the Encoder-Adaptor-Reconfigurator (EAR) framework for efficient continual learning under domain shifts. The EAR framework uses a fixed deep neural network (DNN) feature encoder and trains shallow networks on top of the encoder to handle novel data. The EAR framework is capable of 1) detecting when new data is out-of-distribution (OOD) by combining DNNs with hyperdimensional computing (HDC), 2) identifying low-parameter neural adaptors to adapt the model to the OOD data using zero-shot neural architecture search (ZS-NAS), and 3) minimizing catastrophic forgetting on previous tasks by progressively growing the neural architecture as needed and dynamically routing data through the appropriate adaptors and reconfigurators for handling domain-incremental and class-incremental continual learning. We systematically evaluate our approach on several benchmark datasets for domain adaptation and demonstrate strong performance compared to state-of-the-art algorithms for OOD detection and few-/zero-shot NAS.
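A minimal hyperdimensional-computing sketch of the OOD test the abstract combines with DNN features: each class is summarized by a bundled bipolar hypervector of its training encodings, and a sample whose best cosine similarity falls below a threshold is flagged as out-of-distribution. The random encodings and threshold here are stand-ins, not the paper's encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000  # hypervector dimensionality

def bundle(vectors):
    # majority vote over bipolar {-1, +1} hypervectors
    return np.sign(vectors.sum(axis=0))

# one bundled prototype per known class (random stand-ins for DNN features
# projected into hyperdimensional space)
class_protos = {c: bundle(rng.choice([-1, 1], size=(100, D))) for c in range(3)}

def is_ood(x, threshold=0.02):
    sims = [x @ p / D for p in class_protos.values()]  # cosine for bipolar vectors
    return max(sims) < threshold

sample = rng.choice([-1, 1], size=D)  # a random vector is near-orthogonal to all
print(is_ood(sample))                 # likely True
```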
Target specification bias, counterfactual prediction, and algorithmic fairness in healthcare
results: The paper shows that target specification bias is a pervasive source of bias in ML for healthcare that can lead to inefficient utilization of medical resources and suboptimal decisions that harm patients. Drawing on recent work in metrology, it also suggests ways of counteracting this bias through how the target variable is operationalized.
Abstract:
Bias in applications of machine learning (ML) to healthcare is usually attributed to unrepresentative or incomplete data, or to underlying health disparities. This article identifies a more pervasive source of bias that affects the clinical utility of ML-enabled prediction tools: target specification bias. Target specification bias arises when the operationalization of the target variable does not match its definition by decision makers. The mismatch is often subtle, and stems from the fact that decision makers are typically interested in predicting the outcomes of counterfactual, rather than actual, healthcare scenarios. Target specification bias persists independently of data limitations and health disparities. When left uncorrected, it gives rise to an overestimation of predictive accuracy, to inefficient utilization of medical resources, and to suboptimal decisions that can harm patients. Recent work in metrology - the science of measurement - suggests ways of counteracting target specification bias and avoiding its harmful consequences.
Causality Guided Disentanglement for Cross-Platform Hate Speech Detection
results: Experimental results show that, compared with existing state-of-the-art methods, the model is markedly more effective at detecting hate speech across multiple platforms.
Abstract:
Social media platforms, despite their value in promoting open discourse, are often exploited to spread harmful content. Current deep learning and natural language processing models used for detecting this harmful content overly rely on domain-specific terms affecting their capabilities to adapt to generalizable hate speech detection. This is because they tend to focus too narrowly on particular linguistic signals or the use of certain categories of words. Another significant challenge arises when platforms lack high-quality annotated data for training, leading to a need for cross-platform models that can adapt to different distribution shifts. Our research introduces a cross-platform hate speech detection model capable of being trained on one platform's data and generalizing to multiple unseen platforms. To achieve good generalizability across platforms, one way is to disentangle the input representations into invariant and platform-dependent features. We also argue that learning causal relationships, which remain constant across diverse environments, can significantly aid in understanding invariant representations in hate speech. By disentangling input into platform-dependent features (useful for predicting hate targets) and platform-independent features (used to predict the presence of hate), we learn invariant representations resistant to distribution shifts. These features are then used to predict hate speech across unseen platforms. Our extensive experiments across four platforms highlight our model's enhanced efficacy compared to existing state-of-the-art methods in detecting generalized hate speech.
Specious Sites: Tracking the Spread and Sway of Spurious News Stories at Scale
results: The study identifies the most prevalent news narratives of 2022 and the most influential websites that originate and amplify them. It also shows how the system can help fact-checkers such as Politifact, Reuters, and AP News address misinformation stories more quickly.
Abstract:
Misinformation, propaganda, and outright lies proliferate on the web, with some narratives having dangerous real-world consequences on public health, elections, and individual safety. However, despite the impact of misinformation, the research community largely lacks automated and programmatic approaches for tracking news narratives across online platforms. In this work, utilizing daily scrapes of 1,404 unreliable news websites, the large-language model MPNet, and DP-Means clustering, we introduce a system to automatically isolate and analyze the narratives spread within online ecosystems. Identifying 55,301 narratives on these 1,404 websites, we describe the most prevalent narratives spread in 2022 and identify the most influential websites that originate and magnify narratives. Finally, we show how our system can be utilized to detect new narratives originating from unreliable news websites and aid fact-checkers like Politifact, Reuters, and AP News in more quickly addressing misinformation stories.
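A hedged sketch of the narrative-isolation pipeline: embed passages with an MPNet sentence encoder, then cluster with a minimal DP-Means loop, which spawns a new cluster whenever a point lies farther than `lam` from every centroid, so the number of narratives need not be fixed in advance. The checkpoint name is a standard sentence-transformers model and the distance threshold is an assumption; neither is necessarily what the paper used.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def dp_means(X, lam, iters=10):
    centroids = [X.mean(axis=0)]
    assign = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        for i, x in enumerate(X):
            d = np.linalg.norm(np.asarray(centroids) - x, axis=1)
            if d.min() > lam:
                centroids.append(x.copy())  # distant point starts a new narrative
                assign[i] = len(centroids) - 1
            else:
                assign[i] = int(d.argmin())
        # update step: recompute centroids of non-empty clusters
        centroids = [X[assign == k].mean(axis=0) if np.any(assign == k) else c
                     for k, c in enumerate(centroids)]
    return assign

model = SentenceTransformer("all-mpnet-base-v2")
passages = ["The election results were falsified in several counties.",
            "Officials falsified vote counts across multiple counties.",
            "A new study links coffee to improved memory."]
X = np.asarray(model.encode(passages))
print(dp_means(X, lam=0.7))  # e.g., [0 0 1] -- two distinct narratives
```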
Mitigating Task Interference in Multi-Task Learning via Explicit Task Routing with Non-Learnable Primitives
for: This paper proposes a method to mitigate task interference in multi-task learning (MTL) models by combining non-learnable primitives (NLPs) and explicit task routing (ETR).
methods: The proposed ETR-NLP model employs non-learnable primitives to extract task-agnostic features and recombine them into a shared branch common to all tasks and explicit task-specific branches reserved for each task.
results: The proposed ETR-NLP model significantly outperforms state-of-the-art baselines with fewer learnable parameters and similar FLOPs across all datasets.
Abstract:
Multi-task learning (MTL) seeks to learn a single model to accomplish multiple tasks by leveraging shared information among the tasks. Existing MTL models, however, have been known to suffer from negative interference among tasks. Efforts to mitigate task interference have focused on either loss/gradient balancing or implicit parameter partitioning with partial overlaps among the tasks. In this paper, we propose ETR-NLP to mitigate task interference through a synergistic combination of non-learnable primitives (NLPs) and explicit task routing (ETR). Our key idea is to employ non-learnable primitives to extract a diverse set of task-agnostic features and recombine them into a shared branch common to all tasks and explicit task-specific branches reserved for each task. The non-learnable primitives and the explicit decoupling of learnable parameters into shared and task-specific ones afford the flexibility needed for minimizing task interference. We evaluate the efficacy of ETR-NLP networks for both image-level classification and pixel-level dense prediction MTL problems. Experimental results indicate that ETR-NLP significantly outperforms state-of-the-art baselines with fewer learnable parameters and similar FLOPs across all datasets. Code is available at https://github.com/zhichao-lu/etr-nlp-mtl.
On the Biometric Capacity of Generative Face Models
results: The estimated upper bounds on biometric capacity are $1.43\times10^6$ for faces generated by StyleGAN3 and $1.190\times10^4$ for DCFace at a false acceptance rate (FAR) of 0.1%; lowering the desired FAR reduces the StyleGAN3 estimate to $1.796\times10^4$ and $562$ at FAR of 1% and 10%, respectively. The study also finds that some generative face models show an appreciable capacity disparity with respect to age, but none with respect to gender.
Abstract:
There has been tremendous progress in generating realistic faces with high fidelity over the past few years. Despite this progress, a crucial question remains unanswered: "Given a generative face model, how many unique identities can it generate?" In other words, what is the biometric capacity of the generative face model? A scientific basis for answering this question will benefit evaluating and comparing different generative face models and establish an upper bound on their scalability. This paper proposes a statistical approach to estimate the biometric capacity of generated face images in a hyperspherical feature space. We employ our approach on multiple generative models, including unconditional generators like StyleGAN, Latent Diffusion Model, and "Generated Photos," as well as DCFace, a class-conditional generator. We also estimate capacity w.r.t. demographic attributes such as gender and age. Our capacity estimates indicate that (a) under ArcFace representation at a false acceptance rate (FAR) of 0.1%, StyleGAN3 and DCFace have a capacity upper bound of $1.43\times10^6$ and $1.190\times10^4$, respectively; (b) the capacity reduces drastically as we lower the desired FAR with an estimate of $1.796\times10^4$ and $562$ at FAR of 1% and 10%, respectively, for StyleGAN3; (c) there is no discernible disparity in the capacity w.r.t gender; and (d) for some generative models, there is an appreciable disparity in the capacity w.r.t age. Code is available at https://github.com/human-analysis/capacity-generative-face-models.
Accurate Neural Network Pruning Requires Rethinking Sparse Optimization
paper_authors: Denis Kuznedelev, Eldar Kurtic, Eugenia Iofinova, Elias Frantar, Alexandra Peste, Dan Alistarh
for: The paper investigates the impact of high sparsity on model training, and how standard stochastic optimization techniques behave when training sparse networks.
methods: Training is examined on standard computer vision and natural language processing sparsity benchmarks, and new approaches are proposed to mitigate the under-training issue of sparse training.
results: The results show that using standard dense training recipes for sparse training is suboptimal and leads to under-training. The proposed approaches enable model training in the high-sparsity regime, achieving state-of-the-art results in both computer vision and natural language processing.
Abstract:
Obtaining versions of deep neural networks that are both highly-accurate and highly-sparse is one of the main challenges in the area of model compression, and several high-performance pruning techniques have been investigated by the community. Yet, much less is known about the interaction between sparsity and the standard stochastic optimization techniques used for training sparse networks, and most existing work uses standard dense schedules and hyperparameters for training sparse networks. In this work, we examine the impact of high sparsity on model training using the standard computer vision and natural language processing sparsity benchmarks. We begin by showing that using standard dense training recipes for sparse training is suboptimal, and results in under-training. We provide new approaches for mitigating this issue for both sparse pre-training of vision models (e.g. ResNet50/ImageNet) and sparse fine-tuning of language models (e.g. BERT/GLUE), achieving state-of-the-art results in both settings in the high-sparsity regime, and providing detailed analyses for the difficulty of sparse training in both scenarios. Our work sets a new threshold in terms of the accuracies that can be achieved under high sparsity, and should inspire further research into improving sparse model training, to reach higher accuracies under high sparsity, but also to do so efficiently.
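For context, the sketch below shows the standard mechanics underlying sparse training: global magnitude pruning followed by re-applying the sparsity mask after each optimizer step. The paper's specific schedules and hyperparameters are not reproduced; sizes and the sparsity level are illustrative.

```python
import torch
import torch.nn as nn

def make_masks(model, sparsity=0.95):
    scores = torch.cat([p.abs().flatten() for p in model.parameters()])
    threshold = scores.kthvalue(int(sparsity * scores.numel())).values
    return {n: (p.abs() > threshold).float() for n, p in model.named_parameters()}

def apply_masks(model, masks):
    with torch.no_grad():
        for n, p in model.named_parameters():
            p.mul_(masks[n])  # keep pruned weights at exactly zero

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))
masks = make_masks(model, sparsity=0.95)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(8, 64), torch.randint(0, 10, (8,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
opt.step()
apply_masks(model, masks)

density = sum(m.sum() for m in masks.values()) / sum(m.numel() for m in masks.values())
print(f"density: {density:.3f}")  # ~0.05
```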
Incorporating Recklessness to Collaborative Filtering based Recommender Systems
results: Experimental results show that recklessness not only allows for risk regulation but also improves the quantity and quality of the predictions provided by the recommender system.
Abstract:
Recommender systems that include some reliability measure of their predictions tend to be more conservative in forecasting, due to their constraint to preserve reliability. This leads to a significant drop in the coverage and novelty that these systems can provide. In this paper, we propose the inclusion of a new term in the learning process of matrix factorization-based recommender systems, called recklessness, which enables the control of the risk level desired when making decisions about the reliability of a prediction. Experimental results demonstrate that recklessness not only allows for risk regulation but also improves the quantity and quality of predictions provided by the recommender system.
Seasonality Based Reranking of E-commerce Autocomplete Using Natural Language Queries
results: The study shows that incorporating seasonality as a signal improves QAC relevance and business metrics, providing a new way to evaluate and optimize QAC models.
Abstract:
Query autocomplete (QAC), also known as typeahead, suggests a list of complete queries as the user types a prefix in the search box. It is one of the key features of modern search engines, especially in e-commerce. One of the goals of typeahead is to suggest relevant queries to users which are seasonally important. In this paper we propose a neural network based natural language processing (NLP) algorithm to incorporate seasonality as a signal and present end to end evaluation of the QAC ranking model. Incorporating seasonality into the autocomplete ranking model can improve autocomplete relevance and business metrics.
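A hedged sketch of what seasonality-based reranking can look like in its simplest form: blend each candidate query's base ranker score with a seasonal popularity prior for the current month. The blending weight and the popularity table are illustrative assumptions, not the paper's learned model.

```python
from datetime import date

SEASONAL_POPULARITY = {  # P(query | month), e.g., estimated from past search logs
    ("halloween costume", 10): 0.9,
    ("halloween costume", 6): 0.05,
    ("swim shorts", 6): 0.8,
}

def rerank(candidates, base_scores, month, w=0.3):
    blended = {
        q: (1 - w) * base_scores[q] + w * SEASONAL_POPULARITY.get((q, month), 0.0)
        for q in candidates
    }
    return sorted(blended, key=blended.get, reverse=True)

month = date.today().month
print(rerank(["halloween costume", "swim shorts"],
             {"halloween costume": 0.5, "swim shorts": 0.5}, month=10))
# ['halloween costume', 'swim shorts'] for an October query
```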
Robust Independence Tests with Finite Sample Guarantees for Synchronous Stochastic Linear Systems
paper_authors: Ambrus Tamás, Dániel Ágoston Bálint, Balázs Csanád Csáji
for: The paper develops robust independence tests with non-asymptotically guaranteed significance levels for the outputs of synchronous stochastic linear time-invariant (SLTI) systems.
methods: The method combines confidence region estimates with permutation tests and general dependence measures, such as the Hilbert-Schmidt independence criterion and the distance covariance, to detect nonlinear dependence between the observed SLTI systems.
results: The study provides distribution-free bounds on the type I error probabilities, proves the consistency of the hypothesis tests under mild assumptions, and demonstrates the ideas on autoregressive systems.
Abstract:
The paper introduces robust independence tests with non-asymptotically guaranteed significance levels for stochastic linear time-invariant systems, assuming that the observed outputs are synchronous, which means that the systems are driven by jointly i.i.d. noises. Our method provides bounds for the type I error probabilities that are distribution-free, i.e., the innovations can have arbitrary distributions. The algorithm combines confidence region estimates with permutation tests and general dependence measures, such as the Hilbert-Schmidt independence criterion and the distance covariance, to detect any nonlinear dependence between the observed systems. We also prove the consistency of our hypothesis tests under mild assumptions and demonstrate the ideas through the example of autoregressive systems.
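As a small self-contained illustration of two ingredients the abstract names, here is a permutation test using the (squared) sample distance covariance as the dependence measure; the confidence-region machinery of the paper is omitted and the data are synthetic.

```python
import numpy as np

def dcov2(x, y):
    # squared sample distance covariance via double-centered distance matrices
    def centered(z):
        d = np.abs(z[:, None] - z[None, :])  # pairwise distances (1-D case)
        return d - d.mean(0) - d.mean(1)[:, None] + d.mean()
    A, B = centered(x), centered(y)
    return (A * B).mean()

def permutation_pvalue(x, y, n_perm=999, seed=0):
    rng = np.random.default_rng(seed)
    t0 = dcov2(x, y)
    perms = [dcov2(x, rng.permutation(y)) for _ in range(n_perm)]
    return (1 + sum(t >= t0 for t in perms)) / (n_perm + 1)

rng = np.random.default_rng(1)
x = rng.standard_normal(200)
y = x**2 + 0.1 * rng.standard_normal(200)  # nonlinear dependence, zero correlation
print(permutation_pvalue(x, y))            # small p-value: dependence detected
```

Note that x and y here are uncorrelated in the Pearson sense, which is exactly the kind of nonlinear dependence these general measures are designed to catch.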
results: On two challenging DLA datasets, the GLAM model outperforms a leading 140M+ parameter computer-vision-based model on 5 of the 11 classes, and a simple ensemble of the two models achieves a new state of the art on the DocLayNet dataset. GLAM is over 5 times more efficient than SOTA models, making it a favorable engineering choice for DLA tasks.
Abstract:
Document layout analysis (DLA) is the task of detecting the distinct, semantic content within a document and correctly classifying these items into an appropriate category (e.g., text, title, figure). DLA pipelines enable users to convert documents into structured machine-readable formats that can then be used for many useful downstream tasks. Most existing state-of-the-art (SOTA) DLA models represent documents as images, discarding the rich metadata available in electronically generated PDFs. Directly leveraging this metadata, we represent each PDF page as a structured graph and frame the DLA problem as a graph segmentation and classification problem. We introduce the Graph-based Layout Analysis Model (GLAM), a lightweight graph neural network competitive with SOTA models on two challenging DLA datasets - while being an order of magnitude smaller than existing models. In particular, the 4-million parameter GLAM model outperforms the leading 140M+ parameter computer vision-based model on 5 of the 11 classes on the DocLayNet dataset. A simple ensemble of these two models achieves a new state-of-the-art on DocLayNet, increasing mAP from 76.8 to 80.8. Overall, GLAM is over 5 times more efficient than SOTA models, making GLAM a favorable engineering choice for DLA tasks.
SMARLA: A Safety Monitoring Approach for Deep Reinforcement Learning Agents
paper_authors: Amirhossein Zolfagharian, Manel Abdellatif, Lionel C. Briand, Ramesh S
for: The paper addresses the safety of deep reinforcement learning (DRL) agents deployed in safety-critical systems.
methods: It proposes SMARLA, a machine learning-based safety monitoring approach for DRL agents. SMARLA is black-box (it does not require access to the agent's internals) and leverages state abstraction to reduce the state space, making it easier to learn safety violation prediction models from the agent's states.
results: On two well-known RL case studies, SMARLA accurately predicts safety violations with a low false positive rate, and can predict violations at an early stage, approximately halfway through the agent's execution, before they occur.
Abstract:
Deep reinforcement learning algorithms (DRL) are increasingly being used in safety-critical systems. Ensuring the safety of DRL agents is a critical concern in such contexts. However, relying solely on testing is not sufficient to ensure safety as it does not offer guarantees. Building safety monitors is one solution to alleviate this challenge. This paper proposes SMARLA, a machine learning-based safety monitoring approach designed for DRL agents. For practical reasons, SMARLA is designed to be black-box (as it does not require access to the internals of the agent) and leverages state abstraction to reduce the state space and thus facilitate the learning of safety violation prediction models from agent's states. We validated SMARLA on two well-known RL case studies. Empirical analysis reveals that SMARLA achieves accurate violation prediction with a low false positive rate, and can predict safety violations at an early stage, approximately halfway through the agent's execution before violations occur.
FuNToM: Functional Modeling of RF Circuits Using a Neural Network Assisted Two-Port Analysis Method
methods: The method leverages a two-port analysis approach to model multiple circuit topologies, using neural networks to predict circuit behavior.
results: Compared with existing methods, the required training data is reduced by 2.8x - 10.9x while maintaining the same accuracy, and collecting the training set for post-layout modeling takes 176.8x - 188.6x less time.
Abstract:
Automatic synthesis of analog and Radio Frequency (RF) circuits is a trending approach that requires an efficient circuit modeling method. This is due to the expensive cost of running a large number of simulations at each synthesis cycle. Artificial intelligence methods are promising approaches for circuit modeling due to their speed and relative accuracy. However, existing approaches require a large amount of training data, which is still collected using simulation runs. In addition, such approaches collect a whole separate dataset for each circuit topology even if a single element is added or removed. These matters are only exacerbated by the need for post-layout modeling simulations, which take even longer. To alleviate these drawbacks, in this paper, we present FuNToM, a functional modeling method for RF circuits. FuNToM leverages the two-port analysis method for modeling multiple topologies using a single main dataset and multiple small datasets. It also leverages neural networks which have shown promising results in predicting the behavior of circuits. Our results show that for multiple RF circuits, in comparison to the state-of-the-art works, while maintaining the same accuracy, the required training data is reduced by 2.8x - 10.9x. In addition, FuNToM needs 176.8x - 188.6x less time for collecting the training set in post-layout modeling.
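For background on why two-port analysis generalizes across topologies, here is a textbook ABCD-matrix sketch (not FuNToM's neural model): each element maps to a 2x2 ABCD matrix, a cascade is a matrix product, and network quantities follow from the product, so a topology change is just a different matrix chain. Component values are illustrative.

```python
import numpy as np

def series_Z(Z):
    return np.array([[1, Z], [0, 1]], dtype=complex)

def shunt_Y(Y):
    return np.array([[1, 0], [Y, 1]], dtype=complex)

f = 1e9                          # 1 GHz
w = 2 * np.pi * f
L, C, ZL = 3e-9, 2e-12, 50.0     # series inductor, shunt capacitor, 50-ohm load

# L-section network: series L followed by shunt C, terminated in the load.
ABCD = series_Z(1j * w * L) @ shunt_Y(1j * w * C)
A, B, C_, D = ABCD.ravel()

Zin = (A * ZL + B) / (C_ * ZL + D)  # input impedance seen into the two-port
print(f"Zin = {Zin:.1f} ohms")
```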
Deep Maxout Network-based Feature Fusion and Political Tangent Search Optimizer enabled Transfer Learning for Thalassemia Detection
for: The paper targets detection of thalassemia, a heritable blood disorder, and understanding the frequency of its occurrence and the reliable mutations involved, in order to prevent, control, and treat the disease.
methods: It proposes a Political Tangent Search Optimizer based Transfer Learning (PTSO_TL) method for thalassemia detection, comprising quantile-based data normalization, feature fusion with a Weighted Euclidean Distance based Deep Maxout Network, data augmentation by oversampling, and a convolutional neural network (CNN) with hyperparameters from a trained model such as Xception.
results: The PTSO_TL method obtained maximal precision, recall, and f-measure values of about 94.3%, 96.1%, and 95.2%, respectively.
Abstract:
Thalassemia is a heritable blood disorder which is the outcome of a genetic defect causing lack of production of hemoglobin polypeptide chains. However, there is less understanding of the precise frequency as well as sharing in these areas. Knowing about the frequency of thalassemia occurrence and dependable mutations is thus a significant step in preventing, controlling, and treatment planning. Here, Political Tangent Search Optimizer based Transfer Learning (PTSO_TL) is introduced for thalassemia detection. Initially, input data obtained from a particular dataset is normalized in the data normalization stage. Quantile normalization is utilized in the data normalization stage, and the data are then passed to the feature fusion phase, in which Weighted Euclidean Distance with Deep Maxout Network (DMN) is utilized. Thereafter, data augmentation is performed using the oversampling method to increase data dimensionality. Lastly, thalassemia detection is carried out by TL, wherein a convolutional neural network (CNN) is utilized with hyperparameters from a trained model such as Xception. TL is tuned by PTSO, and the training algorithm PTSO is presented by merging of Political Optimizer (PO) and Tangent Search Algorithm (TSA). Furthermore, PTSO_TL obtained maximal precision, recall, and f-measure values of about 94.3%, 96.1%, and 95.2%, respectively.
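The pipeline's first stage is quantile normalization; the sketch below shows the classic rank-based form (the paper's exact variant is not specified here): every column (sample) is forced to share the same empirical distribution by replacing each value with the mean of the values holding the same rank across all samples.

```python
import numpy as np

def quantile_normalize(X):
    order = np.argsort(X, axis=0)          # per-column ranks
    ref = np.sort(X, axis=0).mean(axis=1)  # mean value at each rank
    out = np.empty_like(X, dtype=float)
    for j in range(X.shape[1]):
        out[order[:, j], j] = ref          # value at rank r -> rank-r mean
    return out

X = np.array([[5.0, 4.0, 3.0],
              [2.0, 1.0, 4.0],
              [3.0, 4.0, 6.0],
              [4.0, 2.0, 8.0]])
print(quantile_normalize(X))
# every column now holds the same sorted values (the rank-wise means)
```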
Federated Representation Learning for Automatic Speech Recognition
results: The study shows that the FL pre-trained model performs as well as a centrally pre-trained model, and that adapting the federated pre-trained models to a new language, French, yields a 20% (WER) improvement over no pre-training.
Abstract:
Federated Learning (FL) is a privacy-preserving paradigm, allowing edge devices to learn collaboratively without sharing data. Edge devices like Alexa and Siri are prospective sources of unlabeled audio data that can be tapped to learn robust audio representations. In this work, we bring Self-supervised Learning (SSL) and FL together to learn representations for Automatic Speech Recognition respecting data privacy constraints. We use the speaker and chapter information in the unlabeled speech dataset, Libri-Light, to simulate non-IID speaker-siloed data distributions and pre-train an LSTM encoder with the Contrastive Predictive Coding framework with FedSGD. We show that the pre-trained ASR encoder in FL performs as well as a centrally pre-trained model and produces an improvement of 12-15% (WER) compared to no pre-training. We further adapt the federated pre-trained models to a new language, French, and show a 20% (WER) improvement over no pre-training.
Memory capacity of two layer neural networks with smooth activations
results: The authors establish a lower bound of md/2 on the memory capacity of two-layer neural networks, with optimality up to a factor of approximately 2. These results are broader than prior work and hold promise for extending to deeper models and other architectures.
Abstract:
Determining the memory capacity of two-layer neural networks with m hidden neurons and input dimension d (i.e., md+m total trainable parameters), which refers to the largest size of general data the network can memorize, is a fundamental machine-learning question. For non-polynomial real analytic activation functions, such as sigmoids and smoothed rectified linear units (smoothed ReLUs), we establish a lower bound of md/2 and optimality up to a factor of approximately 2. Analogous prior results were limited to Heaviside and ReLU activations, with results for smooth activations suffering from logarithmic factors and requiring random data. To analyze the memory capacity, we examine the rank of the network's Jacobian by computing the rank of matrices involving both Hadamard powers and the Khati-Rao product. Our computation extends classical linear algebraic facts about the rank of Hadamard powers. Overall, our approach differs from previous works on memory capacity and holds promise for extending to deeper models and other architectures.
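A small numerical illustration (not the paper's proof) of the two rank facts the argument builds on: Hadamard powers and Khatri-Rao products can raise the rank of a matrix well beyond the ranks of their factors. The specific sizes are arbitrary.

```python
import numpy as np
from scipy.linalg import khatri_rao

rng = np.random.default_rng(0)

W = rng.standard_normal((10, 2)) @ rng.standard_normal((2, 20))  # rank 2
H2 = W * W                                 # Hadamard (elementwise) square
print(np.linalg.matrix_rank(W))            # 2
print(np.linalg.matrix_rank(H2))           # 3 = binom(2+1, 2), generically

A = rng.standard_normal((3, 20))
B = rng.standard_normal((4, 20))
KR = khatri_rao(A, B)                      # columns a_j (x) b_j, shape (12, 20)
print(np.linalg.matrix_rank(KR))           # 12: generic rank-one products span R^{3x4}
```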
On the Transition from Neural Representation to Symbolic Knowledge
results: Extensive experiments on 3 abstract compositional visual object datasets show that the learned representation enables interpretable decomposition of visual input and smooth adaptation to downstream tasks.
Abstract:
Bridging the huge disparity between neural and symbolic representation can potentially enable the incorporation of symbolic thinking into neural networks from essence. Motivated by how humans gradually build complex symbolic representations from prototype symbols learned through perception and environmental interactions, we propose a Neural-Symbolic Transitional Dictionary Learning (TDL) framework that employs an EM algorithm to learn a transitional representation of data that compresses high-dimension information of visual parts of an input into a set of tensors as neural variables and discovers the implicit predicate structure in a self-supervised way. We implement the framework with a diffusion model by regarding the decomposition of input as a cooperative game, then learn predicates by prototype clustering. We additionally use RL, enabled by the Markovian nature of diffusion models, to further tune the learned prototypes by incorporating subjective factors. Extensive experiments on 3 abstract compositional visual objects datasets that require the model to segment parts without any visual features like texture, color, or shadows apart from shape, and 3 neural/symbolic downstream tasks, demonstrate the learned representation enables interpretable decomposition of visual input and smooth adaption to downstream tasks which are not available by existing methods.
Explainable unsupervised multi-modal image registration using deep networks
for: This paper proposes a deep-learning-based multi-modality MRI image registration method to improve image alignment in clinical applications.
methods: The paper uses a deep-learning-based image processing pipeline covering both intra- and inter-modality MRI registration, together with a Grad-CAM-based explainability framework.
results: The method achieves superior performance by registering images across different modalities and time points and aligning images across different organs and modalities; in addition, the Grad-CAM-based explainability framework accurately explains the model-data behaviour.
Abstract
Clinical decision making from magnetic resonance imaging (MRI) combines complementary information from multiple MRI sequences (defined as 'modalities'). MRI image registration aims to geometrically 'pair' diagnoses from different modalities, time points and slices. Both intra- and inter-modality MRI registration are essential components in clinical MRI settings. Further, an MRI image processing pipeline that can address both affine and non-rigid registration is critical, as both types of deformations may be occurring in real MRI data scenarios. Unlike image classification, explainability is not commonly addressed in image registration deep learning (DL) methods, as it is challenging to interpret model-data behaviours against transformation fields. To properly address this, we incorporate Grad-CAM-based explainability frameworks in each major component of our unsupervised multi-modal and multi-organ image registration DL methodology. We previously demonstrated that we were able to reach superior performance (against the current standard SyN method). In this work, we show that our DL model becomes fully explainable, setting the framework to generalise our approach on further medical imaging data.
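For readers unfamiliar with Grad-CAM, a minimal PyTorch sketch of the general mechanism follows; the model, target layer, and loss function are placeholders, not the authors' pipeline. A scalar loss (e.g., a registration similarity loss) is back-propagated to a convolutional layer, and gradient-weighted activations form a saliency map.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, inputs, loss_fn):
    """Generic Grad-CAM: attribute a scalar loss to a conv layer's activations."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    loss = loss_fn(model(inputs))        # e.g., a registration similarity loss
    model.zero_grad()
    loss.backward()
    h1.remove(); h2.remove()
    w = grads["g"].mean(dim=(2, 3), keepdim=True)   # channel-wise importance weights
    cam = F.relu((w * acts["a"]).sum(dim=1))        # weighted activation map
    cam = cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-8)
    return cam                                      # (N, H, W) saliency map
```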
CartiMorph: a framework for automated knee articular cartilage morphometrics
results: By performing automated segmentation, template construction, and template-to-image registration on knee images, the paper obtains quantitative cartilage metrics, including the percentage of full-thickness cartilage loss (FCL), cartilage thickness, surface area, and volume. A comparison of the cartilage thickness maps shows smaller errors in thin and peripheral regions.
Abstract
We introduce CartiMorph, a framework for automated knee articular cartilage morphometrics. It takes an image as input and generates quantitative metrics for cartilage subregions, including the percentage of full-thickness cartilage loss (FCL), mean thickness, surface area, and volume. CartiMorph leverages the power of deep learning models for hierarchical image feature representation. Deep learning models were trained and validated for tissue segmentation, template construction, and template-to-image registration. We established methods for surface-normal-based cartilage thickness mapping, FCL estimation, and rule-based cartilage parcellation. Our cartilage thickness map showed less error in thin and peripheral regions. We evaluated the effectiveness of the adopted segmentation model by comparing the quantitative metrics obtained from model segmentation and those from manual segmentation. The root-mean-squared deviation of the FCL measurements was less than 8%, and strong correlations were observed for the mean thickness (Pearson's correlation coefficient $\rho \in [0.82,0.97]$), surface area ($\rho \in [0.82,0.98]$) and volume ($\rho \in [0.89,0.98]$) measurements. We compared our FCL measurements with those from a previous study and found that our measurements deviated less from the ground truths. We observed superior performance of the proposed rule-based cartilage parcellation method compared with the atlas-based approach. CartiMorph has the potential to promote imaging biomarkers discovery for knee osteoarthritis.
Unmasking Parkinson’s Disease with Smile: An AI-enabled Screening Framework
results: The researchers find that features from the smiling videos alone achieve comparable performance, even on two external test sets never seen during training, suggesting that PD risk assessment may be possible from smiling selfie videos.
Abstract
Parkinson's disease (PD) diagnosis remains challenging due to the lack of a reliable biomarker and limited access to clinical care. In this study, we present an analysis of the largest video dataset containing micro-expressions to screen for PD. We collected 3,871 videos from 1,059 unique participants, including 256 self-reported PD patients. The recordings are from diverse sources encompassing participants' homes across multiple countries, a clinic, and a PD care facility in the US. Leveraging facial landmarks and action units, we extracted features relevant to Hypomimia, a prominent symptom of PD characterized by reduced facial expressions. An ensemble of AI models trained on these features achieved an accuracy of 89.7% and an Area Under the Receiver Operating Characteristic (AUROC) of 89.3% while being free from detectable bias across population subgroups based on sex and ethnicity on held-out data. Further analysis reveals that features from the smiling videos alone lead to comparable performance, even on two external test sets the model has never seen during training, suggesting the potential for PD risk assessment from smiling selfie videos.
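As a rough illustration of the overall recipe, extract tabular features per video and train an ensemble on them. The sklearn sketch below uses stand-in random data; the feature set, model choices, and hyperparameters are our assumptions, not the paper's.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X: per-video features (e.g., action-unit statistics); y: PD labels (0/1).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 32)), rng.integers(0, 2, 200)  # stand-in data

ensemble = VotingClassifier(
    estimators=[
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("rf", RandomForestClassifier(n_estimators=300, random_state=0)),
    ],
    voting="soft",  # average predicted probabilities across models
)
print(cross_val_score(ensemble, X, y, cv=5, scoring="roc_auc").mean())
```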
Domain specificity and data efficiency in typo tolerant spell checkers: the case of search in online marketplaces
paper_authors: Dayananda Ubrangala, Juhi Sharma, Ravi Prasad Kondapalli, Kiran R, Amit Agarwala, Laurent Boué
for: correction of typographical errors in online marketplaces
methods: data augmentation, training of recurrent neural network for context-limited domain-specific embeddings
results: real-time inferencing API for finding the closest match between misspelled user queries and available product names, with high accuracy using controlled and high-quality synthetic data.
Abstract
Typographical errors are a major source of frustration for visitors of online marketplaces. Because of the domain-specific nature of these marketplaces and the very short queries users tend to search for, traditional spell checking solutions do not perform well in correcting typos. We present a data augmentation method to address the lack of annotated typo data and train a recurrent neural network to learn context-limited domain-specific embeddings. Those embeddings are deployed in a real-time inferencing API for the Microsoft AppSource marketplace to find the closest match between a misspelled user query and the available product names. Our data-efficient solution shows that controlled high quality synthetic data may be a powerful tool especially considering the current climate of large language models which rely on prohibitively huge and often uncontrolled datasets.
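A minimal sketch of the kind of typo data augmentation described above; the edit operations and their mix are our assumptions, not the paper's exact scheme.

```python
import random

def make_typo(word: str, rng: random.Random) -> str:
    """Inject one keyboard-style edit: deletion, swap, duplication, or substitution."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    op = rng.choice(["delete", "swap", "duplicate", "substitute"])
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "swap":
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "duplicate":
        return word[:i] + word[i] + word[i:]
    return word[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + word[i + 1:]

rng = random.Random(42)
product = "power bi connector"                               # hypothetical product name
pairs = [(make_typo(product, rng), product) for _ in range(5)]  # (noisy, clean) pairs
print(pairs)
```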
Synthesising Rare Cataract Surgery Samples with Guided Diffusion Models
paper_authors: Yannik Frisch, Moritz Fuchs, Antoine Sanner, Felix Anton Ucar, Marius Frenzel, Joana Wasielica-Poslednik, Adrian Gericke, Felix Mathias Wagner, Thomas Dratsch, Anirban Mukhopadhyay
for: This paper aims to address the challenges of gathering and annotating data for training automated assistance systems for cataract surgery, by analyzing cataract surgery video data and utilizing a conditional generative model to synthesize diverse, high-quality examples of surgical phases and tool use.
methods: The authors use a conditional generative model based on Denoising Diffusion Implicit Models (DDIM) and Classifier-Free Guidance (CFG) to synthesize realistic examples of surgical phases and tool use, addressing the imbalances and data sparsity issues in the publicly available data.
results: The authors demonstrate that their approach can generate valuable unseen examples, allowing the tool classifier to improve by up to 10% for rare cases, and provide a reliable source of realistic synthetic data for the development of automated assistance systems for cataract surgery.
Abstract
Cataract surgery is a frequently performed procedure that demands automation and advanced assistance systems. However, gathering and annotating data for training such systems is resource intensive. The publicly available data also comprises severe imbalances inherent to the surgical process. Motivated by this, we analyse cataract surgery video data for the worst-performing phases of a pre-trained downstream tool classifier. The analysis demonstrates that imbalances deteriorate the classifier's performance on underrepresented cases. To address this challenge, we utilise a conditional generative model based on Denoising Diffusion Implicit Models (DDIM) and Classifier-Free Guidance (CFG). Our model can synthesise diverse, high-quality examples based on complex multi-class multi-label conditions, such as surgical phases and combinations of surgical tools. We affirm that the synthesised samples display tools that the classifier recognises. These samples are hard to differentiate from real images, even for clinical experts with more than five years of experience. Further, our synthetically extended data can improve the data sparsity problem for the downstream task of tool classification. The evaluations demonstrate that the model can generate valuable unseen examples, allowing the tool classifier to improve by up to 10% for rare cases. Overall, our approach can facilitate the development of automated assistance systems for cataract surgery by providing a reliable source of realistic synthetic data, which we make available for everyone.
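For context, classifier-free guidance combines the conditional and unconditional noise predictions at each sampling step; in the usual notation, with guidance weight $w$ and condition $c$ (here, e.g., a surgical phase/tool label),

$$\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w \bigl( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \bigr),$$

and the deterministic DDIM update is then run with $\tilde{\epsilon}_\theta$ in place of the plain noise estimate.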
Aligning Agent Policy with Externalities: Reward Design via Bilevel RL
results: The paper proposes Principal driven Policy Alignment via Bilevel RL (PPA-BRL), which can effectively align the agent's policy with the principal's goals. The authors also prove the convergence of PPA-BRL and illustrate the merits of the framework with several examples.
Abstract
In reinforcement learning (RL), a reward function is often assumed at the outset of a policy optimization procedure. Learning in such a fixed reward paradigm in RL can neglect important policy optimization considerations, such as state space coverage and safety. Moreover, it can fail to encompass broader impacts in terms of social welfare, sustainability, or market stability, potentially leading to undesirable emergent behavior and misaligned policy. To mathematically encapsulate the problem of aligning RL policy optimization with such externalities, we consider a bilevel optimization problem and connect it to a principal-agent framework, where the principal specifies the broader goals and constraints of the system at the upper level and the agent solves a Markov Decision Process (MDP) at the lower level. The upper level deals with learning a suitable reward parametrization corresponding to the broader goals, and the lower level deals with learning the policy for the agent. We propose Principal driven Policy Alignment via Bilevel RL (PPA-BRL), which efficiently aligns the policy of the agent with the principal's goals. We explicitly analyze the dependence of the principal's trajectory on the lower-level policy and prove the convergence of PPA-BRL to a stationary point of the problem. We illustrate the merits of this framework in view of alignment with several examples spanning energy-efficient manipulation tasks, social welfare-based tax design, and cost-effective robotic navigation.
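In schematic form (our notation, not necessarily the paper's), the principal-agent bilevel problem reads

$$\max_{\nu} \; U\bigl(\pi^*(\nu)\bigr) \quad \text{s.t.} \quad \pi^*(\nu) \in \arg\max_{\pi} \; \mathbb{E}_{\pi}\Bigl[\sum_{t=0}^{\infty} \gamma^t \, r_{\nu}(s_t, a_t)\Bigr],$$

where $\nu$ parameterizes the reward, $U$ encodes the principal's broader goals and constraints, and the lower level is a standard MDP solved by the agent.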
Reasoning in Large Language Models Through Symbolic Math Word Problems
results: The researchers find that self-prompting enables the LLM to provide an accurate and verifiable reasoning, and that it also improves the symbolic accuracy, yielding an ensembling effect.
Abstract
Large language models (LLMs) have revolutionized NLP by solving downstream tasks with little to no labeled data. Despite their versatile abilities, the larger question of their ability to reason remains ill-understood. This paper addresses reasoning in math word problems (MWPs) by studying symbolic versions of the numeric problems, since a symbolic expression is a "concise explanation" of the numeric answer. We create and use a symbolic version of the SVAMP dataset and find that GPT-3's davinci-002 model also has good zero-shot accuracy on symbolic MWPs. To evaluate the faithfulness of the model's reasoning, we go beyond accuracy and additionally evaluate the alignment between the final answer and the outputted reasoning, which correspond to numeric and symbolic answers respectively for MWPs. We explore a self-prompting approach to encourage the symbolic reasoning to align with the numeric answer, thus equipping the LLM with the ability to provide a concise and verifiable reasoning and making it more interpretable. Surprisingly, self-prompting also improves the symbolic accuracy to be higher than both the numeric and symbolic accuracies, thus providing an ensembling effect. The SVAMP_Sym dataset will be released for future research on symbolic math problems.
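The toy snippet below illustrates the self-prompting idea; the prompt wording and example are ours, not taken from the paper. The model's own numeric answer is placed back into the prompt before the symbolic question is asked, nudging the symbolic reasoning to agree with it.

```python
# Hypothetical illustration of the symbolic-MWP setup (wording is ours).
numeric_q = "John had 5 apples and bought 3 more. How many apples does he have?"
symbolic_q = "John had w apples and bought x more. How many apples does he have?"

# Self-prompting: show the model its own numeric answer, then ask for the
# symbolic expression, encouraging the two to align.
numeric_answer = "8"
prompt = (
    f"Q: {numeric_q}\nA: {numeric_answer}\n\n"
    f"Now solve the same problem symbolically.\n"
    f"Q: {symbolic_q}\nA:"
)
print(prompt)  # expected symbolic answer: w + x
```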
Revisiting Deformable Convolution for Depth Completion
results: Evaluated on the large-scale KITTI dataset, the model achieves state-of-the-art performance in both accuracy and inference speed.
Abstract
Depth completion, which aims to generate high-quality dense depth maps from sparse depth maps, has attracted increasing attention in recent years. Previous work usually employs RGB images as guidance, and introduces iterative spatial propagation to refine estimated coarse depth maps. However, most of the propagation refinement methods require several iterations and suffer from a fixed receptive field, which may contain irrelevant and useless information with very sparse input. In this paper, we address these two challenges simultaneously by revisiting the idea of deformable convolution. We propose an effective architecture that leverages deformable kernel convolution as a single-pass refinement module, and empirically demonstrate its superiority. To better understand the function of deformable convolution and exploit it for depth completion, we further systematically investigate a variety of representative strategies. Our study reveals that, different from prior work, deformable convolution needs to be applied on an estimated depth map with a relatively high density for better performance. We evaluate our model on the large-scale KITTI dataset and achieve state-of-the-art level performance in both accuracy and inference speed. Our code is available at https://github.com/AlexSunNik/ReDC.
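A minimal sketch of a single-pass deformable refinement module using torchvision's DeformConv2d follows; the channel sizes, offset predictor, and residual connection are illustrative assumptions rather than the paper's ReDC architecture.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableRefiner(nn.Module):
    """Single-pass refinement sketch: sampling offsets are predicted from the coarse depth."""
    def __init__(self, feat_ch: int = 32, k: int = 3):
        super().__init__()
        self.encode = nn.Conv2d(1, feat_ch, 3, padding=1)
        self.offset = nn.Conv2d(feat_ch, 2 * k * k, 3, padding=1)  # (dy, dx) per kernel tap
        self.deform = DeformConv2d(feat_ch, 1, k, padding=k // 2)

    def forward(self, coarse_depth: torch.Tensor) -> torch.Tensor:
        f = torch.relu(self.encode(coarse_depth))
        off = self.offset(f)                        # learned, content-adaptive sampling
        return coarse_depth + self.deform(f, off)   # residual refinement

refiner = DeformableRefiner()
out = refiner(torch.rand(1, 1, 64, 64))  # refined dense depth, same resolution
```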
How many preprints have actually been printed and why: a case study of computer science preprints on arXiv
results: The study finds that 66% of the sampled preprints are published under unchanged titles and 11% under different titles with other modifications. Further analysis shows that published preprints feature adequate revisions, multiple authors, detailed abstracts and introductions, extensive references, and available source code.
Abstract
Preprints play an increasingly critical role in academic communities. There are many reasons driving researchers to post their manuscripts to preprint servers before formal submission to journals or conferences, but the use of preprints has also sparked considerable controversy, especially surrounding the claim of priority. In this paper, a case study of computer science preprints submitted to arXiv from 2008 to 2017 is conducted to quantify how many preprints have eventually been printed in peer-reviewed venues. Among those published manuscripts, some are published under different titles and without an update to their preprints on arXiv. In the case of these manuscripts, the traditional fuzzy matching method is incapable of mapping the preprint to the final published version. In view of this issue, we introduce a semantics-based mapping method with the employment of Bidirectional Encoder Representations from Transformers (BERT). With this new mapping method and a plurality of data sources, we find that 66% of all sampled preprints are published under unchanged titles and 11% are published under different titles and with other modifications. A further analysis was then performed to investigate why these preprints but not others were accepted for publication. Our comparison reveals that in the field of computer science, published preprints feature adequate revisions, multiple authorship, detailed abstract and introduction, extensive and authoritative references and available source code.
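A compact sketch of such semantics-based matching, using the sentence-transformers library; the model name, example texts, and threshold are illustrative assumptions, and the paper's exact BERT pipeline may differ.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any BERT-style sentence encoder

preprints = ["Learning to rank with weak supervision ..."]       # e.g., title + abstract
published = ["Weakly supervised learning-to-rank ...",
             "A survey of graph neural networks ..."]

e_pre = model.encode(preprints, convert_to_tensor=True)
e_pub = model.encode(published, convert_to_tensor=True)
sim = util.cos_sim(e_pre, e_pub)       # semantic similarity matrix
best = sim.argmax(dim=1)               # candidate published version per preprint
is_match = sim[0, best[0]] > 0.8       # threshold would be tuned on labeled pairs
print(best, is_match)
```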
Improving Replay Sample Selection and Storage for Less Forgetting in Continual Learning
results: The study yields several useful findings, including a new approach for choosing which samples to store and a detailed analysis that helps determine the optimal number of stored samples.
Abstract
Continual learning seeks to enable deep learners to train on a series of tasks of unknown length without suffering from catastrophic forgetting of previous tasks. One effective solution is replay, which involves storing a few previous experiences in memory and replaying them when learning the current task. However, there is still room for improvement when it comes to selecting the most informative samples for storage and determining the optimal number of samples to be stored. This study aims to address these issues with a novel comparison of the commonly used reservoir sampling to various alternative population strategies, and provides a novel detailed analysis of how to find the optimal number of stored samples.
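For reference, the reservoir sampling baseline the study compares against can be written in a few lines; after processing n items, the buffer is a uniform random sample of everything observed so far.

```python
import random

def reservoir_update(buffer, capacity, item, n_seen, rng=random):
    """Classic reservoir sampling (Algorithm R): keep a uniform sample of a stream."""
    if len(buffer) < capacity:
        buffer.append(item)
    else:
        j = rng.randrange(n_seen)      # uniform index in [0, n_seen)
        if j < capacity:
            buffer[j] = item           # replace with probability capacity / n_seen

buffer, cap = [], 100
for n, sample in enumerate(range(10_000), start=1):
    reservoir_update(buffer, cap, sample, n)
```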
Exact identification of nonlinear dynamical systems by Trimmed Lasso
results: The study shows that the Trimmed Lasso can provide exact model recovery under finite and noisy data and can handle multicollinearity, whereas SINDy and reweighted $\ell_1$ minimization struggle in these settings.
Abstract
Identification of nonlinear dynamical systems has been popularized by sparse identification of the nonlinear dynamics (SINDy) via the sequentially thresholded least squares (STLS) algorithm. Many extensions of SINDy have emerged in the literature to deal with experimental data which are finite in length and noisy. Recently, the computationally intensive method of ensembling bootstrapped SINDy models (E-SINDy) was proposed for model identification, handling finite, highly noisy data. While the extensions of SINDy are numerous, their sparsity-promoting estimators occasionally provide sparse approximations of the dynamics as opposed to exact recovery. Furthermore, these estimators suffer under multicollinearity, e.g. the irrepresentable condition for the Lasso. In this paper, we demonstrate that the Trimmed Lasso for robust identification of models (TRIM) can provide exact recovery under more severe noise, finite data, and multicollinearity as opposed to E-SINDy. Additionally, the computational cost of TRIM is asymptotically equal to STLS since the sparsity parameter of TRIM can be solved efficiently by convex solvers. We compare these methodologies on challenging nonlinear systems, specifically the Lorenz 63 system, the Bouc Wen oscillator from the nonlinear dynamics benchmark of No\"el and Schoukens, 2016, and a time delay system describing tool cutting dynamics. This study emphasizes the comparisons between STLS, reweighted $\ell_1$ minimization, and Trimmed Lasso in identification with respect to problems faced by practitioners: the problem of finite and noisy data, the performance of the sparse regression when the library grows in dimension (multicollinearity), and automatic methods for the choice of regularization parameters.
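For orientation, the STLS baseline referenced above can be sketched in a few lines of numpy. This is a simplified version under our own assumptions (single global threshold, no regularization of the re-fits), not the paper's implementation.

```python
import numpy as np

def stls(Theta: np.ndarray, dXdt: np.ndarray, lam: float, iters: int = 10):
    """Sequentially thresholded least squares (the SINDy baseline):
    alternate least squares with hard-thresholding of small coefficients.
    Theta: (m, p) candidate-function library; dXdt: (m, d) derivatives."""
    Xi = np.linalg.lstsq(Theta, dXdt, rcond=None)[0]
    for _ in range(iters):
        small = np.abs(Xi) < lam
        Xi[small] = 0.0
        for k in range(dXdt.shape[1]):               # re-fit surviving terms per state
            big = ~small[:, k]
            if big.any():
                Xi[big, k] = np.linalg.lstsq(
                    Theta[:, big], dXdt[:, k], rcond=None)[0]
    return Xi  # sparse coefficients over the candidate library
```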
DualCoOp++: Fast and Effective Adaptation to Multi-Label Recognition with Limited Annotations
methods: We leverage the strong alignment between textual and visual features pretrained on millions of auxiliary image-text pairs, and propose an efficient and effective framework, Evidence-guided Dual Context Optimization (DualCoOp++), to address partial-label and zero-shot multi-label recognition.
results: Experiments on standard multi-label recognition benchmarks show that our method outperforms state-of-the-art approaches in low-label settings.
Abstract
Multi-label image recognition in the low-label regime is a task of great challenge and practical significance. Previous works have focused on learning the alignment between textual and visual spaces to compensate for limited image labels, yet may suffer from reduced accuracy due to the scarcity of high-quality multi-label annotations. In this research, we leverage the powerful alignment between textual and visual features pretrained with millions of auxiliary image-text pairs. We introduce an efficient and effective framework called Evidence-guided Dual Context Optimization (DualCoOp++), which serves as a unified approach for addressing partial-label and zero-shot multi-label recognition. In DualCoOp++ we separately encode evidential, positive, and negative contexts for target classes as parametric components of the linguistic input (i.e., prompts). The evidential context aims to discover all the related visual content for the target class, and serves as guidance to aggregate positive and negative contexts from the spatial domain of the image, enabling better distinguishment between similar categories. Additionally, we introduce a Winner-Take-All module that promotes inter-class interaction during training, while avoiding the need for extra parameters and costs. As DualCoOp++ imposes minimal additional learnable overhead on the pretrained vision-language framework, it enables rapid adaptation to multi-label recognition tasks with limited annotations and even unseen classes. Experiments on standard multi-label recognition benchmarks across two challenging low-label settings demonstrate the superior performance of our approach compared to state-of-the-art methods.
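A toy sketch of the dual-context idea follows. It is a deliberately simplified stand-in: the real DualCoOp++ learns prompt tokens fed through a text encoder and aggregates spatial evidence with an evidential context, both of which are omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualContextHead(nn.Module):
    """Toy dual-context head: each class gets a positive and a negative learnable
    embedding; a label is predicted present when the image feature is closer to
    the positive context. All dimensions are placeholders."""
    def __init__(self, num_classes: int, dim: int = 512):
        super().__init__()
        self.pos = nn.Parameter(torch.randn(num_classes, dim) * 0.02)
        self.neg = nn.Parameter(torch.randn(num_classes, dim) * 0.02)

    def forward(self, img_feat: torch.Tensor) -> torch.Tensor:
        f = F.normalize(img_feat, dim=-1)               # (N, dim)
        s_pos = f @ F.normalize(self.pos, dim=-1).T     # (N, C) positive similarity
        s_neg = f @ F.normalize(self.neg, dim=-1).T     # (N, C) negative similarity
        return s_pos - s_neg   # > 0 means the label is predicted present

head = DualContextHead(num_classes=80)
logits = head(torch.randn(4, 512))
```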
Cream Skimming the Underground: Identifying Relevant Information Points from Online Forums
paper_authors: Felipe Moreno-Vera, Mateus Nogueira, Cainã Figueiredo, Daniel Sadoc Menasché, Miguel Bicudo, Ashton Woiwood, Enrico Lovat, Anton Kocheturov, Leandro Pfleger de Aguiar
results: The study finds that threads and posts can be classified automatically with a random forest, achieving accuracy, precision, and recall above 0.99. It also analyzes the difference between weaponization and exploitation, along with profits and other aspects of the hacking communities.
Abstract
This paper proposes a machine learning-based approach for detecting the exploitation of vulnerabilities in the wild by monitoring underground hacking forums. The increasing volume of posts discussing exploitation in the wild calls for an automatic approach to process threads and posts that will eventually trigger alarms depending on their content. To illustrate the proposed system, we use the CrimeBB dataset, which contains data scraped from multiple underground forums, and develop a supervised machine learning model that can filter threads citing CVEs and label them as Proof-of-Concept, Weaponization, or Exploitation. Leveraging random forests, we indicate that accuracy, precision and recall above 0.99 are attainable for the classification task. Additionally, we provide insights into the difference in nature between weaponization and exploitation, e.g., interpreting the output of a decision tree, and analyze the profits and other aspects related to the hacking communities. Overall, our work sheds insight into the exploitation of vulnerabilities in the wild and can be used to provide additional ground truth to models such as EPSS and Expected Exploitability.
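As a rough sketch of the classification stage (the feature representation, model choice, and example data are our assumptions, not the paper's exact setup):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# posts: raw thread text; labels: Proof-of-Concept / Weaponization / Exploitation
posts = ["PoC attached for CVE-2021-1234 ...",        # hypothetical examples
         "selling a working exploit, tested in the wild ..."]
labels = ["proof-of-concept", "exploitation"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=1)),
    ("rf", RandomForestClassifier(n_estimators=300, random_state=0)),
])
clf.fit(posts, labels)
print(clf.predict(["new CVE-2022-0001 exploit spotted in the wild"]))
```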
Statistical Estimation Under Distribution Shift: Wasserstein Perturbations and Minimax Theory
results: The study finds that for mean estimation and linear regression, the sample mean and least squares estimators are optimal. For other problems, nearly optimal estimators and precise finite-sample bounds are provided. The paper also introduces tools for lower-bounding the minimax risk under distribution shift, such as a smoothing technique and generalizations of classical tools.
Abstract
Distribution shifts are a serious concern in modern statistical learning as they can systematically change the properties of the data away from the truth. We focus on Wasserstein distribution shifts, where every data point may undergo a slight perturbation, as opposed to the Huber contamination model where a fraction of observations are outliers. We formulate and study shifts beyond independent perturbations, exploring Joint Distribution Shifts, where the per-observation perturbations can be coordinated. We analyze several important statistical problems, including location estimation, linear regression, and non-parametric density estimation. Under a squared loss for mean estimation and prediction error in linear regression, we find the exact minimax risk, a least favorable perturbation, and show that the sample mean and least squares estimators are respectively optimal. This holds for both independent and joint shifts, but the least favorable perturbations and minimax risks differ. For other problems, we provide nearly optimal estimators and precise finite-sample bounds. We also introduce several tools for bounding the minimax risk under distribution shift, such as a smoothing technique for location families, and generalizations of classical tools including least favorable sequences of priors, the modulus of continuity, Le Cam's, Fano's, and Assouad's methods.
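In one common formalization (our notation, not necessarily the paper's), each clean observation $X_i \sim P$ is observed as $\tilde{X}_i = X_i + \delta_i$, and the minimax risk for, e.g., mean estimation under a joint perturbation budget is

$$\inf_{\hat{\theta}} \; \sup_{P \in \mathcal{P}} \; \sup_{\frac{1}{n}\sum_{i=1}^{n} \|\delta_i\|^2 \le \epsilon^2} \; \mathbb{E}\Bigl[\bigl(\hat{\theta}(\tilde{X}_1, \dots, \tilde{X}_n) - \theta(P)\bigr)^2\Bigr].$$

Heuristically, shifting every point by a common vector of norm $\epsilon$ moves the sample mean by $\epsilon$, so the risk picks up an adversarial term of order $\epsilon^2$ on top of the usual statistical error; independent shifts instead constrain each $\|\delta_i\|$ separately, which changes the least favorable perturbation.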
Curricular Transfer Learning for Sentence Encoded Tasks
results: The method achieves a considerable improvement on the MultiWoZ task compared to other known pre-training approaches.
Abstract
Fine-tuning language models on a downstream task is the standard approach for many state-of-the-art methodologies in the field of NLP. However, when the distribution between the source task and target task drifts, \textit{e.g.}, in conversational environments, these gains tend to be diminished. This article proposes a sequence of pre-training steps (a curriculum) guided by "data hacking" and grammar analysis that allows further gradual adaptation between pre-training distributions. In our experiments, we achieve a considerable improvement from our method compared to other known pre-training approaches for the MultiWoZ task.
Probabilistic Deep Supervision Network: A Noise-Resilient Approach for QoS Prediction
results: Experimental evaluations on two real-world QoS datasets demonstrate the effectiveness of our method, which outperforms state-of-the-art baselines.
Abstract
Quality of Service (QoS) prediction is an essential task in recommendation systems, where accurately predicting unknown QoS values can improve user satisfaction. However, existing QoS prediction techniques may perform poorly in the presence of noise data, such as fake location information or virtual gateways. In this paper, we propose the Probabilistic Deep Supervision Network (PDS-Net), a novel framework for QoS prediction that addresses this issue. PDS-Net utilizes a Gaussian-based probabilistic space to supervise intermediate layers and learns probability spaces for both known features and true labels. Moreover, PDS-Net employs a condition-based multitasking loss function to identify objects with noise data and applies supervision directly to deep features sampled from the probability space by optimizing the Kullback-Leibler distance between the probability space of these objects and the real-label probability space. Thus, PDS-Net effectively reduces errors resulting from the propagation of corrupted data, leading to more accurate QoS predictions. Experimental evaluations on two real-world QoS datasets demonstrate that the proposed PDS-Net outperforms state-of-the-art baselines, validating the effectiveness of our approach.
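A toy rendering of the probabilistic supervision signal follows; the shapes, masking rule, and loss weighting are our assumptions, not the paper's architecture. An intermediate layer predicts a Gaussian over deep features, which is pulled toward a Gaussian fit to the true-label space via KL divergence, applied only to objects flagged as noisy.

```python
import torch
from torch.distributions import Normal, kl_divergence

# Feature-space and label-space Gaussians predicted for a batch of 8 objects.
mu_f, logvar_f = torch.randn(8, 16), torch.zeros(8, 16)
mu_y, logvar_y = torch.randn(8, 16), torch.zeros(8, 16)

p_feat = Normal(mu_f, (0.5 * logvar_f).exp())
p_label = Normal(mu_y, (0.5 * logvar_y).exp())

kl = kl_divergence(p_feat, p_label).sum(dim=-1)     # per-sample KL
mask = (torch.rand(8) > 0.3).float()                # e.g., samples flagged as noisy
loss = (kl * mask).mean()                           # supervise only flagged objects
```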
results: We validate our approach on several diverse machine learning tasks with various input representations and demonstrate the importance of generating adversarial examples for building safe and reliable AI systems.
Abstract
Machine learning models are known to be vulnerable to adversarial evasion attacks, as illustrated by image classification models. Thoroughly understanding such attacks is critical in order to ensure the safety and robustness of critical AI tasks. However, most evasion attacks are difficult to deploy against a majority of AI systems because they have focused on the image domain with only a few constraints. An image is composed of homogeneous, numerical, continuous, and independent features, unlike many other input types to AI systems used in practice. Furthermore, some input types include additional semantic and functional constraints that must be observed to generate realistic adversarial inputs. In this work, we propose a new framework to enable the generation of adversarial inputs irrespective of the input type and task domain. Given an input and a set of pre-defined input transformations, our framework discovers a sequence of transformations that result in a semantically correct and functional adversarial input. We demonstrate the generality of our approach on several diverse machine learning tasks with various input representations. We also show the importance of generating adversarial examples as they enable the deployment of mitigation techniques.
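A greedy simplification of the search over transformation sequences might look as follows; the abstract does not specify the actual search strategy, so this is only a stand-in.

```python
def greedy_adversarial_search(x, transforms, predict, true_label, max_steps=10):
    """Greedy sketch of the transformation-sequence idea (our simplification):
    repeatedly apply the semantics-preserving transformation that most lowers
    the model's confidence in the true label, until the prediction flips."""
    current = x
    for _ in range(max_steps):
        candidates = [t(current) for t in transforms]
        current = min(candidates, key=lambda c: predict(c)[true_label])
        probs = predict(current)
        if max(range(len(probs)), key=probs.__getitem__) != true_label:
            return current            # functional adversarial input found
    return None                       # search budget exhausted

# `predict(x)` returns class probabilities; `transforms` are domain-specific
# callables that must preserve the input's semantic and functional constraints.
```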