results: 研究发现,这个synthetic dataset可以保持与原始 VoxCeleb2 集的相似性,并且可以用于下游的 speaker verification 任务中,但是也存在一些挑战,需要进一步的研究和改进。Abstract
The success of deep learning in speaker recognition relies heavily on the use of large datasets. However, the data-hungry nature of deep learning methods has already being questioned on account the ethical, privacy, and legal concerns that arise when using large-scale datasets of natural speech collected from real human speakers. For example, the widely-used VoxCeleb2 dataset for speaker recognition is no longer accessible from the official website. To mitigate these concerns, this work presents an initiative to generate a privacy-friendly synthetic VoxCeleb2 dataset that ensures the quality of the generated speech in terms of privacy, utility, and fairness. We also discuss the challenges of using synthetic data for the downstream task of speaker verification.
摘要
成功的深度学习在人脸识别中受到大量数据的支持。然而,深度学习方法的数据吃杂性已经被质疑,因为使用大规模自然语音收集到的真正人类说话者的数据会产生伦理、隐私和法律问题。例如,广泛使用的VoxCeleb2数据集已经从官方网站上下载不可达。为解决这些问题,本研究提出了一项隐私友好的synthetic VoxCeleb2数据生成Initative,确保生成的speech质量符合隐私、有用性和公平原则。我们还讨论了使用生成数据进行下游任务的说话识别的挑战。
Can large-scale vocoded spoofed data improve speech spoofing countermeasure with a self-supervised front end?
results: 实验表明,使用 vocoded 数据进行SSL模型的连续训练可以提高CM的总性能,并且使用新的SSL模型(即两个SSL模型的distilled)可以进一步提高CM的性能,特别是在面临未经见过的测试集上。Abstract
A speech spoofing countermeasure (CM) that discriminates between unseen spoofed and bona fide data requires diverse training data. While many datasets use spoofed data generated by speech synthesis systems, it was recently found that data vocoded by neural vocoders were also effective as the spoofed training data. Since many neural vocoders are fast in building and generation, this study used multiple neural vocoders and created more than 9,000 hours of vocoded data on the basis of the VoxCeleb2 corpus. This study investigates how this large-scale vocoded data can improve spoofing countermeasures that use data-hungry self-supervised learning (SSL) models. Experiments demonstrated that the overall CM performance on multiple test sets improved when using features extracted by an SSL model continually trained on the vocoded data. Further improvement was observed when using a new SSL distilled from the two SSLs before and after the continual training. The CM with the distilled SSL outperformed the previous best model on challenging unseen test sets, including the ASVspoof 2019 logical access, WaveFake, and In-the-Wild.
摘要
一种演讲 spoofing 防范措施(CM)需要多样化的训练数据。许多数据集使用由speech synthesis系统生成的假数据,但最近发现,由神经 vocoder生成的数据也是有效的假数据。由于神经 vocoder快速生成和生成,这项研究使用多个神经 vocoder,生成了基于 VoxCeleb2 库的 más than 9,000 小时的 vocoded 数据。这项研究研究如何使用这些大规模的 vocoded 数据提高 spoofing 防范措施,使用需要大量自我supervised learning(SSL)模型。实验表明,使用由 SSL 模型不断地训练于 vocoded 数据中提取的特征,可以提高 CM 的总体性能。此外,使用一个新的 SSL 模型,其中两个 SSL 模型在 перед和后 continual training 中分别被训练,可以进一步提高 CM 的性能。该 CM 在不同的难度测试集上,包括 ASVspoof 2019 逻辑访问、WaveFake 和 In-the-Wild,都表现出优于前一代最佳模型。
results: Achieves a 95% reduction in mean absolute error with a minimal increase in model size compared to the baseline model, PhonMatchNet.Here’s the text in Simplified Chinese:
for: 用于解决人机交互中的冲突场景。
methods: 利用隐式音频反射抑制(iAEC)技术提高用户定义关键词检测模型的效率。
results: 与基线模型PhonMatchNet相比,实现了95%的精度差异减少,模型大小增加0.13%。Abstract
In response to the increasing interest in human--machine communication across various domains, this paper introduces a novel approach called iPhonMatchNet, which addresses the challenge of barge-in scenarios, wherein user speech overlaps with device playback audio, thereby creating a self-referencing problem. The proposed model leverages implicit acoustic echo cancellation (iAEC) techniques to increase the efficiency of user-defined keyword spotting models, achieving a remarkable 95% reduction in mean absolute error with a minimal increase in model size (0.13%) compared to the baseline model, PhonMatchNet. We also present an efficient model structure and demonstrate its capability to learn iAEC functionality without requiring a clean signal. The findings of our study indicate that the proposed model achieves competitive performance in real-world deployment conditions of smart devices.
摘要
为了应对不同领域的人机交互需求,这篇论文提出了一种新的方法called iPhonMatchNet,用于解决撞壳场景,在用户语音与设备播放声音之间存在自referencing问题。提出的模型利用隐式音频降噪技术(iAEC)提高用户定义关键词检测模型的效率,实现了95%的平均绝对错误减少,同时模型体积增加0.13%,与基线模型PhonMatchNet相比。我们还提出了一种高效的模型结构,并证明它可以不需要干净信号来学习iAEC功能。我们的研究发现,该模型在智能设备实际应用中的实际应用中达到了竞争水平。
results: 根据实验结果,这篇论文获得了$1.25 \times$的速度增幅,并且仅受到$1.1%$的精度下降,对于ResNet18模型在ImageNet dataset上进行显像识别任务。此外,当与现有的结构化剔除技术结合使用时,所得到的模型具有良好的时间-精度贡献对话。Abstract
The demand for efficient processing of deep neural networks (DNNs) on embedded devices is a significant challenge limiting their deployment. Exploiting sparsity in the network's feature maps is one of the ways to reduce its inference latency. It is known that unstructured sparsity results in lower accuracy degradation with respect to structured sparsity but the former needs extensive inference engine changes to get latency benefits. To tackle this challenge, we propose a solution to induce semi-structured activation sparsity exploitable through minor runtime modifications. To attain high speedup levels at inference time, we design a sparse training procedure with awareness of the final position of the activations while computing the General Matrix Multiplication (GEMM). We extensively evaluate the proposed solution across various models for image classification and object detection tasks. Remarkably, our approach yields a speed improvement of $1.25 \times$ with a minimal accuracy drop of $1.1\%$ for the ResNet18 model on the ImageNet dataset. Furthermore, when combined with a state-of-the-art structured pruning method, the resulting models provide a good latency-accuracy trade-off, outperforming models that solely employ structured pruning techniques.
摘要
需求深度神经网络(DNNs)在嵌入式设备上高效处理是一大挑战,限制其部署。利用网络特征图中的稀疏性可以减少归并时间。然而,不结构化稀疏性会导致减少准确性,而结构化稀疏性则需要大量的执行引擎修改来获得时间优化。为解决这个挑战,我们提出了一种使用部分结构化活化稀疏性的解决方案。通过小改动 runtime,我们设计了一种稀疏训练方法,其中考虑了最终 activation 的位置,以实现高速度启发。我们对多种图像分类和物体检测任务进行了广泛的评估,并观察到了以下结果:我们的方法可以在 ResNet18 模型上 ImageNet dataset 上提供 $1.25 \times$ 的速度提升,同时减少 $1.1\%$ 的准确性下降。此外,当与现有的结构化剪枝方法相结合时,得到的模型可以提供优秀的时间-准确性质量平衡,超越具有结构化剪枝技术的模型。
Multi-dimensional Fusion and Consistency for Semi-supervised Medical Image Segmentation
results: 我们在多个常用的数据集上进行了广泛的实验,结果显示了我们的方法的优律性。Abstract
In this paper, we introduce a novel semi-supervised learning framework tailored for medical image segmentation. Central to our approach is the innovative Multi-scale Text-aware ViT-CNN Fusion scheme. This scheme adeptly combines the strengths of both ViTs and CNNs, capitalizing on the unique advantages of both architectures as well as the complementary information in vision-language modalities. Further enriching our framework, we propose the Multi-Axis Consistency framework for generating robust pseudo labels, thereby enhancing the semi-supervised learning process. Our extensive experiments on several widely-used datasets unequivocally demonstrate the efficacy of our approach.
摘要
在本文中,我们介绍了一种新的半监督学习框架,特化于医疗图像分割。我们的方法的核心是创新的多级文本意识混合方案。这种方案能够有效地结合维特和CNN的优势,同时利用视力语言模式中的补充信息。此外,我们还提出了多轴一致框架,以生成可靠的假标签,从而提高半监督学习过程的稳定性。我们对多个广泛使用的数据集进行了广泛的实验,结果表明我们的方法有效性。
Harmonic-NAS: Hardware-Aware Multimodal Neural Architecture Search on Resource-constrained Devices
results: 实验结果表明,与州态艺术 compare to现有方法,Harmonic-NAS可以 achieved up to 10.9%的准确率提升,1.91x的执行时间减少和2.14x的能效性提升。Abstract
The recent surge of interest surrounding Multimodal Neural Networks (MM-NN) is attributed to their ability to effectively process and integrate information from diverse data sources. In MM-NN, features are extracted and fused from multiple modalities using adequate unimodal backbones and specific fusion networks. Although this helps strengthen the multimodal information representation, designing such networks is labor-intensive. It requires tuning the architectural parameters of the unimodal backbones, choosing the fusing point, and selecting the operations for fusion. Furthermore, multimodality AI is emerging as a cutting-edge option in Internet of Things (IoT) systems where inference latency and energy consumption are critical metrics in addition to accuracy. In this paper, we propose Harmonic-NAS, a framework for the joint optimization of unimodal backbones and multimodal fusion networks with hardware awareness on resource-constrained devices. Harmonic-NAS involves a two-tier optimization approach for the unimodal backbone architectures and fusion strategy and operators. By incorporating the hardware dimension into the optimization, evaluation results on various devices and multimodal datasets have demonstrated the superiority of Harmonic-NAS over state-of-the-art approaches achieving up to 10.9% accuracy improvement, 1.91x latency reduction, and 2.14x energy efficiency gain.
摘要
最近,多模态神经网络(MM-NN)的兴趣增长可以追溯到它们能够有效地处理和融合来自多个数据源的信息。在 MM-NN 中,来自多个模态的特征被提取并融合使用适当的单模态背bone和特定的融合网络。虽然这有助于强化多模态信息表示,但设计这些网络是劳动密集的。需要调整网络的建筑 Parameters,选择融合点,并选择融合操作。此外,多模态 AI 在互联网东西(IoT)系统中出现为一个潮流选项,因为在这些系统中,推理延迟和能耗是关键指标,而不仅仅是准确率。在这篇论文中,我们提出了 Harmonic-NAS,一个框架,用于同时优化单模态背bone和多模态融合网络,并考虑硬件维度。Harmonic-NAS 使用两层优化方法,包括单模态背bone 架构和融合策略和操作的优化。通过将硬件维度integrated到优化中,对多个设备和多模态数据集进行评估,研究结果表明,Harmonic-NAS 比现有方法提高了10.9%的准确率,1.91x的推理延迟,和2.14x的能耗效率。
Zero-Shot Visual Classification with Guided Cropping
results: 我们的实验结果显示,GC-CLIP可以提高零 shot 分类结果,尤其是当物品覆盖小区域时。Abstract
Pretrained vision-language models, such as CLIP, show promising zero-shot performance across a wide variety of datasets. For closed-set classification tasks, however, there is an inherent limitation: CLIP image encoders are typically designed to extract generic image-level features that summarize superfluous or confounding information for the target tasks. This results in degradation of classification performance, especially when objects of interest cover small areas of input images. In this work, we propose CLIP with Guided Cropping (GC-CLIP), where we use an off-the-shelf zero-shot object detection model in a preprocessing step to increase focus of zero-shot classifier to the object of interest and minimize influence of extraneous image regions. We empirically show that our approach improves zero-shot classification results across architectures and datasets, favorably for small objects.
摘要
预训义视语模型,如CLIP,在各种数据集上显示出扩展性的 zeroshot性能。然而,对于封闭类型的分类任务,存在内在的限制:CLIP的图像编码器通常是为抽取普适图像水平特征,这些特征可能会损害目标任务的分类性能。这会导致分类性能下降,特别是当目标对象占输入图像中小区域时。在这种情况下,我们提出了GC-CLIP方法,其中我们使用一个预置的零shot对象检测模型来增强零shot分类器对目标对象的专注度,并最小化不相关图像区域的影响。我们经验表明,我们的方法可以提高零shot分类结果,并且对于小对象而言,效果更加明显。
AmodalSynthDrive: A Synthetic Amodal Perception Dataset for Autonomous Driving
paper_authors: Ahmed Rida Sekkat, Rohit Mohan, Oliver Sawade, Elmar Matthes, Abhinav Valada
for: AmodalSynthDrive is a synthetic multi-task multi-modal amodal perception dataset for autonomous driving research.
methods: The dataset provides multi-view camera images, 3D bounding boxes, LiDAR data, and odometry for 150 driving sequences with over 1M object annotations in diverse traffic, weather, and lighting conditions.
results: The dataset supports multiple amodal scene understanding tasks, including amodal depth estimation for enhanced spatial understanding. Several baselines are evaluated for each task to illustrate the challenges and set up public benchmarking servers.Abstract
Unlike humans, who can effortlessly estimate the entirety of objects even when partially occluded, modern computer vision algorithms still find this aspect extremely challenging. Leveraging this amodal perception for autonomous driving remains largely untapped due to the lack of suitable datasets. The curation of these datasets is primarily hindered by significant annotation costs and mitigating annotator subjectivity in accurately labeling occluded regions. To address these limitations, we introduce AmodalSynthDrive, a synthetic multi-task multi-modal amodal perception dataset. The dataset provides multi-view camera images, 3D bounding boxes, LiDAR data, and odometry for 150 driving sequences with over 1M object annotations in diverse traffic, weather, and lighting conditions. AmodalSynthDrive supports multiple amodal scene understanding tasks including the introduced amodal depth estimation for enhanced spatial understanding. We evaluate several baselines for each of these tasks to illustrate the challenges and set up public benchmarking servers. The dataset is available at http://amodalsynthdrive.cs.uni-freiburg.de.
摘要
不同于人类,现代计算机视觉算法仍然很难处理部分遮挡的物体。利用这种无模态感知对自动驾驶仍然具有很大的潜在优势,但由于缺乏适当的数据集,这一点仍未得到充分利用。为解决这些限制,我们介绍了AmodalSynthDrive,一个人工合成的多任务多模态无模态感知数据集。该数据集提供了多视图摄像头图像、3D包围框、LiDAR数据和卫星坐标数据,涵盖了150条驾驶序列,共计超过100万个物体标注,在多样化的交通、天气和照明条件下进行了多种交通情况。AmodalSynthDrive支持多种无模态场景理解任务,包括引入的无模态深度估计,以提高空间理解。我们评估了多个基线,以 Illustrate the challenges and set up public benchmarking servers。数据集可以在http://amodalsynthdrive.cs.uni-freiburg.de上下载。
Strong-Weak Integrated Semi-supervision for Unsupervised Single and Multi Target Domain Adaptation
results: 实验结果显示,SWISS学习策略可以在三个benchmark(Office-31、Office-Home和DomainNet)上实现高准确率。Abstract
Unsupervised domain adaptation (UDA) focuses on transferring knowledge learned in the labeled source domain to the unlabeled target domain. Despite significant progress that has been achieved in single-target domain adaptation for image classification in recent years, the extension from single-target to multi-target domain adaptation is still a largely unexplored problem area. In general, unsupervised domain adaptation faces a major challenge when attempting to learn reliable information from a single unlabeled target domain. Increasing the number of unlabeled target domains further exacerbate the problem rather significantly. In this paper, we propose a novel strong-weak integrated semi-supervision (SWISS) learning strategy for image classification using unsupervised domain adaptation that works well for both single-target and multi-target scenarios. Under the proposed SWISS-UDA framework, a strong representative set with high confidence but low diversity target domain samples and a weak representative set with low confidence but high diversity target domain samples are updated constantly during the training process. Both sets are fused to generate an augmented strong-weak training batch with pseudo-labels to train the network during every iteration. The extension from single-target to multi-target domain adaptation is accomplished by exploring the class-wise distance relationship between domains and replacing the strong representative set with much stronger samples from peer domains via peer scaffolding. Moreover, a novel adversarial logit loss is proposed to reduce the intra-class divergence between source and target domains, which is back-propagated adversarially with a gradient reverse layer between the classifier and the rest of the network. Experimental results based on three benchmarks, Office-31, Office-Home, and DomainNet, show the effectiveness of the proposed SWISS framework.
摘要
Unsupervised domain adaptation (UDA) 专注于将源领域中标注的知识传递到目标领域中无标注的数据上。虽然在最近几年内对单目标领域 adaptation 进行了显著的进步,但将单目标领域 adaptation 扩展到多目标领域仍然是一个未探索的问题领域。总的来说,无监督领域 adaptation 面临着在单无标注目标领域中学习可靠信息的主要挑战。增加目标领域的数量更加扩大了问题。在本文中,我们提出了一种基于强弱结合 semi-supervision(SWISS)学习策略,用于无监督领域 adaptation 的图像分类。该策略在单目标和多目标场景中都能够工作良好。在我们提出的SWISS-UDA框架中,一个强大的代表集(strong representative set)和一个弱小的代表集(weak representative set)在训练过程中不断更新。两个集合被融合以生成一个增强的强弱训练集(augmented strong-weak training batch),用于在每次迭代中训练网络。在扩展到多目标领域时,我们利用类别之间的距离关系来替换强大的代表集,并通过对应的骨干来提高强大的样本。此外,我们还提出了一种新的对抗LOGIT损失函数,用于减少源领域和目标领域之间的内类差。这种损失函数通过对抗反向传播来降低内类差。实验结果基于Office-31、Office-Home和DomainNet三个标准测试集,证明了我们提出的SWISS框架的效果。
Using Unsupervised and Supervised Learning and Digital Twin for Deep Convective Ice Storm Classification
results: 论文的实验结果显示,使用这些方法可以在两个不同的 dataset 上达到识别风暴云和不同云类型的高精度,包括在热带区域上达到80%以上的准确率,在非热带区域上达到90%以上的准确率和40%以上的准确率。此外,这些方法还能够抗耗器雷达噪声。Abstract
Smart Ice Cloud Sensing (SMICES) is a small-sat concept in which a primary radar intelligently targets ice storms based on information collected by a lookahead radiometer. Critical to the intelligent targeting is accurate identification of storm/cloud types from eight bands of radiance collected by the radiometer. The cloud types of interest are: clear sky, thin cirrus, cirrus, rainy anvil, and convection core. We describe multi-step use of Machine Learning and Digital Twin of the Earth's atmosphere to derive such a classifier. First, a digital twin of Earth's atmosphere called a Weather Research Forecast (WRF) is used generate simulated lookahead radiometer data as well as deeper "science" hidden variables. The datasets simulate a tropical region over the Caribbean and a non-tropical region over the Atlantic coast of the United States. A K-means clustering over the scientific hidden variables was utilized by human experts to generate an automatic labelling of the data - mapping each physical data point to cloud types by scientists informed by mean/centroids of hidden variables of the clusters. Next, classifiers were trained with the inputs of the simulated radiometer data and its corresponding label. The classifiers of a random decision forest (RDF), support vector machine (SVM), Gaussian na\"ive bayes, feed forward artificial neural network (ANN), and a convolutional neural network (CNN) were trained. Over the tropical dataset, the best performing classifier was able to identify non-storm and storm clouds with over 80% accuracy in each class for a held-out test set. Over the non-tropical dataset, the best performing classifier was able to classify non-storm clouds with over 90% accuracy and storm clouds with over 40% accuracy. Additionally both sets of classifiers were shown to be resilient to instrument noise.
摘要
智能冰云感测(SMICES)是一种小卫星概念,其主要采用智能针对方式对冰风暴进行感测,基于预测器收集的八个频率强度数据进行准确识别风暴/云类型。我们使用多个步骤的机器学习和地球大气数字模型来 derivation 这种分类器。首先,我们使用地球大气数字模型(WRF)生成了模拟的预测器数据,以及更深入的科学隐藏变量。这些数据模拟了一个热带地区在加勒比海和一个非热带地区在美国东岸。然后,我们使用K-means归一化将数据分为不同的云类型,并由人工智能经过标注来自动将物理数据点映射到云类型。接着,我们将模拟预测器数据作为输入,并使用Random Decision Forest(RDF)、支持向量机(SVM)、泊松隐藏Naive Bayes(Gaussian Naive Bayes)、Feed Forward Artificial Neural Network(ANN)和卷积神经网络(CNN)等五种分类器进行训练。在热带数据集上,最佳表现的分类器可以识别非风暴云和风暴云的准确率高于80%。在非热带数据集上,最佳表现的分类器可以识别非风暴云的准确率高于90%,并且识别风暴云的准确率高于40%。此外,这些分类器都能够抗耗rument噪声。
Ethnicity and Biometric Uniqueness: Iris Pattern Individuality in a West African Database
results: 研究发现,尽管非洲人种群的眼睛特征具有较为粗糙的文本特征大小,但是可以通过调整操作决策门槛来减少眼睛识别错误的风险。这表明,尽管人种群的差异,但是在这个西非人种群中,个体可以通过比较眼睛图像来减少识别错误。Abstract
We conducted more than 1.3 million comparisons of iris patterns encoded from images collected at two Nigerian universities, which constitute the newly available African Human Iris (AFHIRIS) database. The purpose was to discover whether ethnic differences in iris structure and appearance such as the textural feature size, as contrasted with an all-Chinese image database or an American database in which only 1.53% were of African-American heritage, made a material difference for iris discrimination. We measured a reduction in entropy for the AFHIRIS database due to the coarser iris features created by the thick anterior layer of melanocytes, and we found stochastic parameters that accurately model the relevant empirical distributions. Quantile-Quantile analysis revealed that a very small change in operational decision thresholds for the African database would compensate for the reduced entropy and generate the same performance in terms of resistance to False Matches. We conclude that despite demographic difference, individuality can be robustly discerned by comparison of iris patterns in this West African population.
摘要
Note: The original text is written in English and the translation is in Simplified Chinese. Please note that the translation is based on the standard language and may not reflect the exact nuances of the original text.
DF-TransFusion: Multimodal Deepfake Detection via Lip-Audio Cross-Attention and Facial Self-Attention
results: 与现有多模式深伪检测技术相比,实现了更高的 F-1 和每个视频 AUC 得分Abstract
With the rise in manipulated media, deepfake detection has become an imperative task for preserving the authenticity of digital content. In this paper, we present a novel multi-modal audio-video framework designed to concurrently process audio and video inputs for deepfake detection tasks. Our model capitalizes on lip synchronization with input audio through a cross-attention mechanism while extracting visual cues via a fine-tuned VGG-16 network. Subsequently, a transformer encoder network is employed to perform facial self-attention. We conduct multiple ablation studies highlighting different strengths of our approach. Our multi-modal methodology outperforms state-of-the-art multi-modal deepfake detection techniques in terms of F-1 and per-video AUC scores.
摘要
随着伪造媒体的普及,深刻检测深伪成为保持数字内容真实性的必要任务。在这篇论文中,我们提出了一种新的多Modal音频视频框架,用于同时处理音频和视频输入,以检测深伪。我们的模型利用输入音频的唇同步通过交叉注意机制,同时通过精心调整的 VGG-16 网络提取视觉cue。然后,我们使用 transformer 编码器网络进行面部自注意。我们进行了多个减少研究,描述了不同方法的优势。我们的多Modal方法在 F-1 和每个视频 AUC 分数上超过了当前最佳多Modal深伪检测技术。
Attention De-sparsification Matters: Inducing Diversity in Digital Pathology Representation Learning
results: 我们的分析发现,vanilla SSL 预训练模型的注意力分布存在一个有趣的观察:模型倾向于在图像中局部化注意力,即模型强调一些图像中的一些明确的模式。然而,这种注意力稀采可能不适合数字病理学图像,因为这些图像不同于自然图像,不是中心对象所在的物体。因此,我们提出了一种利用细胞 segmentation 提取多种 Histopathology 特定表征,然后提出一种带有先导的积极预text任务,以使模型学习多种视图之间的匹配。这种方法使得模型能够更均匀地注意多种组件,从而引入多样化注意力,以捕捉更加 ricgh的上下文表征。我们通过多种任务的量化和质量分析,证明了我们的方法的效果,并发现了模型的注意力更加广泛分布。Abstract
We propose DiRL, a Diversity-inducing Representation Learning technique for histopathology imaging. Self-supervised learning techniques, such as contrastive and non-contrastive approaches, have been shown to learn rich and effective representations of digitized tissue samples with limited pathologist supervision. Our analysis of vanilla SSL-pretrained models' attention distribution reveals an insightful observation: sparsity in attention, i.e, models tends to localize most of their attention to some prominent patterns in the image. Although attention sparsity can be beneficial in natural images due to these prominent patterns being the object of interest itself, this can be sub-optimal in digital pathology; this is because, unlike natural images, digital pathology scans are not object-centric, but rather a complex phenotype of various spatially intermixed biological components. Inadequate diversification of attention in these complex images could result in crucial information loss. To address this, we leverage cell segmentation to densely extract multiple histopathology-specific representations, and then propose a prior-guided dense pretext task for SSL, designed to match the multiple corresponding representations between the views. Through this, the model learns to attend to various components more closely and evenly, thus inducing adequate diversification in attention for capturing context rich representations. Through quantitative and qualitative analysis on multiple tasks across cancer types, we demonstrate the efficacy of our method and observe that the attention is more globally distributed.
摘要
我们提出了DiRL,一种基于多样性的表示学习技术,用于 histopathology 图像。无监督学习技术,如对照和非对照方法,可以学习照片样本中的丰富和有效表示,只需限制的病理师指导。我们对vanilla SSL 预训练模型的注意力分布进行分析,发现一个有趣的观察结果:模型倾向于在图像中LOCALIZATION 的一些显著特征上集中大量的注意力。although attention sparsity can be beneficial in natural images due to these prominent patterns being the object of interest itself, this can be sub-optimal in digital pathology; this is because, unlike natural images, digital pathology scans are not object-centric, but rather a complex phenotype of various spatially intermixed biological components. Inadequate diversification of attention in these complex images could result in crucial information loss. To address this, we leverage cell segmentation to densely extract multiple histopathology-specific representations, and then propose a prior-guided dense pretext task for SSL, designed to match the multiple corresponding representations between the views. Through this, the model learns to attend to various components more closely and evenly, thus inducing adequate diversification in attention for capturing context rich representations. Through quantitative and qualitative analysis on multiple tasks across cancer types, we demonstrate the efficacy of our method and observe that the attention is more globally distributed.
Exploring Non-additive Randomness on ViT against Query-Based Black-Box Attacks
results: 研究表明,通过非加性随机性来防范QBBA攻击可以获得有效的防御效果,而且不会带来过多的性能损失。Abstract
Deep Neural Networks can be easily fooled by small and imperceptible perturbations. The query-based black-box attack (QBBA) is able to create the perturbations using model output probabilities of image queries requiring no access to the underlying models. QBBA poses realistic threats to real-world applications. Recently, various types of robustness have been explored to defend against QBBA. In this work, we first taxonomize the stochastic defense strategies against QBBA. Following our taxonomy, we propose to explore non-additive randomness in models to defend against QBBA. Specifically, we focus on underexplored Vision Transformers based on their flexible architectures. Extensive experiments show that the proposed defense approach achieves effective defense, without much sacrifice in performance.
摘要
深度神经网络容易受到小型、难以察觉的变化的影响。Query-based黑盒攻击(QBBA)可以通过使用图像查询输出概率来生成变化,无需访问基础模型。QBBA对实际应用场景 pose 了真实的威胁。在这种情况下,我们首先将随机防御策略对QBBA进行分类。根据我们的分类,我们提议利用模型中的非加性随机性来防御QBBA。specifically,我们关注未得到足够关注的视觉转换器,因为它们具有灵活的体系。我们的实验表明,我们提议的防御策略可以有效地防止攻击,而不会影响性能的减少。
paper_authors: Syed Waleed Hyder, Muhammad Usama, Anas Zafar, Muhammad Naufil, Fawad Javed Fateh, Andrey Konin, M. Zeeshan Zia, Quoc-Huy Tran
for: 人体活动识别 task 的细致分割,可以提高人体活动识别的精度。
methods: 使用2Dskeleton热图序列作为输入,并使用TCN进行时空特征提取。
results: 相比前方法,本方法具有更好的稳定性和抗失败性,并且可以在action segmentation datasets上达到相似或更好的性能。此外,通过将2Dskeleton热图序列和RGB视频作为输入进行联合学习,可以进一步提高性能。Abstract
This paper presents a 2D skeleton-based action segmentation method with applications in fine-grained human activity recognition. In contrast with state-of-the-art methods which directly take sequences of 3D skeleton coordinates as inputs and apply Graph Convolutional Networks (GCNs) for spatiotemporal feature learning, our main idea is to use sequences of 2D skeleton heatmaps as inputs and employ Temporal Convolutional Networks (TCNs) to extract spatiotemporal features. Despite lacking 3D information, our approach yields comparable/superior performances and better robustness against missing keypoints than previous methods on action segmentation datasets. Moreover, we improve the performances further by using both 2D skeleton heatmaps and RGB videos as inputs. To our best knowledge, this is the first work to utilize 2D skeleton heatmap inputs and the first work to explore 2D skeleton+RGB fusion for action segmentation.
摘要
AGMDT: Virtual Staining of Renal Histology Images with Adjacency-Guided Multi-Domain Transfer
results: 实验结果表明,提出的 AGMDT 方法可以很好地平衡精确的像素级匹配和无需对数据进行对应的预处理,同时利用多域序列切片图像之间的相关性,以提高诊断的准确性和效率。Abstract
Renal pathology, as the gold standard of kidney disease diagnosis, requires doctors to analyze a series of tissue slices stained by H&E staining and special staining like Masson, PASM, and PAS, respectively. These special staining methods are costly, time-consuming, and hard to standardize for wide use especially in primary hospitals. Advances of supervised learning methods have enabled the virtually conversion of H&E images into special staining images, but achieving pixel-to-pixel alignment for training remains challenging. In contrast, unsupervised learning methods regarding different stains as different style transfer domains can utilize unpaired data, but they ignore the spatial inter-domain correlations and thus decrease the trustworthiness of structural details for diagnosis. In this paper, we propose a novel virtual staining framework AGMDT to translate images into other domains by avoiding pixel-level alignment and meanwhile utilizing the correlations among adjacent tissue slices. We first build a high-quality multi-domain renal histological dataset where each specimen case comprises a series of slices stained in various ways. Based on it, the proposed framework AGMDT discovers patch-level aligned pairs across the serial slices of multi-domains through glomerulus detection and bipartite graph matching, and utilizes such correlations to supervise the end-to-end model for multi-domain staining transformation. Experimental results show that the proposed AGMDT achieves a good balance between the precise pixel-level alignment and unpaired domain transfer by exploiting correlations across multi-domain serial pathological slices, and outperforms the state-of-the-art methods in both quantitative measure and morphological details.
摘要
肾脏病学(为肾脏病诊断的标准)需要医生分析一系列染料处理后的组织切片,如H&E染色和特殊染色如Masson、PASM和PAS等。这些特殊染色方法是费时费力,难以普及使用,特别是在初级医院。随着超级vised学习方法的发展,可以将H&E图像转化为特殊染色图像,但在训练时Pixel-to-Pixel对齐仍然是挑战。相反,不监督学习方法将不同染色视为不同的风格传输领域,可以使用无对数据,但忽略了组织间空间相关性,因此降低了诊断结果的可靠性。在这篇论文中,我们提出了一种新的虚拟染色框架AGMDT,可以将图像转化为其他领域,而不需Pixel-to-Pixel对齐。我们首先构建了高质量多频道肾脏 histological 数据集,每个样本 случа包括一系列不同染色的组织切片。基于这个数据集,我们的框架AGMDT可以在序列切片中找到适合的 patch-level 对应对,并使用这些对应关系来监督终端模型进行多频道染色变换。实验结果表明,我们的 AGMDT 可以很好地平衡精确的Pixel-to-Pixel对齐和无对频道传输,并且在量度测试和结构细节方面都超过了当前的状况。
InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation
results: 这篇论文提出了一种基于Diffusion模型的一步文本到图像生成器,使用了新的文本条件管道,并使用Rectified Flow方法来提高采样速度和计算成本。实验结果表明,这种一步模型可以在MS COCO 2017-5k上达到SD级别的图像质量,FID值为23.3,比前一个状态的技术(进行分布式采样)有了较大的提升(FID值为37.2)。此外,通过使用扩展网络和1.7B参数, authors还提高了FID值到22.4。Abstract
Diffusion models have revolutionized text-to-image generation with its exceptional quality and creativity. However, its multi-step sampling process is known to be slow, often requiring tens of inference steps to obtain satisfactory results. Previous attempts to improve its sampling speed and reduce computational costs through distillation have been unsuccessful in achieving a functional one-step model. In this paper, we explore a recent method called Rectified Flow, which, thus far, has only been applied to small datasets. The core of Rectified Flow lies in its \emph{reflow} procedure, which straightens the trajectories of probability flows, refines the coupling between noises and images, and facilitates the distillation process with student models. We propose a novel text-conditioned pipeline to turn Stable Diffusion (SD) into an ultra-fast one-step model, in which we find reflow plays a critical role in improving the assignment between noise and images. Leveraging our new pipeline, we create, to the best of our knowledge, the first one-step diffusion-based text-to-image generator with SD-level image quality, achieving an FID (Frechet Inception Distance) of $23.3$ on MS COCO 2017-5k, surpassing the previous state-of-the-art technique, progressive distillation, by a significant margin ($37.2$ $\rightarrow$ $23.3$ in FID). By utilizing an expanded network with 1.7B parameters, we further improve the FID to $22.4$. We call our one-step models \emph{InstaFlow}. On MS COCO 2014-30k, InstaFlow yields an FID of $13.1$ in just $0.09$ second, the best in $\leq 0.1$ second regime, outperforming the recent StyleGAN-T ($13.9$ in $0.1$ second). Notably, the training of InstaFlow only costs 199 A100 GPU days. Project page:~\url{https://github.com/gnobitab/InstaFlow}.
摘要
Diffusion模型已经革命化了文本到图像生成,其品质和创新性都非常出色。然而,它的多步骤采样过程相对较慢,经常需要数十个推理步骤以获得满意的结果。以前的尝试通过热化来提高采样速度和计算成本,但是没有实现了一个功能的一步模型。在这篇论文中,我们研究了一种名为Rectified Flow的方法,它只在小数据集上应用。Rectified Flow的核心在于它的“重定向”过程,它将概率流的轨迹正则化,提高图像和噪声之间的协同关系,并且使学生模型的整体准确性得到提高。我们提出了一种基于Stable Diffusion(SD)的文本受控管道,将SD转化为一个超快一步模型,并发现重定向在填充关系中发挥了关键作用。通过我们的新管道,我们创造了,到我们所知道的最高水平,第一个基于Diffusion的文本到图像生成器,其FID(Frechet Inception Distance)为23.3,超过了之前的最佳技术进步distillation,并且在MS COCO 2017-5k上达到了37.2$\Rightarrow$23.3的FID提升。通过使用扩展的网络和1.7B参数,我们进一步提高了FID到22.4。我们称我们的一步模型为InstaFlow。在MS COCO 2014-30k上,InstaFlow在0.09秒内达到了FID13.1,在$\leq 0.1$秒的 régime中表现最佳,比StyleGAN-T(13.9在0.1秒)更高。另外,我们在训练InstaFlow时只需要199个A100 GPU天。更多信息请访问我们的项目页面:.
Padding-free Convolution based on Preservation of Differential Characteristics of Kernels
results: 在图像分类、Semantic segmentation和超分辨重建任务中显著超过比较方法的性能Abstract
Convolution is a fundamental operation in image processing and machine learning. Aimed primarily at maintaining image size, padding is a key ingredient of convolution, which, however, can introduce undesirable boundary effects. We present a non-padding-based method for size-keeping convolution based on the preservation of differential characteristics of kernels. The main idea is to make convolution over an incomplete sliding window "collapse" to a linear differential operator evaluated locally at its central pixel, which no longer requires information from the neighbouring missing pixels. While the underlying theory is rigorous, our final formula turns out to be simple: the convolution over an incomplete window is achieved by convolving its nearest complete window with a transformed kernel. This formula is computationally lightweight, involving neither interpolation or extrapolation nor restrictions on image and kernel sizes. Our method favours data with smooth boundaries, such as high-resolution images and fields from physics. Our experiments include: i) filtering analytical and non-analytical fields from computational physics and, ii) training convolutional neural networks (CNNs) for the tasks of image classification, semantic segmentation and super-resolution reconstruction. In all these experiments, our method has exhibited visible superiority over the compared ones.
摘要
convolution 是图像处理和机器学习中的基本操作。 padding 是 convolution 的关键组分,但是它可能会导致边界效应。我们提出了不使用 padding 的方法,以保持图像大小。我们的主要想法是使 convolution 操作在不完整的滑块上进行,将滑块中心像素的线性算法视为局部的差分特征。这样就不需要从邻近缺失像素中获取信息了。我们的理论基础是严谨的,但是我们的最终公式却很简单:在不完整的滑块上进行 convolution 操作,只需要将完整的滑块与变换后的核函数进行卷积即可。这种方法不需要 interpolate 或 extrapolate 像素值,也不需要图像和核函数的大小受限。我们的方法更适合图像边缘平滑的数据,如高分辨率图像和物理学中的场景。我们的实验包括:i) 对计算物理学中的分析和非分析场景进行滤波处理,ii) 使用卷积神经网络(CNNs)进行图像分类、semantic segmentation 和超分辨重建任务。在这些实验中,我们的方法都有可见的优越性。
Efficient Graphics Representation with Differentiable Indirection
results: 该论文在各种图形任务中,如几何和图像表示、纹理映射、照明和辐射场表示等,都能够轻松地 интеGRATE到现有的架构中,快速训练并提供高效和灵活的结果。Abstract
We introduce differentiable indirection -- a novel learned primitive that employs differentiable multi-scale lookup tables as an effective substitute for traditional compute and data operations across the graphics pipeline. We demonstrate its flexibility on a number of graphics tasks, i.e., geometric and image representation, texture mapping, shading, and radiance field representation. In all cases, differentiable indirection seamlessly integrates into existing architectures, trains rapidly, and yields both versatile and efficient results.
摘要
我们介绍了微调irection -- 一种新的学习 primitives,它使用可微分多尺度lookup表作为图形管道中的有效替代方法。我们在几个图形任务中展示了它的灵活性,包括几何和图像表示、纹理映射、照明和辐射场表示。在所有情况下,微调irection顺利地集成到现有的架构中,快速训练,并产生了高效和多元的结果。
Exploring Flat Minima for Domain Generalization with Large Learning Rates
results: 在分类和semantic segmentation领域的频率不同的泛化 benchmark 上达到了状态机器的性能Here’s a more detailed explanation of each point:
for: The paper aims to improve the generalization ability of deep learning models in different domains, which is a key problem in domain generalization (DG).
methods: The authors propose a new training strategy called Lookahead, which leverages a large learning rate to promote weight diversity and facilitate the identification of flat regions in the loss landscape. To prevent overfitting during training, they also propose two variants of weight regularization.
results: The proposed method achieves state-of-the-art performance on both classification and semantic segmentation domain generalization benchmarks.I hope this helps! Let me know if you have any further questions.Abstract
Domain Generalization (DG) aims to generalize to arbitrary unseen domains. A promising approach to improve model generalization in DG is the identification of flat minima. One typical method for this task is SWAD, which involves averaging weights along the training trajectory. However, the success of weight averaging depends on the diversity of weights, which is limited when training with a small learning rate. Instead, we observe that leveraging a large learning rate can simultaneously promote weight diversity and facilitate the identification of flat regions in the loss landscape. However, employing a large learning rate suffers from the convergence problem, which cannot be resolved by simply averaging the training weights. To address this issue, we introduce a training strategy called Lookahead which involves the weight interpolation, instead of average, between fast and slow weights. The fast weight explores the weight space with a large learning rate, which is not converged while the slow weight interpolates with it to ensure the convergence. Besides, weight interpolation also helps identify flat minima by implicitly optimizing the local entropy loss that measures flatness. To further prevent overfitting during training, we propose two variants to regularize the training weight with weighted averaged weight or with accumulated history weight. Taking advantage of this new perspective, our methods achieve state-of-the-art performance on both classification and semantic segmentation domain generalization benchmarks. The code is available at https://github.com/koncle/DG-with-Large-LR.
摘要
SAMPLING: Scene-adaptive Hierarchical Multiplane Images Representation for Novel View Synthesis from a Single Image
results: 该方法在KITTI dataset上对大规模的户外场景进行了单张图像新视图合成,并在未seen Tanks and Temples dataset上进行了推广。结果表明该方法具有显著的性能提升。Abstract
Recent novel view synthesis methods obtain promising results for relatively small scenes, e.g., indoor environments and scenes with a few objects, but tend to fail for unbounded outdoor scenes with a single image as input. In this paper, we introduce SAMPLING, a Scene-adaptive Hierarchical Multiplane Images Representation for Novel View Synthesis from a Single Image based on improved multiplane images (MPI). Observing that depth distribution varies significantly for unbounded outdoor scenes, we employ an adaptive-bins strategy for MPI to arrange planes in accordance with each scene image. To represent intricate geometry and multi-scale details, we further introduce a hierarchical refinement branch, which results in high-quality synthesized novel views. Our method demonstrates considerable performance gains in synthesizing large-scale unbounded outdoor scenes using a single image on the KITTI dataset and generalizes well to the unseen Tanks and Temples dataset.The code and models will soon be made available.
摘要
新的小说视图合成方法在相对较小的场景中,如室内环境和一些物品的场景中,取得了可喜的结果,但在卷积图片作为输入的场景中很容易失败。在这篇论文中,我们介绍了采样,一种基于改进的多平面图像(MPI)的Scene-adaptive Hierarchical Multiplane Images Representation дляNovel View Synthesis from a Single Image。我们发现了不同的景象中的深度分布变化很大,因此我们采用了适应性的桶策略来安排MPI中的平面。为了表示复杂的几何结构和多尺度细节,我们还引入了层次修复分支,从而实现高质量的合成新视图。我们的方法在KITTI数据集上Synthesizing large-scale unbounded outdoor scenes using a single image demonstrates considerable performance gains and generalizes well to the unseen Tanks and Temples dataset.我们很快将代码和模型公开。
Semantic and Articulated Pedestrian Sensing Onboard a Moving Vehicle
results: 研究发现,利用LiDAR数据进行人体探测和预测可以提高交通安全性,尤其是在车辆前进速度较快的情况下。Abstract
It is difficult to perform 3D reconstruction from on-vehicle gathered video due to the large forward motion of the vehicle. Even object detection and human sensing models perform significantly worse on onboard videos when compared to standard benchmarks because objects often appear far away from the camera compared to the standard object detection benchmarks, image quality is often decreased by motion blur and occlusions occur often. This has led to the popularisation of traffic data-specific benchmarks. Recently Light Detection And Ranging (LiDAR) sensors have become popular to directly estimate depths without the need to perform 3D reconstructions. However, LiDAR-based methods still lack in articulated human detection at a distance when compared to image-based methods. We hypothesize that benchmarks targeted at articulated human sensing from LiDAR data could bring about increased research in human sensing and prediction in traffic and could lead to improved traffic safety for pedestrians.
摘要
“对于车辆上收集的影像进行3D重建很难,因为车辆前进速度很快,使得物体识别和人体感应模型在车辆上影像上表现比标准参考模型更差,物体常常在摄像头相比较远,影像质量受到运动模糊和遮挡的影响。这导致了交通数据特有的参考标准的出现。然而,激光探测(LiDAR)仪仍然缺乏在距离中探测人体的细节,与影像基本方法相比。我们假设,对于LiDAR数据的人体探测参考标准可能会带来更多的人体感应和预测研究,从而提高了交通安全性。”
Towards High-Quality Specular Highlight Removal by Leveraging Large-Scale Synthetic Data
results: 对于实际图像,网络能够广泛适用并且能够生成高质量的抛光自由图像。Abstract
This paper aims to remove specular highlights from a single object-level image. Although previous methods have made some progresses, their performance remains somewhat limited, particularly for real images with complex specular highlights. To this end, we propose a three-stage network to address them. Specifically, given an input image, we first decompose it into the albedo, shading, and specular residue components to estimate a coarse specular-free image. Then, we further refine the coarse result to alleviate its visual artifacts such as color distortion. Finally, we adjust the tone of the refined result to match that of the input as closely as possible. In addition, to facilitate network training and quantitative evaluation, we present a large-scale synthetic dataset of object-level images, covering diverse objects and illumination conditions. Extensive experiments illustrate that our network is able to generalize well to unseen real object-level images, and even produce good results for scene-level images with multiple background objects and complex lighting.
摘要
这个论文目标是从单个物体图像中去除反射高光。先前的方法有一定的进步,但其性能仍然有一定的限制,特别是对于真实的图像来说,它们的反射高光非常复杂。为此,我们提议一个三个阶段的网络来解决这个问题。具体来说,给定一个输入图像,我们首先将其分解为反射、阴影和反射剩余组成来估算一个粗略的反射减少后的图像。然后,我们进一步修正这个粗略结果,以消除它的视觉artefacts,如颜色扭曲。最后,我们将修正后的结果与输入图像的颜色匹配到一样多少可能。此外,为了训练网络和量化评估,我们提供了一个大规模的Synthetic数据集,覆盖了多种物体和照明条件。广泛的实验表明,我们的网络能够通过训练来泛化到未经看过的真实物体图像,甚至对多个背景物体和复杂的照明条件进行好的处理。
Self-Training and Multi-Task Learning for Limited Data: Evaluation Study on Object Detection
results: 实验结果显示,在没有教师训练数据的情况下,使用弱教师和未见数据进行自动学习可以提高性能。此外,在部分注释的数据情况下,多任务学习也可以提高性能。Abstract
Self-training allows a network to learn from the predictions of a more complicated model, thus often requires well-trained teacher models and mixture of teacher-student data while multi-task learning jointly optimizes different targets to learn salient interrelationship and requires multi-task annotations for each training example. These frameworks, despite being particularly data demanding have potentials for data exploitation if such assumptions can be relaxed. In this paper, we compare self-training object detection under the deficiency of teacher training data where students are trained on unseen examples by the teacher, and multi-task learning with partially annotated data, i.e. single-task annotation per training example. Both scenarios have their own limitation but potentially helpful with limited annotated data. Experimental results show the improvement of performance when using a weak teacher with unseen data for training a multi-task student. Despite the limited setup we believe the experimental results show the potential of multi-task knowledge distillation and self-training, which could be beneficial for future study. Source code is at https://lhoangan.github.io/multas.
摘要
自我训练允许网络从更复杂的模型的预测中学习,因此通常需要Well-trained教师模型和学生数据的混合,而多任务学习则是同时优化不同目标以学习突出关系的方法。这些框架,尽管特别是数据具有潜在的可优化假设,但它们具有潜在的数据利用潜力。在这篇论文中,我们比较了无教师训练数据的情况下,学生通过不види的教师训练来进行自我训练,以及具有部分注释数据的多任务学习。两种情况都具有自己的局限性,但它们可能帮助处理有限的注释数据。实验结果表明,使用弱教师和未见数据进行学生训练可以提高性能。尽管我们的设置有限,但我们认为实验结果表明了多任务知识储存和自我训练的潜力,这可能对未来的研究有所帮助。源代码可以在 中找到。
Fg-T2M: Fine-Grained Text-Driven Human Motion Generation via Diffusion Model
results: 实验表明,我们的方法在 HumanML3D 和 KIT 测试集上比文本驱动动作生成方法表现出色,生成更好的视觉确认动作序列。Abstract
Text-driven human motion generation in computer vision is both significant and challenging. However, current methods are limited to producing either deterministic or imprecise motion sequences, failing to effectively control the temporal and spatial relationships required to conform to a given text description. In this work, we propose a fine-grained method for generating high-quality, conditional human motion sequences supporting precise text description. Our approach consists of two key components: 1) a linguistics-structure assisted module that constructs accurate and complete language feature to fully utilize text information; and 2) a context-aware progressive reasoning module that learns neighborhood and overall semantic linguistics features from shallow and deep graph neural networks to achieve a multi-step inference. Experiments show that our approach outperforms text-driven motion generation methods on HumanML3D and KIT test sets and generates better visually confirmed motion to the text conditions.
摘要
文本驱动人体动作生成在计算机视觉中具有重要性和挑战。然而,当前的方法仅能生成决定性或不准确的动作序列,无法有效控制文本描述中的时间和空间关系。在这项工作中,我们提出了细化的方法,用于生成高质量、条件人体动作序列,支持精确的文本描述。我们的方法包括两个关键组件:1)语言结构协助模块,通过准确地构建语言特征来完全利用文本信息;2)上下文感知进程理解模块,通过浅混合图神经网络和深度图神经网络学习上下文和整体语义特征,实现多步推理。实验表明,我们的方法在人体ML3D和KIT测试集上超越文本驱动动作生成方法,并生成更好的视觉确认的动作序列。
IBAFormer: Intra-batch Attention Transformer for Domain Generalized Semantic Segmentation
paper_authors: Qiyu Sun, Huilin Chen, Meng Zheng, Ziyan Wu, Michael Felsberg, Yang Tang for:* 这个论文旨在解决领域总体 semantic segmentation (DGSS) 问题,即使用无法访问目标数据的情况下,使模型具备更好的泛化能力。methods:* 该论文提出了两种替代性内批注机制:mean-based intra-batch attention (MIBA) 和 element-wise intra-batch attention (EIBA),以捕捉不同批次之间的相关性,提高特征表示和泛化能力。results:* 该论文提出的 IBAFormer 模型在 DGSS 问题上实现了最佳性能,并通过精心设计的实验证明了每个引入的组件的效果。Abstract
Domain generalized semantic segmentation (DGSS) is a critical yet challenging task, where the model is trained only on source data without access to any target data. Despite the proposal of numerous DGSS strategies, the generalization capability remains limited in CNN architectures. Though some Transformer-based segmentation models show promising performance, they primarily focus on capturing intra-sample attentive relationships, disregarding inter-sample correlations which can potentially benefit DGSS. To this end, we enhance the attention modules in Transformer networks for improving DGSS by incorporating information from other independent samples in the same batch, enriching contextual information, and diversifying the training data for each attention block. Specifically, we propose two alternative intra-batch attention mechanisms, namely mean-based intra-batch attention (MIBA) and element-wise intra-batch attention (EIBA), to capture correlations between different samples, enhancing feature representation and generalization capabilities. Building upon intra-batch attention, we introduce IBAFormer, which integrates self-attention modules with the proposed intra-batch attention for DGSS. Extensive experiments demonstrate that IBAFormer achieves SOTA performance in DGSS, and ablation studies further confirm the effectiveness of each introduced component.
摘要
域面泛化 semantic segmentation (DGSS) 是一项关键但受挑战的任务,模型只在源数据上训练,无法访问目标数据。 DESPITE numerous DGSS strategies 的提议,模型的泛化能力仍然有限。 虽然一些基于 Transformer 的 segmentation 模型表现出色,但它们主要是 capture 内样关系,忽略了 между样关系,这可能会增强 DGSS。为此,我们在 Transformer 网络中增强注意模块,以便提高 DGSS 的泛化能力。 Specifically, we propose two alternative intra-batch attention mechanisms, namely mean-based intra-batch attention (MIBA) and element-wise intra-batch attention (EIBA), to capture correlations between different samples, enrich contextual information, and diversify the training data for each attention block。 Building upon intra-batch attention, we introduce IBAFormer, which integrates self-attention modules with the proposed intra-batch attention for DGSS。 Extensive experiments demonstrate that IBAFormer achieves SOTA performance in DGSS, and ablation studies further confirm the effectiveness of each introduced component。
OTAS: Unsupervised Boundary Detection for Object-Centric Temporal Action Segmentation
for: 本研究旨在提高视觉 temporal action segmentation 的精度和效率,通过发现 global visual descriptors 中的剧烈变化。
methods: 本研究提出了一种不需要监督的 Object-centric Temporal Action Segmentation (OTAS) 框架,包括自动提取 global 和 local feature 模块以及一个综合Feature fusion 模块,用于检测行为分割的精度。
results: 对比之前的状态对抗方法,OTAS 在我们建议的 F1 Score 上提高了41%的平均值,并且在用户研究中还能够超越人工标注。此外,OTAS 具有实时推理的能力。Abstract
Temporal action segmentation is typically achieved by discovering the dramatic variances in global visual descriptors. In this paper, we explore the merits of local features by proposing the unsupervised framework of Object-centric Temporal Action Segmentation (OTAS). Broadly speaking, OTAS consists of self-supervised global and local feature extraction modules as well as a boundary selection module that fuses the features and detects salient boundaries for action segmentation. As a second contribution, we discuss the pros and cons of existing frame-level and boundary-level evaluation metrics. Through extensive experiments, we find OTAS is superior to the previous state-of-the-art method by $41\%$ on average in terms of our recommended F1 score. Surprisingly, OTAS even outperforms the ground-truth human annotations in the user study. Moreover, OTAS is efficient enough to allow real-time inference.
摘要
通常情况下,时间动作分割通过发现全局视觉特征的异常变化来实现。在这篇论文中,我们探讨了本地特征的优势,并提出了无监督的对象中心时间动作分割(OTAS)框架。OTAS包括自我监督全局和本地特征提取模块以及边界选择模块,这些模块共同协同做出动作分割。作为第二个贡献,我们评估了现有帧级和边界级评价指标的优缺点。经过广泛的实验,我们发现OTAS在我们建议的F1分数上比前一代方法提高41%的平均值。更 surprisngly,OTAS甚至超过了用户研究中的人工标注。此外,OTAS具有实时推理的能力。
Modality Unifying Network for Visible-Infrared Person Re-Identification
results: 对多个公共数据集进行了广泛的实验,结果表明,提出的方法可以在可见光和无可见光人识别任务中显著超越当前状态的方法。Abstract
Visible-infrared person re-identification (VI-ReID) is a challenging task due to large cross-modality discrepancies and intra-class variations. Existing methods mainly focus on learning modality-shared representations by embedding different modalities into the same feature space. As a result, the learned feature emphasizes the common patterns across modalities while suppressing modality-specific and identity-aware information that is valuable for Re-ID. To address these issues, we propose a novel Modality Unifying Network (MUN) to explore a robust auxiliary modality for VI-ReID. First, the auxiliary modality is generated by combining the proposed cross-modality learner and intra-modality learner, which can dynamically model the modality-specific and modality-shared representations to alleviate both cross-modality and intra-modality variations. Second, by aligning identity centres across the three modalities, an identity alignment loss function is proposed to discover the discriminative feature representations. Third, a modality alignment loss is introduced to consistently reduce the distribution distance of visible and infrared images by modality prototype modeling. Extensive experiments on multiple public datasets demonstrate that the proposed method surpasses the current state-of-the-art methods by a significant margin.
摘要
visible-infrared人识别(VI-ReID)是一个具有大量交叉模式差异和内类变化的挑战任务。现有方法主要集中于学习共同特征空间中的模式共享表示。这使得学习的特征强调共同模式 across modalities 而忽略模式特定和身份意识的信息,这些信息对于 Re-ID 是有价值的。为解决这些问题,我们提出了一种新的 Modality Unifying Network(MUN),以探索一种robust的辅助模式 для VI-ReID。首先,辅助模式是通过我们提出的交叉模式学习器和内模式学习器组合生成的,这些学习器可以动态模型模式特定和共同模式,从而缓解交叉模式和内模式变化。其次,通过将身份中心对三个模式进行对应,我们提出了一个身份对齐loss函数,以发现特征表示。最后,我们引入了一个模式对齐损失,以均衡可见和红外图像的分布距离,通过模式prototype模型。我们在多个公共数据集上进行了广泛的实验,结果表明,我们的方法在与当前状态艺术方法相比,具有显著的超越。
Use neural networks to recognize students’ handwritten letters and incorrect symbols
results: 该论文通过使用图像多类别化技术,能够准确地纠正学生的多选题答案。Abstract
Correcting students' multiple-choice answers is a repetitive and mechanical task that can be considered an image multi-classification task. Assuming possible options are 'abcd' and the correct option is one of the four, some students may write incorrect symbols or options that do not exist. In this paper, five classifications were set up - four for possible correct options and one for other incorrect writing. This approach takes into account the possibility of non-standard writing options.
摘要
更正学生的多项选择答案是一项 repetitive 和机械化的任务,可以被视为一种图像多类别化任务。假设可能的选项是 'abcdef',而正确的选项是其中之一,一些学生可能写入错误的符号或不存在的选项。在这篇论文中,设置了五个分类 - 四个可能正确的选项和一个其他错误写入。这种方法考虑了非标准写入选项的可能性。Note: The translation is in Simplified Chinese, which is the standard writing system used in mainland China. If you need Traditional Chinese, please let me know.
SGFeat: Salient Geometric Feature for Point Cloud Registration
results: 在3DMatch/3DLoMatch和KITTI等知名数据集上进行的实验表明,该方法在精度和稳定性方面具有显著的优势,证明了该方法的效iveness。Abstract
Point Cloud Registration (PCR) is a critical and challenging task in computer vision. One of the primary difficulties in PCR is identifying salient and meaningful points that exhibit consistent semantic and geometric properties across different scans. Previous methods have encountered challenges with ambiguous matching due to the similarity among patch blocks throughout the entire point cloud and the lack of consideration for efficient global geometric consistency. To address these issues, we propose a new framework that includes several novel techniques. Firstly, we introduce a semantic-aware geometric encoder that combines object-level and patch-level semantic information. This encoder significantly improves registration recall by reducing ambiguity in patch-level superpoint matching. Additionally, we incorporate a prior knowledge approach that utilizes an intrinsic shape signature to identify salient points. This enables us to extract the most salient super points and meaningful dense points in the scene. Secondly, we introduce an innovative transformer that encodes High-Order (HO) geometric features. These features are crucial for identifying salient points within initial overlap regions while considering global high-order geometric consistency. To optimize this high-order transformer further, we introduce an anchor node selection strategy. By encoding inter-frame triangle or polyhedron consistency features based on these anchor nodes, we can effectively learn high-order geometric features of salient super points. These high-order features are then propagated to dense points and utilized by a Sinkhorn matching module to identify key correspondences for successful registration. In our experiments conducted on well-known datasets such as 3DMatch/3DLoMatch and KITTI, our approach has shown promising results, highlighting the effectiveness of our novel method.
摘要
点云注册(PCR)是计算机视觉中的一项关键和挑战性任务。一个主要的困难在PCR中是在不同扫描中标识符合Semantic和Geometric性质的精彩点。先前的方法在patch块之间的模糊匹配和缺乏全局高级几何一致性导致困难。为解决这些问题,我们提出了一个新的框架,包括多种新技术。首先,我们引入了一个Semantic-aware geometric encoder,该encoder结合物体层和patch层Semantic信息。这使得精彩点匹配减少了模糊性,提高了注册回归率。其次,我们采用了一种基于内在形状签名的高级知识方法,以便在场景中提取最精彩的超点和有意义的稠密点。其次,我们引入了一种创新的变换器,用于编码高级几何特征。这些特征在初始重叠区域内识别精彩点,同时考虑全局高级几何一致性。为了进一步优化这种高级变换器,我们引入了一种锚节选择策略。通过基于锚节的inter-frame三角形或多面体一致性特征编码,我们可以有效地学习高级几何特征。这些高级特征然后被传递到稠密点,并由Sinkhorn匹配模块进行成功注册。在我们对3DMatch/3DLoMatch和KITTI等知名数据集进行的实验中,我们的方法表现出了扎实的效果,证明了我们的新方法的效果。
Fast Sparse PCA via Positive Semidefinite Projection for Unsupervised Feature Selection
results: 本研究提出了一种基于PSD约束的半监督特征选择方法,并提供了一种加速方法和一种参数设置策略。经过实验 validate that the proposed method is effective and efficient on synthetic and real-world datasets.Abstract
In the field of unsupervised feature selection, sparse principal component analysis (SPCA) methods have attracted more and more attention recently. Compared to spectral-based methods, SPCA methods don't rely on the construction of a similarity matrix and show better feature selection ability on real-world data. The original SPCA formulates a nonconvex optimization problem. Existing convex SPCA methods reformulate SPCA as a convex model by regarding the reconstruction matrix as an optimization variable. However, they are lack of constraints equivalent to the orthogonality restriction in SPCA, leading to larger solution space. In this paper, it's proved that the optimal solution to a convex SPCA model falls onto the Positive Semidefinite (PSD) cone. A standard convex SPCA-based model with PSD constraint for unsupervised feature selection is proposed. Further, a two-step fast optimization algorithm via PSD projection is presented to solve the proposed model. Two other existing convex SPCA-based models are also proven to have their solutions optimized on the PSD cone in this paper. Therefore, the PSD versions of these two models are proposed to accelerate their convergence as well. We also provide a regularization parameter setting strategy for our proposed method. Experiments on synthetic and real-world datasets demonstrate the effectiveness and efficiency of the proposed methods.
摘要
在无监督特征选择领域,贪婪原始Component分析(SPCA)方法在最近吸引了越来越多的注意。相比spectral-based方法,SPCA方法不需要建立一个相似矩阵,在实际数据上表现更好的特征选择能力。原始SPCA方法转化为一个非凸优化问题。现有的凸SPCA方法将SPCA转化为一个凸模型,但是它们缺乏与SPCA中的正交约束相当的约束,导致更大的解空间。在这篇论文中,证明了凸SPCA模型的优解在Positive Semidefinite(PSD)树上。我们提出了一种标准的凸SPCA基于PSD约束的模型,并提供了一种快速优化算法via PSD投影来解决该模型。此外,我们还证明了其他两种现有的凸SPCA基于模型在PSD树上有优解。因此,我们提出了PSD版本的这两个模型,以加速它们的收敛。我们还提供了我们提posed方法的正则化参数设定策略。实验表明,我们的提posed方法在synthetic和实际数据上具有良好的效果和效率。
Computer Vision Pipeline for Automated Antarctic Krill Analysis
results: 研究人员在krill实例分割方面获得了77.28%的AP分数,并可以分别估计krill的成熟阶段和长度,并且Length error为1.96 mm。Abstract
British Antarctic Survey (BAS) researchers launch annual expeditions to the Antarctic in order to estimate Antarctic Krill biomass and assess the change from previous years. These comparisons provide insight into the effects of the current environment on this key component of the marine food chain. In this work we have developed tools for automating the data collection and analysis process, using web-based image annotation tools and deep learning image classification and regression models. We achieve highly accurate krill instance segmentation results with an average 77.28% AP score, as well as separate maturity stage and length estimation of krill specimens with 62.99% accuracy and a 1.96 mm length error respectively.
摘要
英国南极调查(BAS)研究人员每年到南极进行调查,以估计南极 krill 生物质量和评估过去年的变化。这些比较提供有关当前环境对这一关键海洋食物链成分的影响的见解。在这个工作中,我们已经开发了自动数据收集和分析工具,使用网络上的图像标注工具和深度学习图像分类和回归模型。我们得到了高度精确的 krill 实体分 segmentation 结果,具体的话是平均77.28% AP 分数,以及分类 krill 虫的成熟阶段和长度估计,具体的话是62.99% 的准确率和1.96 mm 的长度误差。
Dual-Path Temporal Map Optimization for Make-up Temporal Video Grounding
paper_authors: Jiaxiu Li, Kun Li, Jia Li, Guoliang Chen, Dan Guo, Meng Wang
for: 本研究 targets 化妆活动的视频地理Localization, aiming to accurately locate the target video segment that is semantically related to a sentence describing a make-up activity, given a long video.
methods: 该 paper proposes a novel framework named Dual-Path Temporal Map Optimization Network (DPTMO), which utilizes both query-agnostic and query-guided features to construct two proposal sets and uses specific evaluation methods for the two sets. The dual-path structure mines more semantic information in make-up videos and distinguishes fine-grained actions well.
results: compared with existing methods, the proposed DPTMO framework demonstrates superior performance in fine-grained semantic comprehension, as shown in comprehensive experiments on the YouMakeup dataset.Abstract
Make-up temporal video grounding (MTVG) aims to localize the target video segment which is semantically related to a sentence describing a make-up activity, given a long video. Compared with the general video grounding task, MTVG focuses on meticulous actions and changes on the face. The make-up instruction step, usually involving detailed differences in products and facial areas, is more fine-grained than general activities (e.g, cooking activity and furniture assembly). Thus, existing general approaches cannot locate the target activity effectually. More specifically, existing proposal generation modules are not yet fully developed in providing semantic cues for the more fine-grained make-up semantic comprehension. To tackle this issue, we propose an effective proposal-based framework named Dual-Path Temporal Map Optimization Network (DPTMO) to capture fine-grained multimodal semantic details of make-up activities. DPTMO extracts both query-agnostic and query-guided features to construct two proposal sets and uses specific evaluation methods for the two sets. Different from the commonly used single structure in previous methods, our dual-path structure can mine more semantic information in make-up videos and distinguish fine-grained actions well. These two candidate sets represent the cross-modal makeup video-text similarity and multi-modal fusion relationship, complementing each other. Each set corresponds to its respective optimization perspective, and their joint prediction enhances the accuracy of video timestamp prediction. Comprehensive experiments on the YouMakeup dataset demonstrate our proposed dual structure excels in fine-grained semantic comprehension.
摘要
make-up temporal video grounding (MTVG) 目标是将视频中相关的目标视频段本地化,给定一个长视频。相比通用视频固定任务,MTVG更关注面部细节和变化。make-up instruction step 通常包括细节的产品和面部区域差异,比如制作餐食和家具组装。因此,现有的通用方法无法准确定位目标活动。更 Specifically,现有的提议生成模块没有充分发展出提供semantic cue для更细化的化妆 semantic comprehension。为解决这个问题,我们提出了一种有效的提议基于框架,named Dual-Path Temporal Map Optimization Network (DPTMO),用于捕捉化妆活动的细化多模式semantic detail。DPTMO提取了both query-agnostic和query-guided特征,并使用特定的评估方法来构建两个提议集。与之前的方法不同,我们的双路结构可以在化妆视频中挖掘更多的semantic information,并且能够区分细化的动作。这两个候选集表示cross-modal makeup video-text similarity和多模式融合关系,互补 Each other。每个集对应了自己的优化视角,并且他们的联合预测可以提高视频时间戳预测的准确性。通过对YouMakeup dataset的全面实验,我们的提议双结构表明其在细化semantic comprehension中表现出色。
Elucidating the solution space of extended reverse-time SDE for diffusion models
paper_authors: Qinpeng Cui, Xinyi Zhang, Zongqing Lu, Qingmin Liao for:这种论文旨在提高Diffusion Model(DM)的采样速度,使其在不同的生成模型任务中表现更加突出。methods:这篇论文使用了扩展的反时域SDE(ER SDE)来表述采样过程,并提出了精确解和高级别approx解方法。results:实验结果表明,ER SDE Solvers可以减少采样时间,达到过去无法达到的水平,例如在ImageNet 64×64 dataset上取得2.24 FID在50次评估中,和3.45 FID在20次评估中。Abstract
Diffusion models (DMs) demonstrate potent image generation capabilities in various generative modeling tasks. Nevertheless, their primary limitation lies in slow sampling speed, requiring hundreds or thousands of sequential function evaluations through large neural networks to generate high-quality images. Sampling from DMs can be seen as solving corresponding stochastic differential equations (SDEs) or ordinary differential equations (ODEs). In this work, we formulate the sampling process as an extended reverse-time SDE (ER SDE), unifying prior explorations into ODEs and SDEs. Leveraging the semi-linear structure of ER SDE solutions, we offer exact solutions and arbitrarily high-order approximate solutions for VP SDE and VE SDE, respectively. Based on the solution space of the ER SDE, we yield mathematical insights elucidating the superior performance of ODE solvers over SDE solvers in terms of fast sampling. Additionally, we unveil that VP SDE solvers stand on par with their VE SDE counterparts. Finally, we devise fast and training-free samplers, ER-SDE Solvers, elevating the efficiency of stochastic samplers to unprecedented levels. Experimental results demonstrate achieving 3.45 FID in 20 function evaluations and 2.24 FID in 50 function evaluations on the ImageNet 64$\times$64 dataset.
摘要
Diffusion models (DMs) 表现出了杰出的图像生成能力在各种生成模型任务中。然而,它们的主要限制在于慢的采样速度,需要数百或千次的顺序函数评估通过大 neural network 来生成高质量图像。从 DMs 的采样角度来看,可以看作解决对应的随机 diffeomorphism equation (SDE) 或 ordinary differential equation (ODE)。在这种工作中,我们将采样过程推广为延长的反时间 SDE (ER SDE),整合先前的探索 ODE 和 SDE 中。利用 ER SDE 解的半线性结构,我们提供了精确解和高阶近似解 для VP SDE 和 VE SDE,分别。基于 ER SDE 的解空间,我们获得了数学意义,解释了 ODE 解算法在采样速度方面的超越性。此外,我们发现 VP SDE 解算法与其对应的 VE SDE 解算法一样强。最后,我们设计了快速和无需训练的采样器,ER-SDE Solvers,使得随机采样的效率提升到了历史最高水平。实验结果表明在 ImageNet 64$\times$64 数据集上,我们可以在 20 次函数评估中达到 3.45 FID,并在 50 次函数评估中达到 2.24 FID。
Certified Robust Models with Slack Control and Large Lipschitz Constants
results: 在CIFAR-10、CIFAR-100和Tiny-ImageNet上,模型consistently outperform其他损失函数,并在CIFAR-100和Tiny-ImageNet上提高了状态之最佳 deterministic $L_2$ 鲁棒精度。Abstract
Despite recent success, state-of-the-art learning-based models remain highly vulnerable to input changes such as adversarial examples. In order to obtain certifiable robustness against such perturbations, recent work considers Lipschitz-based regularizers or constraints while at the same time increasing prediction margin. Unfortunately, this comes at the cost of significantly decreased accuracy. In this paper, we propose a Calibrated Lipschitz-Margin Loss (CLL) that addresses this issue and improves certified robustness by tackling two problems: Firstly, commonly used margin losses do not adjust the penalties to the shrinking output distribution; caused by minimizing the Lipschitz constant $K$. Secondly, and most importantly, we observe that minimization of $K$ can lead to overly smooth decision functions. This limits the model's complexity and thus reduces accuracy. Our CLL addresses these issues by explicitly calibrating the loss w.r.t. margin and Lipschitz constant, thereby establishing full control over slack and improving robustness certificates even with larger Lipschitz constants. On CIFAR-10, CIFAR-100 and Tiny-ImageNet, our models consistently outperform losses that leave the constant unattended. On CIFAR-100 and Tiny-ImageNet, CLL improves upon state-of-the-art deterministic $L_2$ robust accuracies. In contrast to current trends, we unlock potential of much smaller models without $K=1$ constraints.
摘要
Despite recent success, state-of-the-art learning-based models remain highly vulnerable to input changes such as adversarial examples. In order to obtain certifiable robustness against such perturbations, recent work considers Lipschitz-based regularizers or constraints while at the same time increasing prediction margin. Unfortunately, this comes at the cost of significantly decreased accuracy. In this paper, we propose a Calibrated Lipschitz-Margin Loss (CLL) that addresses this issue and improves certified robustness by tackling two problems: Firstly, commonly used margin losses do not adjust the penalties to the shrinking output distribution; caused by minimizing the Lipschitz constant $K$. Secondly, and most importantly, we observe that minimization of $K$ can lead to overly smooth decision functions. This limits the model's complexity and thus reduces accuracy. Our CLL addresses these issues by explicitly calibrating the loss w.r.t. margin and Lipschitz constant, thereby establishing full control over slack and improving robustness certificates even with larger Lipschitz constants. On CIFAR-10, CIFAR-100 and Tiny-ImageNet, our models consistently outperform losses that leave the constant unattended. On CIFAR-100 and Tiny-ImageNet, CLL improves upon state-of-the-art deterministic $L_2$ robust accuracies. In contrast to current trends, we unlock potential of much smaller models without $K=1$ constraints.Here's the translation in Traditional Chinese:虽然最近的成果优秀,但现代学习型模型仍然具有对输入更改的高度敏感性,如攻击例子。以获取可认证的防护性,现在的工作会考虑使用Liп希茨基于的正则化或限制,同时增加预测margin。然而,这会导致减少精度。在这篇论文中,我们提出了单位Lipschitz-Margin损失函数(CLL),以解决这个问题,并提高认证性质证明。我们的CLL通过调整损失函数对margin和Lipschitz常数的控制,以确保全面控制 sobre slack,并提高防护性证明,即使Lipschitz常数较大。在CIFAR-10、CIFAR-100和Tiny-ImageNet上,我们的模型一致地超过不考虑K常数的损失函数。在CIFAR-100和Tiny-ImageNet上,CLL超过了现有的决定性$L_2$防护精度。相比于现有的趋势,我们解锁了许多小型模型的潜力,不需要$K=1$的限制。
Active Label Refinement for Semantic Segmentation of Satellite Images
results: 我们通过使用印度بengooru的卫星图像,发现活动标注更新可以提高semantic segmentation网络的性能。Abstract
Remote sensing through semantic segmentation of satellite images contributes to the understanding and utilisation of the earth's surface. For this purpose, semantic segmentation networks are typically trained on large sets of labelled satellite images. However, obtaining expert labels for these images is costly. Therefore, we propose to rely on a low-cost approach, e.g. crowdsourcing or pretrained networks, to label the images in the first step. Since these initial labels are partially erroneous, we use active learning strategies to cost-efficiently refine the labels in the second step. We evaluate the active learning strategies using satellite images of Bengaluru in India, labelled with land cover and land use labels. Our experimental results suggest that an active label refinement to improve the semantic segmentation network's performance is beneficial.
摘要
通过卫星图像 semantics 分割来理解和利用地球表面,这需要 semantic 分割网络在大量标注卫星图像上进行训练。然而,获得专业标注是成本高昂的。因此,我们提议使用低成本方法,如招募或预训练网络,来标注图像。由于这些初始标注有误,我们使用活动学习策略来经济性地修正标注。我们通过印度本革的孟买诺亚卫星图像进行实验,发现活动标注修正可以提高 semantic 分割网络的性能。
Improving Generalization Capability of Deep Learning-Based Nuclei Instance Segmentation by Non-deterministic Train Time and Deterministic Test Time Stain Normalization
for: This paper aims to improve the generalization capability of a deep learning-based automatic segmentation approach for nuclei instance segmentation in digital histopathological images.
methods: The proposed method incorporates non-deterministic train time and deterministic test time stain normalization, and uses one single training set to evaluate the segmentation performance on seven test datasets.
results: The proposed method provides up to 5.77%, 5.36%, and 5.27% better performance in segmenting nuclei based on Dice score, aggregated Jaccard index, and panoptic quality score, respectively, compared to the baseline segmentation model.Here’s the Chinese translation of the three key points:
results: 提议的方法在分割核实体方面提供了5.77%, 5.36%, 和5.27%的提升,根据 dice 分数、总和 Jacard 指标和精确性分数。Abstract
With the advent of digital pathology and microscopic systems that can scan and save whole slide histological images automatically, there is a growing trend to use computerized methods to analyze acquired images. Among different histopathological image analysis tasks, nuclei instance segmentation plays a fundamental role in a wide range of clinical and research applications. While many semi- and fully-automatic computerized methods have been proposed for nuclei instance segmentation, deep learning (DL)-based approaches have been shown to deliver the best performances. However, the performance of such approaches usually degrades when tested on unseen datasets. In this work, we propose a novel approach to improve the generalization capability of a DL-based automatic segmentation approach. Besides utilizing one of the state-of-the-art DL-based models as a baseline, our method incorporates non-deterministic train time and deterministic test time stain normalization. We trained the model with one single training set and evaluated its segmentation performance on seven test datasets. Our results show that the proposed method provides up to 5.77%, 5.36%, and 5.27% better performance in segmenting nuclei based on Dice score, aggregated Jaccard index, and panoptic quality score, respectively, compared to the baseline segmentation model.
摘要
To address this challenge, we propose a novel approach to improve the generalization capability of a DL-based automatic segmentation method. Our approach incorporates non-deterministic train time and deterministic test time stain normalization. We trained the model with a single training set and evaluated its segmentation performance on seven test datasets. Our results show that the proposed method provides up to 5.77%, 5.36%, and 5.27% better performance in segmenting nuclei based on Dice score, aggregated Jaccard index, and panoptic quality score, respectively, compared to the baseline segmentation model.
Towards Reliable Domain Generalization: A New Dataset and Evaluations
results: 现有方法在PaHCC dataset上的性能不满足,只有离开一个领域的协议是可靠的。我们的 dataset 和评估将为领域特征学社区带来新的视角,推动更大的进步。Abstract
There are ubiquitous distribution shifts in the real world. However, deep neural networks (DNNs) are easily biased towards the training set, which causes severe performance degradation when they receive out-of-distribution data. Many methods are studied to train models that generalize under various distribution shifts in the literature of domain generalization (DG). However, the recent DomainBed and WILDS benchmarks challenged the effectiveness of these methods. Aiming at the problems in the existing research, we propose a new domain generalization task for handwritten Chinese character recognition (HCCR) to enrich the application scenarios of DG method research. We evaluate eighteen DG methods on the proposed PaHCC (Printed and Handwritten Chinese Characters) dataset and show that the performance of existing methods on this dataset is still unsatisfactory. Besides, under a designed dynamic DG setting, we reveal more properties of DG methods and argue that only the leave-one-domain-out protocol is unreliable. We advocate that researchers in the DG community refer to dynamic performance of methods for more comprehensive and reliable evaluation. Our dataset and evaluations bring new perspectives to the community for more substantial progress. We will make our dataset public with the article published to facilitate the study of domain generalization.
摘要
“世界上存在普遍的分布Shift。然而,深度神经网络(DNNs)容易受到训练集的偏袋影响,导致接收不同分布的数据时表现下降。许多领域通用化(Domain Generalization,DG)的方法已经被研究,但是最近的DomainBed和WILDS benchmarks表明这些方法的有效性存在问题。为了解决现有研究中的问题,我们提出了一个新的领域通用化任务:手写中文字识别(HCCR),以推广DG方法的应用场景。我们在提出的PaHCC(印刷和手写中文字)数据集上评估了 eighteen 个 DG 方法的性能,并发现现有方法在这个数据集上的性能仍然不满足。此外,在我们设计的动态 DG 设定下,我们发现了更多的 DG 方法的性能特性,并认为只有离开一个领域的协议是不可靠的。我们建议研究人员在 DG 社区中参考方法的动态性能进行更全面和可靠的评估。我们将在发表文章时公开我们的数据集,以便研究领域通用化的人们进行更进一步的研究。”
Dynamic Visual Prompt Tuning for Parameter Efficient Transfer Learning
results: 实验结果显示,DVPT在许多下游识别任务上表现更好,甚至超过了完整的精确调整方法。另外,DVPT可以保持高度的参数效率,并且在17个中的19个下游任务上表现更好。我们将代码发布 soon。Abstract
Parameter efficient transfer learning (PETL) is an emerging research spot that aims to adapt large-scale pre-trained models to downstream tasks. Recent advances have achieved great success in saving storage and computation costs. However, these methods do not take into account instance-specific visual clues for visual tasks. In this paper, we propose a Dynamic Visual Prompt Tuning framework (DVPT), which can generate a dynamic instance-wise token for each image. In this way, it can capture the unique visual feature of each image, which can be more suitable for downstream visual tasks. We designed a Meta-Net module that can generate learnable prompts based on each image, thereby capturing dynamic instance-wise visual features. Extensive experiments on a wide range of downstream recognition tasks show that DVPT achieves superior performance than other PETL methods. More importantly, DVPT even outperforms full fine-tuning on 17 out of 19 downstream tasks while maintaining high parameter efficiency. Our code will be released soon.
摘要
Parameter efficient transfer learning (PETL) 是一个快速发展的研究领域,旨在适应大规模预训练模型下推断任务。近期的进展有效地降低了存储和计算成本。然而,这些方法没有考虑每个图像的特定视觉特征。在这篇论文中,我们提出了一种动态视觉提示调整框架(DVPT),可以生成每个图像的动态实例化 tokens。这种方法可以捕捉每个图像独特的视觉特征,更适合下游视觉任务。我们设计了一个 Meta-Net 模块,可以基于每个图像生成学习的提示,以捕捉动态实例化视觉特征。我们进行了广泛的实验,证明 DVPT 在多种下游认知任务上超过其他 PETL 方法表现。此外,DVPT 甚至超过了全部精细调整,在 17 个下游任务中表现优于其他方法。我们即将发布代码。
C-RITNet: Set Infrared and Visible Image Fusion Free from Complementary Information Mining
paper_authors: Yafei Zhang, Keying Du, Huafeng Li, Zhengtao Yu, Yu Liu
for: 本研究的目的是提出一种新的图像融合方法,以提高图像融合的质量和精度。
methods: 本研究使用了一种名为Complementary-Redundant Information Transfer Network(C-RITNet)的新网络模型,该模型可以有效地传输两个模式之间的相似和不同的信息,从而生成高质量的融合图像。
results: 对比于传统图像融合方法,C-RITNet可以更好地保留图像的细节和结构信息,同时也可以提高图像的明暗ratio和对比度。这意味着C-RITNet可以生成更加自然和真实的融合图像。Abstract
Infrared and visible image fusion (IVIF) aims to extract and integrate the complementary information in two different modalities to generate high-quality fused images with salient targets and abundant texture details. However, current image fusion methods go to great lengths to excavate complementary features, which is generally achieved through two efforts. On the one hand, the feature extraction network is expected to have excellent performance in extracting complementary information. On the other hand, complex fusion strategies are often designed to aggregate the complementary information. In other words, enabling the network to perceive and extract complementary information is extremely challenging. Complicated fusion strategies, while effective, still run the risk of losing weak edge details. To this end, this paper rethinks the IVIF outside the box, proposing a complementary-redundant information transfer network (C-RITNet). It reasonably transfers complementary information into redundant one, which integrates both the shared and complementary features from two modalities. Hence, the proposed method is able to alleviate the challenges posed by the complementary information extraction and reduce the reliance on sophisticated fusion strategies. Specifically, to skillfully sidestep aggregating complementary information in IVIF, we first design the mutual information transfer (MIT) module to mutually represent features from two modalities, roughly transferring complementary information into redundant one. Then, a redundant information acquisition supervised by source image (RIASSI) module is devised to further ensure the complementary-redundant information transfer after MIT. Meanwhile, we also propose a structure information preservation (SIP) module to guarantee that the edge structure information of the source images can be transferred to the fusion results.
摘要
infrared和可见图像融合(IVIF)目的是抽取并融合两个不同模式的补充信息,以生成高质量融合图像,具有突出的目标和丰富的текстура细节。然而,当前的图像融合方法通常会努力挖掘补充特征,通常通过两种方法来实现。一方面,特征提取网络应该具有出色的性能,以抽取补充信息。另一方面,复杂的融合策略通常会用于聚合补充信息。也就是说,使网络感受到和抽取补充信息是极其困难的。复杂的融合策略,虽然有效,仍然存在脆弱的边缘细节的风险。为了解决这些问题,本文尝试外部思考IVIF,提出一种补充-重复信息传输网络(C-RITNet)。它有理性地将补充信息转换为重复信息,并将两种模式的共享和补充特征融合在一起。因此,提出的方法可以减轻IVIF中补充信息抽取的挑战,降低复杂的融合策略的依赖。具体来说,为了绕过IVIF中补充信息的聚合,我们首先设计了共享信息传输模块(MIT),以便互相表示两种模式的特征,粗略地将补充信息转换为重复信息。然后,我们设计了源图像领导的重复信息获取模块(RIASSI),以确保补充-重复信息传输后MIT。同时,我们还提出了保持结构信息模块(SIP),以确保源图像的边缘结构信息可以传递到融合结果中。
HOC-Search: Efficient CAD Model and Pose Retrieval from RGB-D Scans
paper_authors: Stefan Ainetter, Sinisa Stekovic, Friedrich Fraundorfer, Vincent Lepetit for: 本文提出了一种自动化和高效的方法,用于从移动RGB-D摄像头捕获场景中的物体和其pose的高质量CAD模型。methods: 本文首先研究了不同的目标函数,用于度量候选CAD模型和可用数据之间的相似性,并发现最佳目标函数是一种”渲染并比较”方法, comparing depth和mask渲染。本文引入了一种快速搜索方法,该方法通过这个目标函数同时 retrieve object category, CAD模型和object pose,给出了一个approximate 3D bounding box。这种方法基于一个搜索树,该树将CAD模型和物体属性,包括物体类别和pose,组织成fast retrieval。results: 本文表明,这种方法可以很好地适应实际情况,与极限搜索相比,提高了10倍至120倍的速度。Abstract
We present an automated and efficient approach for retrieving high-quality CAD models of objects and their poses in a scene captured by a moving RGB-D camera. We first investigate various objective functions to measure similarity between a candidate CAD object model and the available data, and the best objective function appears to be a "render-and-compare" method comparing depth and mask rendering. We thus introduce a fast-search method that approximates an exhaustive search based on this objective function for simultaneously retrieving the object category, a CAD model, and the pose of an object given an approximate 3D bounding box. This method involves a search tree that organizes the CAD models and object properties including object category and pose for fast retrieval and an algorithm inspired by Monte Carlo Tree Search, that efficiently searches this tree. We show that this method retrieves CAD models that fit the real objects very well, with a speed-up factor of 10x to 120x compared to exhaustive search.
摘要
我们提出了一种自动化和高效的方法,用于从移动RGB-D摄像头捕捉的场景中检索高质量的CAD模型和其姿态。我们首先调查了各种目标函数,用于测量候选CAD对象模型和可用数据之间的相似性,最佳目标函数则是一种“渲染并比较”方法,该方法 Compares the depth and mask rendering of the candidate CAD object model with the available data. We therefore introduce a fast-search method that approximates an exhaustive search based on this objective function, which simultaneously retrieves the object category, a CAD model, and the pose of an object given an approximate 3D bounding box. This method uses a search tree that organizes the CAD models and object properties, including object category and pose, for fast retrieval, and an algorithm inspired by Monte Carlo Tree Search, which efficiently searches this tree. We show that this method retrieves CAD models that fit the real objects very well, with a speed-up factor of 10x to 120x compared to exhaustive search.
Can we predict the Most Replayed data of video streaming platforms?
results: 我们的结果显示,虽然只有在Random predictions之上微弱,但所有评估DL模型都超过了人类性能。此外,它们还超过了人类水平的准确率。这表明预测MR数据是一项具有挑战性的任务,可以通过DL的帮助进一步提高。最后,我们认为DL在MR数据预测方面的性能还可以进一步提高,例如通过多Modal learning。我们鼓励研究者使用我们的benchmark dataset进一步调查自动MR数据预测。Abstract
Predicting which specific parts of a video users will replay is important for several applications, including targeted advertisement placement on video platforms and assisting video creators. In this work, we explore whether it is possible to predict the Most Replayed (MR) data from YouTube videos. To this end, we curate a large video benchmark, the YTMR500 dataset, which comprises 500 YouTube videos with MR data annotations. We evaluate Deep Learning (DL) models of varying complexity on our dataset and perform an extensive ablation study. In addition, we conduct a user study to estimate the human performance on MR data prediction. Our results show that, although by a narrow margin, all the evaluated DL models outperform random predictions. Additionally, they exceed human-level accuracy. This suggests that predicting the MR data is a difficult task that can be enhanced through the assistance of DL. Finally, we believe that DL performance on MR data prediction can be further improved, for example, by using multi-modal learning. We encourage the research community to use our benchmark dataset to further investigate automatic MR data prediction.
摘要
预测视频用户会重复观看哪些 especific parts是重要的,有多种应用场景,如视频平台上的受argeted广告推送和助手视频创作者。在这种工作中,我们是否可以预测YouTube视频中的 Most Replayed(MR)数据?为此,我们筹集了一个大型视频 benchmark,即 YTMR500 数据集,该数据集包含 500 个 YouTube 视频MR数据注解。我们评估了不同复杂度的深度学习(DL)模型在我们的数据集上,并进行了广泛的减少研究。此外,我们还进行了用户研究,以估计人类在 MR 数据预测上的性能。我们的结果显示,虽然只有通过一定的窄 margins,所有评估的 DL 模型都高于随机预测。此外,它们还超过了人类水平的准确率。这表明预测 MR 数据是一个具有挑战性的任务,可以通过DL的协助进行提高。最后,我们认为DL在 MR 数据预测中的性能可以进一步提高,例如通过多Modal learning。我们鼓励研究社区使用我们的 benchmark 数据集进一步调查自动 MR 数据预测。
Estimating exercise-induced fatigue from thermal facial images
paper_authors: Manuel Lage Cañellas, Constantino Álvarez Casado, Le Nguyen, Miguel Bordallo López
for: 预测运动训练过程中的肥胖疲劳水平
methods: 使用深度学习模型和热成像技术对用户的面部和热图进行自动分析
results: 结果表明,只需要一幅热图,可以准确预测运动训练过程中的肥胖疲劳水平,误差在15%以下。这些结果表明使用热成像技术和深度学习模型可以实现可靠的肥胖疲劳预测。Abstract
Exercise-induced fatigue resulting from physical activity can be an early indicator of overtraining, illness, or other health issues. In this article, we present an automated method for estimating exercise-induced fatigue levels through the use of thermal imaging and facial analysis techniques utilizing deep learning models. Leveraging a novel dataset comprising over 400,000 thermal facial images of rested and fatigued users, our results suggest that exercise-induced fatigue levels could be predicted with only one static thermal frame with an average error smaller than 15\%. The results emphasize the viability of using thermal imaging in conjunction with deep learning for reliable exercise-induced fatigue estimation.
摘要
физи活动引起的疲劳可能是过度训练、疾病或其他健康问题的早期指标。在这篇文章中,我们提出了一种自动化的方法,通过使用热成像和面部分析技术,使用深度学习模型来估计 физи活动引起的疲劳水平。利用一个新的数据集,包括超过400,000个热成像的休息和疲劳用户的面部图像,我们的结果表明,只需要一帧热成像,可以准确地预测Physical activity-induced fatigue level,误差在15%以下。结果表明,使用热成像和深度学习可靠地估计Physical activity-induced fatigue level。
Plasticity-Optimized Complementary Networks for Unsupervised Continual Learning
results: 本研究的实验结果表明,提出的方法在 few-task 和 many-task 分割Setting中都能够达到其他自由 exemplar-free 方法的高度表现,并且在 semi-supervised continual learning Setting中也能够达到其他 exemplar-free 方法的表现水平。Abstract
Continuous unsupervised representation learning (CURL) research has greatly benefited from improvements in self-supervised learning (SSL) techniques. As a result, existing CURL methods using SSL can learn high-quality representations without any labels, but with a notable performance drop when learning on a many-tasks data stream. We hypothesize that this is caused by the regularization losses that are imposed to prevent forgetting, leading to a suboptimal plasticity-stability trade-off: they either do not adapt fully to the incoming data (low plasticity), or incur significant forgetting when allowed to fully adapt to a new SSL pretext-task (low stability). In this work, we propose to train an expert network that is relieved of the duty of keeping the previous knowledge and can focus on performing optimally on the new tasks (optimizing plasticity). In the second phase, we combine this new knowledge with the previous network in an adaptation-retrospection phase to avoid forgetting and initialize a new expert with the knowledge of the old network. We perform several experiments showing that our proposed approach outperforms other CURL exemplar-free methods in few- and many-task split settings. Furthermore, we show how to adapt our approach to semi-supervised continual learning (Semi-SCL) and show that we surpass the accuracy of other exemplar-free Semi-SCL methods and reach the results of some others that use exemplars.
摘要
A2V: A Semi-Supervised Domain Adaptation Framework for Brain Vessel Segmentation via Two-Phase Training Angiography-to-Venography Translation
results: 这个方法在磁共振影像和 Venography 上评估,实现了源领域中的最佳性能,并在目标领域中仅有8.9%的Dice分数下降,显示其在不同类型的脑血管图像分类中的可靠性和可行性。Abstract
We present a semi-supervised domain adaptation framework for brain vessel segmentation from different image modalities. Existing state-of-the-art methods focus on a single modality, despite the wide range of available cerebrovascular imaging techniques. This can lead to significant distribution shifts that negatively impact the generalization across modalities. By relying on annotated angiographies and a limited number of annotated venographies, our framework accomplishes image-to-image translation and semantic segmentation, leveraging a disentangled and semantically rich latent space to represent heterogeneous data and perform image-level adaptation from source to target domains. Moreover, we reduce the typical complexity of cycle-based architectures and minimize the use of adversarial training, which allows us to build an efficient and intuitive model with stable training. We evaluate our method on magnetic resonance angiographies and venographies. While achieving state-of-the-art performance in the source domain, our method attains a Dice score coefficient in the target domain that is only 8.9% lower, highlighting its promising potential for robust cerebrovascular image segmentation across different modalities.
摘要
我们提出了一种半监督频道适应框架,用于从不同影像模式下的脑血管分 segmentation。现有的state-of-the-art方法偏向单一模式,即使有许多可用的脑血管成像技术。这会导致分布shift的问题,这会负面影响 across modalities。我们的框架通过使用注解的 angiographies 和有限多个注解 venographies,实现了图像到图像的翻译和 semantics 的 segmentation,利用分离和具有 semantic richness的秘密空间来表示不同数据,并在目标频道上进行图像级适应。此外,我们减少了典型的 cycle-based 架构的复杂性,并减少了对 adversarial training 的使用,这使得我们可以构建一个高效和直观的模型,并保持稳定的训练。我们在核磁共振 angiographies 和 venographies 上评估了我们的方法,在源频道中达到了 state-of-the-art 性能,而在目标频道中,我们的方法实现了 Dice 分数系数,与目标频道的性能差异为8.9%,这表明我们的方法在不同模式下的脑血管图像分 segmentation中具有良好的潜力。
Batch Implicit Neural Representation for MRI Parallel Reconstruction
results: 提出了一种新的MRI重建方法,能够在不同的抽样率下实现自适应的图像重建,并且在公共可用的MRI数据集上进行了实验和比较,表明该方法在重建图像方面具有优于其他方法。Abstract
Magnetic resonance imaging (MRI) always suffered from the problem of long acquisition time. MRI reconstruction is one solution to reduce scan time by skipping certain phase-encoding lines and then restoring high-quality images from undersampled measurements. Recently, implicit neural representation (INR) has emerged as a new deep learning method that represents an object as a continuous function of spatial coordinates, and this function is normally parameterized by a multilayer perceptron (MLP). In this paper, we propose a novel MRI reconstruction method based on INR, which represents the fully-sampled images as the function of pixel coordinates and prior feature vectors of undersampled images for overcoming the generalization problem of INR. Specifically, we introduce a scale-embedded encoder to produce scale-independent pixel-specific features from MR images with different undersampled scales and then concatenate with coordinates vectors to recover fully-sampled MR images via an MLP, thus achieving arbitrary scale reconstruction. The performance of the proposed method was assessed by experimenting on publicly available MRI datasets and compared with other reconstruction methods. Our quantitative evaluation demonstrates the superiority of the proposed method over alternative reconstruction methods.
摘要
magnetic resonance imaging (MRI) 总是受到长时间探测问题的困扰。MRI重建是一种解决减少扫描时间的方法,通过跳过certain phase-encoding line并从不完全探测的数据中重建高质量图像。最近,implicit neural representation (INR) emerged as a new deep learning method, representing an object as a continuous function of spatial coordinates, and this function is normally parameterized by a multilayer perceptron (MLP).在这篇论文中,我们提出了一种基于 INR 的新的 MRI重建方法。specifically, we introduce a scale-embedded encoder to produce scale-independent pixel-specific features from MR images with different undersampled scales and then concatenate with coordinates vectors to recover fully-sampled MR images via an MLP, thus achieving arbitrary scale reconstruction.我们对公共可用的 MRI 数据集进行实验,与其他重建方法进行比较,并评估了我们的方法的性能。我们的量化评估表明了我们的方法的优越性。
Selection of contributing factors for predicting landslide susceptibility using machine learning and deep learning models
results: investigate the impact of selecting contributing factors on the accuracy of landslide susceptibility predictions using ML and DL models, and compare the performances of four methods for selecting contributing factors, including Information Gain Ratio (IGR), Recursive Feature Elimination (RFE), Particle Swarm Optimization (PSO), Least Absolute Shrinkage and Selection Operators (LASSO) and Harris Hawk Optimization (HHO), as well as autoencoder-based factor selection methods for DL models.Abstract
Landslides are a common natural disaster that can cause casualties, property safety threats and economic losses. Therefore, it is important to understand or predict the probability of landslide occurrence at potentially risky sites. A commonly used means is to carry out a landslide susceptibility assessment based on a landslide inventory and a set of landslide contributing factors. This can be readily achieved using machine learning (ML) models such as logistic regression (LR), support vector machine (SVM), random forest (RF), extreme gradient boosting (Xgboost), or deep learning (DL) models such as convolutional neural network (CNN) and long short time memory (LSTM). As the input data for these models, landslide contributing factors have varying influences on landslide occurrence. Therefore, it is logically feasible to select more important contributing factors and eliminate less relevant ones, with the aim of increasing the prediction accuracy of these models. However, selecting more important factors is still a challenging task and there is no generally accepted method. Furthermore, the effects of factor selection using various methods on the prediction accuracy of ML and DL models are unclear. In this study, the impact of the selection of contributing factors on the accuracy of landslide susceptibility predictions using ML and DL models was investigated. Four methods for selecting contributing factors were considered for all the aforementioned ML and DL models, which included Information Gain Ratio (IGR), Recursive Feature Elimination (RFE), Particle Swarm Optimization (PSO), Least Absolute Shrinkage and Selection Operators (LASSO) and Harris Hawk Optimization (HHO). In addition, autoencoder-based factor selection methods for DL models were also investigated. To assess their performances, an exhaustive approach was adopted,...
摘要
landalide是一种常见的自然灾害,可以导致人员伤亡、财产安全威胁和经济损失。因此,了解或预测潜在风险地点附近垃圾滥致的可能性非常重要。一种常用的方法是通过垃圾滥致评估基于垃圾发现和一组垃圾带动因素。这可以通过机器学习(ML)模型 such as logistic regression(LR)、support vector machine(SVM)、Random Forest(RF)、extreme gradient boosting(Xgboost)或深度学习(DL)模型 such as convolutional neural network(CNN)和long short time memory(LSTM)来实现。作为输入数据,垃圾带动因素具有不同的影响度,因此可以选择更重要的带动因素,并将不重要的带动因素排除,以提高这些模型的预测精度。然而,选择更重要的因素仍然是一项具有挑战性的任务,并没有一般accepted方法。此外,不同方法选择垃圾带动因素的影响于ML和DL模型的预测精度也未得到了清楚的回答。本研究旨在调查垃圾滥致可能性预测使用ML和DL模型时,选择垃圾带动因素的影响。本研究考虑了四种方法选择垃圾带动因素,包括Information Gain Ratio(IGR)、Recursive Feature Elimination(RFE)、Particle Swarm Optimization(PSO)和Least Absolute Shrinkage and Selection Operators(LASSO)以及Harris Hawk Optimization(HHO)。此外,对于深度学习模型,也调查了基于自适应器的因素选择方法。为了评估它们的表现,本研究采用了一种极客方法。
Real-Time Semantic Segmentation: A Brief Survey & Comparative Study in Remote Sensing
results: 实验结果表明,大多数现有的高效深度学习方法具有良好的分割质量,但它们具有较高的执行速率(即延迟率),这可能限制了它们在实时应用中的可行性。Abstract
Real-time semantic segmentation of remote sensing imagery is a challenging task that requires a trade-off between effectiveness and efficiency. It has many applications including tracking forest fires, detecting changes in land use and land cover, crop health monitoring, and so on. With the success of efficient deep learning methods (i.e., efficient deep neural networks) for real-time semantic segmentation in computer vision, researchers have adopted these efficient deep neural networks in remote sensing image analysis. This paper begins with a summary of the fundamental compression methods for designing efficient deep neural networks and provides a brief but comprehensive survey, outlining the recent developments in real-time semantic segmentation of remote sensing imagery. We examine several seminal efficient deep learning methods, placing them in a taxonomy based on the network architecture design approach. Furthermore, we evaluate the quality and efficiency of some existing efficient deep neural networks on a publicly available remote sensing semantic segmentation benchmark dataset, the OpenEarthMap. The experimental results of an extensive comparative study demonstrate that most of the existing efficient deep neural networks have good segmentation quality, but they suffer low inference speed (i.e., high latency rate), which may limit their capability of deployment in real-time applications of remote sensing image segmentation. We provide some insights into the current trend and future research directions for real-time semantic segmentation of remote sensing imagery.
摘要
实时 semantic segmentation of remote sensing imagery 是一项复杂的任务,需要考虑效率和准确性之间的平衡。它有许多应用,如追踪森林火灾、检测土地用途和土地覆盖变化、耕地健康监测等。随着计算机视觉中高效的深度学习方法(i.e., 高效的深度神经网络)的成功,研究人员在 remote sensing 图像分析中采用了这些高效的深度神经网络。本文从基本压缩方法的概述开始,并提供了一个简短但全面的报告,概述了最近的 remote sensing 图像semantic segmentation 技术的发展。我们对几种具有代表性的高效深度学习方法进行了分类,根据网络架构设计方法。此外,我们对一个公共可用的 remote sensing semantic segmentation 数据集(OpenEarthMap)进行了评估,并对一些现有的高效深度神经网络进行了比较性研究。实验结果表明,大多数现有的高效深度神经网络具有良好的分割质量,但它们具有低执行速度(即高延迟率),这可能限制它们在实时应用中的可用性。我们提供了一些关于当前趋势和未来研究方向的思考。
Federated Learning for Large-Scale Scene Modeling with Neural Radiance Fields
results: 在Mill19大规模景景集中进行实验,显示了提案的姿态调整和联邦学习管线的效果。Abstract
We envision a system to continuously build and maintain a map based on earth-scale neural radiance fields (NeRF) using data collected from vehicles and drones in a lifelong learning manner. However, existing large-scale modeling by NeRF has problems in terms of scalability and maintainability when modeling earth-scale environments. Therefore, to address these problems, we propose a federated learning pipeline for large-scale modeling with NeRF. We tailor the model aggregation pipeline in federated learning for NeRF, thereby allowing local updates of NeRF. In the aggregation step, the accuracy of the clients' global pose is critical. Thus, we also propose global pose alignment to align the noisy global pose of clients before the aggregation step. In experiments, we show the effectiveness of the proposed pose alignment and the federated learning pipeline on the large-scale scene dataset, Mill19.
摘要
我们想要构建一个系统,通过持续地使用地球规模神经辐射场(NeRF)收集自汽车和无人机数据来建立和维护地球规模环境的地球规模神经辐射场地图。然而,现有的大规模模型化使用NeRF时存在扩展性和可维护性问题,因此我们提议一种联邦学习管道来解决这些问题。我们修改了NeRF模型聚合管道,以便在联邦学习中进行本地更新。在聚合步骤中,客户端的全球姿态精度是关键的,因此我们还提议用于客户端全球姿态的对齐。在实验中,我们证明了我们提议的姿态对齐和联邦学习管道在大规模场景数据集Mill19上的效果。
A new meteor detection application robust to camera movements
results: 研究人员成功开发了一个能够在25帧每秒的实时运算下,实现10瓦的电力耗用的流星检测工具组合。Abstract
This article presents a new tool for the automatic detection of meteors. Fast Meteor Detection Toolbox (FMDT) is able to detect meteor sightings by analyzing videos acquired by cameras onboard weather balloons or within airplane with stabilization. The challenge consists in designing a processing chain composed of simple algorithms, that are robust to the high fluctuation of the videos and that satisfy the constraints on power consumption (10 W) and real-time processing (25 frames per second).
摘要
这篇文章介绍了一种新的自动探测陨星工具。它可以通过气球上或飞机内的摄像机记录的视频来探测陨星现象,并且可以通过稳定化来减少视频的波动。挑战在于设计一个简单的处理链,可以对视频进行稳定化,并且满足功耗控制(10 W)和实时处理(25帧每秒)的要求。
Learning from History: Task-agnostic Model Contrastive Learning for Image Restoration
paper_authors: Gang Wu, Junjun Jiang, Kui Jiang, Xianming Liu
for: 高级视觉任务的contrastive学习方法,以及低级视觉任务的启发式学习方法
methods: 自动生成适应性负样本,通过目标模型自身来准确地预测负样本
results: 对各种任务和架构进行重新训练,显著提高图像修复效果,比如FFANet和DehazeFormer在RESIDEindoordataset上的图像震扫比进行3.41dB的提升,同样在SPA-Data上的图像雨减比进行0.47dB的提升,以及在Manga109上的4倍缩放超解像比进行0.12dB的提升。Abstract
Contrastive learning has emerged as a prevailing paradigm for high-level vision tasks, which, by introducing properly negative samples, has also been exploited for low-level vision tasks to achieve a compact optimization space to account for their ill-posed nature. However, existing methods rely on manually predefined, task-oriented negatives, which often exhibit pronounced task-specific biases. In this paper, we propose a innovative approach for the adaptive generation of negative samples directly from the target model itself, called ``learning from history``. We introduce the Self-Prior guided Negative loss for image restoration (SPNIR) to enable this approach. Our approach is task-agnostic and generic, making it compatible with any existing image restoration method or task. We demonstrate the effectiveness of our approach by retraining existing models with SPNIR. The results show significant improvements in image restoration across various tasks and architectures. For example, models retrained with SPNIR outperform the original FFANet and DehazeFormer by 3.41 dB and 0.57 dB on the RESIDE indoor dataset for image dehazing. Similarly, they achieve notable improvements of 0.47 dB on SPA-Data over IDT for image deraining and 0.12 dB on Manga109 for a 4x scale super-resolution over lightweight SwinIR, respectively. Code and retrained models are available at https://github.com/Aitical/Task-agnostic_Model_Contrastive_Learning_Image_Restoration.
摘要
对比学习在高级视觉任务中成为主流方法,并且运用到低级视觉任务中以获得一个较为紧密的优化空间,以应对它们的不确定性。然而,现有的方法通常靠manual地定义任务特有的负面样本,这些样本经常会受到任务特有的偏见。在这篇论文中,我们提出了一种创新的方法,即“学习历史”,可以自动生成适当的负面样本。我们称之为Self-Prior guided Negative loss for image restoration(SPNIR)。我们的方法是任务不特定的和通用的,因此可以与任何现有的图像修复方法或任务搭配使用。我们通过 retrained 现有模型使用 SPNIR,以证明我们的方法的有效性。结果显示,使用 SPNIR retrained 模型可以在不同任务和架构上实现明显的图像修复改善。例如,使用 SPNIR retrained FFANet 和 DehazeFormer 可以在 RESIDE 室内dataset 上实现这些模型的3.41 dB和0.57 dB的视觉修复提升。同样地,它们在 SPA-Data 上运用 IDT 的图像雨侦测任务上也可以获得0.47 dB的提升。代码和 retrained 模型可以在 https://github.com/Aitical/Task-agnostic_Model_Contrastive_Learning_Image_Restoration 上找到。
Feature Aggregation Network for Building Extraction from High-resolution Remote Sensing Images
results: 广泛的实验表明,FANet在高分辨率卫星遥感图像中提取表面特征时表现出色,代表了Remote Sensing图像处理领域的一大突破。Abstract
The rapid advancement in high-resolution satellite remote sensing data acquisition, particularly those achieving submeter precision, has uncovered the potential for detailed extraction of surface architectural features. However, the diversity and complexity of surface distributions frequently lead to current methods focusing exclusively on localized information of surface features. This often results in significant intraclass variability in boundary recognition and between buildings. Therefore, the task of fine-grained extraction of surface features from high-resolution satellite imagery has emerged as a critical challenge in remote sensing image processing. In this work, we propose the Feature Aggregation Network (FANet), concentrating on extracting both global and local features, thereby enabling the refined extraction of landmark buildings from high-resolution satellite remote sensing imagery. The Pyramid Vision Transformer captures these global features, which are subsequently refined by the Feature Aggregation Module and merged into a cohesive representation by the Difference Elimination Module. In addition, to ensure a comprehensive feature map, we have incorporated the Receptive Field Block and Dual Attention Module, expanding the receptive field and intensifying attention across spatial and channel dimensions. Extensive experiments on multiple datasets have validated the outstanding capability of FANet in extracting features from high-resolution satellite images. This signifies a major breakthrough in the field of remote sensing image processing. We will release our code soon.
摘要
高解度卫星遥感数据获取的快速进步,尤其是实现 submeter 精度,揭示了表面建筑特征的详细EXTRACT的 potential.然而,表面分布的多样性和复杂性常常导致当前方法仅专注于本地特征信息。这经常导致边界认知的内部多样性和建筑之间的差异。因此,从高解度卫星遥感图像中细化EXTRACT表面特征成为了远程感知图像处理中的核心挑战。在这种情况下,我们提出了特征聚合网络(FANet),旨在EXTRACT高解度卫星遥感图像中的全球和本地特征。Pyramid Vision Transformer 捕捉到全球特征,然后通过特征聚合模块和差异消除模块将其融合成一个准确的表征。此外,为确保全面的表征地图,我们采用了抽象块和双向注意模块,扩展了受感器场和精细注意力。广泛的实验表明FANet在高解度卫星图像中EXTRACT特征的能力强大,这标志着远程感知图像处理领域的一次重要突破。我们即将发布代码。
TSSAT: Two-Stage Statistics-Aware Transformation for Artistic Style Transfer
paper_authors: Haibo Chen, Lei Zhao, Jun Li, Jian Yang
For: 这个论文的目的是提出一种基于人类绘画过程的艺术风格传递方法,以便创造出新的艺术图像。* Methods: 这个方法使用了两个新的损失函数:一个关注内容损失函数和一个块基于风格损失函数,以及一个两个阶段统计意识转换模块(TSSAT),用于在patchwise进行风格细节的替换,从而提高风格效果。* Results: 对比其他方法,这个方法可以更好地捕捉到艺术风格的多样性和多样性,同时保持内容图像的semantic关系。广泛的量化和质量实验证明了我们的方法的有效性。Abstract
Artistic style transfer aims to create new artistic images by rendering a given photograph with the target artistic style. Existing methods learn styles simply based on global statistics or local patches, lacking careful consideration of the drawing process in practice. Consequently, the stylization results either fail to capture abundant and diversified local style patterns, or contain undesired semantic information of the style image and deviate from the global style distribution. To address this issue, we imitate the drawing process of humans and propose a Two-Stage Statistics-Aware Transformation (TSSAT) module, which first builds the global style foundation by aligning the global statistics of content and style features and then further enriches local style details by swapping the local statistics (instead of local features) in a patch-wise manner, significantly improving the stylization effects. Moreover, to further enhance both content and style representations, we introduce two novel losses: an attention-based content loss and a patch-based style loss, where the former enables better content preservation by enforcing the semantic relation in the content image to be retained during stylization, and the latter focuses on increasing the local style similarity between the style and stylized images. Extensive qualitative and quantitative experiments verify the effectiveness of our method.
摘要
To address this issue, we propose a Two-Stage Statistics-Aware Transformation (TSSAT) module that imitates the drawing process of humans. The first stage builds a global style foundation by aligning the global statistics of content and style features. The second stage enriches local style details by swapping local statistics in a patch-wise manner, significantly improving stylization effects.Moreover, we introduce two novel losses to enhance both content and style representations: an attention-based content loss that enforces the semantic relation in the content image to be retained during stylization, and a patch-based style loss that focuses on increasing local style similarity between the style and stylized images. Extensive experiments demonstrate the effectiveness of our method.
ATTA: Anomaly-aware Test-Time Adaptation for Out-of-Distribution Detection in Segmentation
results: 我们在多个 OOD 分割benchmark上验证了我们的提议方法,包括具有显著领域变换的benchmark,并观察到了不同基础模型的表现改善。Abstract
Recent advancements in dense out-of-distribution (OOD) detection have primarily focused on scenarios where the training and testing datasets share a similar domain, with the assumption that no domain shift exists between them. However, in real-world situations, domain shift often exits and significantly affects the accuracy of existing out-of-distribution (OOD) detection models. In this work, we propose a dual-level OOD detection framework to handle domain shift and semantic shift jointly. The first level distinguishes whether domain shift exists in the image by leveraging global low-level features, while the second level identifies pixels with semantic shift by utilizing dense high-level feature maps. In this way, we can selectively adapt the model to unseen domains as well as enhance model's capacity in detecting novel classes. We validate the efficacy of our proposed method on several OOD segmentation benchmarks, including those with significant domain shifts and those without, observing consistent performance improvements across various baseline models.
摘要
最近的高精度异类检测技术主要集中在同一个领域下进行训练和测试数据集的情况下,假设没有领域转移。然而,在实际情况下,领域转移经常存在并对现有的异类检测模型的准确性产生重要影响。在这种情况下,我们提出了一种双级异类检测框架,同时处理领域转移和 semantic shift。第一级检测图像中是否存在领域转移,通过利用全像的低级别特征;第二级通过使用高级别特征图像来标识具有semantic shift的像素。这种方法可以使我们在未看过的领域中选择性地适应模型,同时提高模型的检测新类的能力。我们在多个OOD分割标准评测上验证了我们的提议方法,包括具有显著领域转移的和无领域转移的多个基eline模型,并观察到了一致的性能提高。
FLDNet: A Foreground-Aware Network for Polyp Segmentation Leveraging Long-Distance Dependencies
results: 比较现有方法表现出色,在常用的评价指标上达到了最高水平。Abstract
Given the close association between colorectal cancer and polyps, the diagnosis and identification of colorectal polyps play a critical role in the detection and surgical intervention of colorectal cancer. In this context, the automatic detection and segmentation of polyps from various colonoscopy images has emerged as a significant problem that has attracted broad attention. Current polyp segmentation techniques face several challenges: firstly, polyps vary in size, texture, color, and pattern; secondly, the boundaries between polyps and mucosa are usually blurred, existing studies have focused on learning the local features of polyps while ignoring the long-range dependencies of the features, and also ignoring the local context and global contextual information of the combined features. To address these challenges, we propose FLDNet (Foreground-Long-Distance Network), a Transformer-based neural network that captures long-distance dependencies for accurate polyp segmentation. Specifically, the proposed model consists of three main modules: a pyramid-based Transformer encoder, a local context module, and a foreground-Aware module. Multilevel features with long-distance dependency information are first captured by the pyramid-based transformer encoder. On the high-level features, the local context module obtains the local characteristics related to the polyps by constructing different local context information. The coarse map obtained by decoding the reconstructed highest-level features guides the feature fusion process in the foreground-Aware module of the high-level features to achieve foreground enhancement of the polyps. Our proposed method, FLDNet, was evaluated using seven metrics on common datasets and demonstrated superiority over state-of-the-art methods on widely-used evaluation measures.
摘要
giventext close association colorectal cancer polyps diagnosis identification colorectal polyps critical role detection surgical intervention context automatic detection segmentation various colonoscopy images emerged significant problem attracted broad attention current polyp segmentation techniques challenges firstly polyps vary size texture color pattern secondly boundaries polyps mucosa blurred existing studies focused learning local features ignoring long-range dependencies features ignoring local context global contextual information combined features address challenges propose FLDNet (Foreground-Long-Distance Network) Transformer-based neural network captures long-distance dependencies accurate polyp segmentation specifically proposed model consists three main modules pyramid-based Transformer encoder local context module foreground-Aware module multilevel features long-distance dependency information first captured pyramid-based transformer encoder high-level features local context module obtains local characteristics related polyps constructing different local context information coarse map obtained decoding reconstructed highest-level features guides feature fusion process foreground-Aware module achieve foreground enhancement polyps our proposed method FLDNet evaluated seven metrics common datasets demonstrated superiority state-of-art methods widely-used evaluation measures.
Self-supervised Extraction of Human Motion Structures via Frame-wise Discrete Features
results: 通过使用这些缺省的运动码,可以将人体运动的关系图示出来,并通过线性探测发现这些码在不同序列之间的差异。同时,这些运动码也可以用于多个认知任务中,并且可以达到与任务优化方法相同的性能水平。Abstract
The present paper proposes an encoder-decoder model for extracting the structures of human motions represented by frame-wise discrete features in a self-supervised manner. In the proposed method, features are extracted as codes in a motion codebook without the use of human knowledge, and the relationship between these codes can be visualized on a graph. Since the codes are expected to be temporally sparse compared to the captured frame rate and can be shared by multiple sequences, the proposed network model also addresses the need for training constraints. Specifically, the model consists of self-attention layers and a vector clustering block. The attention layers contribute to finding sparse keyframes and discrete features as motion codes, which are then extracted by vector clustering. The constraints are realized as training losses so that the same motion codes can be as contiguous as possible and can be shared by multiple sequences. In addition, we propose the use of causal self-attention as a method by which to calculate attention for long sequences consisting of numerous frames. In our experiments, the sparse structures of motion codes were used to compile a graph that facilitates visualization of the relationship between the codes and the differences between sequences. We then evaluated the effectiveness of the extracted motion codes by applying them to multiple recognition tasks and found that performance levels comparable to task-optimized methods could be achieved by linear probing.
摘要
现在的论文提出了一种encoder-decoder模型,用于自主监测人体动作的结构,通过frame-wise不同特征来自动提取这些结构。在提posed方法中,特征被视为码在动作码本中,而这些码之间的关系可以在图上可视化。由于码预计会比捕捉的帧率更加稀疏,因此提出的网络模型也解决了训练约束的需求。具体来说,模型包括自我注意层和 вектор聚合块。注意层帮助找到简短的关键帧和特征码,然后通过 вектор聚合提取这些码。约束被实现为训练损失,以便在多个序列之间共享相同的动作码。此外,我们还提出了使用 causal self-attention 来计算注意力的方法,以便对长序列中的多个帧进行注意力计算。在我们的实验中,通过使用稀疏的动作码,编译了一个图,方便了动作码之间和序列之间的关系的可视化。然后,我们通过将这些动作码应用到多个认知任务中,发现可以达到与任务优化方法相同的性能水平。
Beyond Generation: Harnessing Text to Image Models for Object Detection and Segmentation
for: This paper proposes a new approach to generate training data with accurate labels at scale for object detection and segmentation tasks.
methods: The proposed approach decouples training data generation into foreground object generation and contextually coherent background generation, using text-to-image synthesis frameworks such as DALL-E and Stable Diffusion.
results: The authors demonstrate the advantages of their approach on five object detection and segmentation datasets, including Pascal VOC and COCO, and show that detectors trained solely on synthetic data produced by their method achieve performance comparable to those trained on real data. Additionally, the authors emphasize the compositional nature of their data generation approach in out-of-distribution and zero-shot data generation scenarios.Abstract
We propose a new paradigm to automatically generate training data with accurate labels at scale using the text-to-image synthesis frameworks (e.g., DALL-E, Stable Diffusion, etc.). The proposed approach1 decouples training data generation into foreground object generation, and contextually coherent background generation. To generate foreground objects, we employ a straightforward textual template, incorporating the object class name as input prompts. This is fed into a text-to-image synthesis framework, producing various foreground images set against isolated backgrounds. A foreground-background segmentation algorithm is then used to generate foreground object masks. To generate context images, we begin by creating language descriptions of the context. This is achieved by applying an image captioning method to a small set of images representing the desired context. These textual descriptions are then transformed into a diverse array of context images via a text-to-image synthesis framework. Subsequently, we composite these with the foreground object masks produced in the initial step, utilizing a cut-and-paste method, to formulate the training data. We demonstrate the advantages of our approach on five object detection and segmentation datasets, including Pascal VOC and COCO. We found that detectors trained solely on synthetic data produced by our method achieve performance comparable to those trained on real data (Fig. 1). Moreover, a combination of real and synthetic data yields even much better results. Further analysis indicates that the synthetic data distribution complements the real data distribution effectively. Additionally, we emphasize the compositional nature of our data generation approach in out-of-distribution and zero-shot data generation scenarios. We open-source our code at https://github.com/gyhandy/Text2Image-for-Detection
摘要
我们提出一种新的思路,用于自动生成具有准确标签的训练数据量化,使用文本到图像合成框架(如DALL-E、稳定扩散等)。我们的方法将训练数据生成划分为前景对象生成和文脉凝合的背景生成。为前景对象生成,我们使用直观文本模板,将对象类名作为输入提示。这些文本模板被 feed into 文本到图像合成框架,生成多种前景图像,其中每个图像具有隔离的背景。然后,我们使用前景-背景分割算法生成前景对象mask。为生成背景图像,我们首先创建语言描述,用于描述愿意的背景。我们使用一小组图像的图像描述算法来生成这些语言描述。这些文本描述被转换成一个多样化的背景图像数组via 文本到图像合成框架。最后,我们将这些背景图像与前景对象mask进行组合,使用剪辑方法,以生成训练数据。我们在 Pascal VOC 和 COCO 等五个对象检测和分割数据集上示出了我们的方法的优势(图1)。我们发现,由我们的方法生成的假数据可以与实际数据一样高效地训练检测器。此外,我们发现在将实际数据和假数据混合在一起时,性能更好。我们还发现,我们的数据生成方法在尝试数据和零shot数据生成方面具有组合性。我们将代码开源在 GitHub 上,可以在 中下载。
Introducing Shape Prior Module in Diffusion Model for Medical Image Segmentation
results: compared to其他现有方法,VerseDiff-UNet 在精度方面表现出色,同时保持了医疗影像中自然的特征和变化。Abstract
Medical image segmentation is critical for diagnosing and treating spinal disorders. However, the presence of high noise, ambiguity, and uncertainty makes this task highly challenging. Factors such as unclear anatomical boundaries, inter-class similarities, and irrational annotations contribute to this challenge. Achieving both accurate and diverse segmentation templates is essential to support radiologists in clinical practice. In recent years, denoising diffusion probabilistic modeling (DDPM) has emerged as a prominent research topic in computer vision. It has demonstrated effectiveness in various vision tasks, including image deblurring, super-resolution, anomaly detection, and even semantic representation generation at the pixel level. Despite the robustness of existing diffusion models in visual generation tasks, they still struggle with discrete masks and their various effects. To address the need for accurate and diverse spine medical image segmentation templates, we propose an end-to-end framework called VerseDiff-UNet, which leverages the denoising diffusion probabilistic model (DDPM). Our approach integrates the diffusion model into a standard U-shaped architecture. At each step, we combine the noise-added image with the labeled mask to guide the diffusion direction accurately towards the target region. Furthermore, to capture specific anatomical a priori information in medical images, we incorporate a shape a priori module. This module efficiently extracts structural semantic information from the input spine images. We evaluate our method on a single dataset of spine images acquired through X-ray imaging. Our results demonstrate that VerseDiff-UNet significantly outperforms other state-of-the-art methods in terms of accuracy while preserving the natural features and variations of anatomy.
摘要
医疗图像分割是诊断和治疗脊椎疾病的关键。然而,高噪音、不确定性和潜在的混淆使得这项任务具有极高的挑战性。因为脊椎图像中的解剖边界不够明确,同类图像之间存在相似性,而且标注不具有合理性,这些因素都使得图像分割变得更加困难。为了支持医生在临床实践中,获得高精度和多样化的图像分割模板是必要的。在过去几年,推 diffusion probabilistic modeling(DDPM)在计算机视觉领域得到了广泛的关注。DDPM在多种视觉任务中表现出色,包括图像滤清、超分解、异常检测和甚至像素级semantic representation生成。虽然现有的推散模型在视觉生成任务中具有强大的稳定性,但它们仍然在精度束和其多种效果上困难。为了解决医疗图像分割中的精度和多样化问题,我们提出了一种综合框架called VerseDiff-UNet。我们的方法将推散模型与标准的U型架构结合起来。在每个步骤中,我们将噪声添加的图像和标注推散向导准确地指向目标区域。此外,为了从医疗图像中提取高效的结构semantic信息,我们添加了一个形状先验模块。这个模块高效地提取了输入脊椎图像中的结构semantic信息。我们在X射线成像获取的一个单个数据集上进行了测试。我们的结果表明,VerseDiff-UNet在精度方面与其他状态对比方法有显著的优势,同时保持了自然的特征和变化。
Deep evidential fusion with uncertainty quantification and contextual discounting for multimodal medical image segmentation
paper_authors: Ling Huang, Su Ruan, Pierre Decazes, Thierry Denoeux for:多modal医疗影像通常无法提供充分的信息,因此医生通常基于多modal影像进行诊断,如PET/CT。将多modal信息有效融合是重要的,以确保做出可靠的判断并解释judgment的决策过程。methods:本文提出了基于深度学习和Dempster-Shafer理论的多modal医疗影像分类框架。在这个框架中,每个单 modal 影像中的可靠性在分类不同的物体时被考虑。各个modal 影像的折扣 pieces of evidence 然后由Dempster’s rule 联合,以获得最终的决策。results:实验结果显示,我们的方法在PET-CT dataset中与悉尼肿瘤检测中的lymphomas,以及多modal MRI dataset中的脑肿瘤检测中,具有较高的精度和可靠性。Abstract
Single-modality medical images generally do not contain enough information to reach an accurate and reliable diagnosis. For this reason, physicians generally diagnose diseases based on multimodal medical images such as, e.g., PET/CT. The effective fusion of multimodal information is essential to reach a reliable decision and explain how the decision is made as well. In this paper, we propose a fusion framework for multimodal medical image segmentation based on deep learning and the Dempster-Shafer theory of evidence. In this framework, the reliability of each single modality image when segmenting different objects is taken into account by a contextual discounting operation. The discounted pieces of evidence from each modality are then combined by Dempster's rule to reach a final decision. Experimental results with a PET-CT dataset with lymphomas and a multi-MRI dataset with brain tumors show that our method outperforms the state-of-the-art methods in accuracy and reliability.
摘要
单Modal的医疗影像通常不含足够的信息以确定精确和可靠的诊断。为此, Physician 通常根据多Modal的医疗影像,如 PET/CT,诊断疾病。多Modal信息的有效融合是必要的,以确定可靠的决策并解释如何做出这个决策。在这篇论文中,我们提出了基于深度学习和德мп斯特-沙佛理论的融合框架,用于多Modal医疗影像分割。在这个框架中,每种单Modal图像在分割不同对象时的可靠性被考虑。每种模式的减少的证据后,通过德мп斯特规则组合,以达到最终决策。实验结果表明,使用 PET-CT 数据集和多MRI 数据集,我们的方法在准确性和可靠性方面都高于当前的方法。
Medical Image Segmentation with Belief Function Theory and Deep Learning
paper_authors: Ling Huang for:This paper focuses on medical image segmentation approaches using belief function theory and deep learning, specifically addressing the challenges of reasoning with imperfect information.methods:The paper presents a semi-supervised medical image segmentation framework that incorporates evidential segmentation and evidence fusion to reduce uncertainty caused by limited annotations. The authors also compare two evidential classifiers, evidential neural network and radial basis function network, and use them with deep neural networks to construct deep evidential models for lymphoma segmentation.results:The paper presents a multimodal medical image fusion framework that takes into account the reliability of each MR image source when performing different segmentation tasks using mass functions and contextual discounting. The authors show the effectiveness of belief function theory in uncertainty quantification and demonstrate the improved performance of their approach compared to traditional methods.Abstract
Deep learning has shown promising contributions in medical image segmentation with powerful learning and feature representation abilities. However, it has limitations for reasoning with and combining imperfect (imprecise, uncertain, and partial) information. In this thesis, we study medical image segmentation approaches with belief function theory and deep learning, specifically focusing on information modeling and fusion based on uncertain evidence. First, we review existing belief function theory-based medical image segmentation methods and discuss their advantages and challenges. Second, we present a semi-supervised medical image segmentation framework to decrease the uncertainty caused by the lack of annotations with evidential segmentation and evidence fusion. Third, we compare two evidential classifiers, evidential neural network and radial basis function network, and show the effectiveness of belief function theory in uncertainty quantification; we use the two evidential classifiers with deep neural networks to construct deep evidential models for lymphoma segmentation. Fourth, we present a multimodal medical image fusion framework taking into account the reliability of each MR image source when performing different segmentation tasks using mass functions and contextual discounting.
摘要
深度学习在医疗影像分割中表现出了扎实的贡献,具有强大的学习和特征表示能力。然而,深度学习在处理不准确、uncertain和部分的信息时存在限制。在这个论文中,我们研究医疗影像分割方法,使用信念函数理论和深度学习,特别是关于信息模型化和融合基于不确定证据。首先,我们查看了现有的信念函数理论基于的医疗影像分割方法,并讨论了它们的优点和挑战。其次,我们提出了一种半监督医疗影像分割框架,以减少由缺乏标注所导致的不确定性。第三,我们比较了两种信念分类器,即信念神经网络和卷积函数网络,并证明了信念函数理论在不确定评估中的效果。我们使用了两种信念分类器与深度神经网络构建了深度信念模型以进行淋巴癌分 segmentation。最后,我们提出了一种多Modal医疗影像融合框架,考虑每个MRI图像来源的可靠性,并在不同的 segmentation 任务中使用质量函数和上下文折扣来进行融合。
Enhancing Representation in Radiography-Reports Foundation Model: A Granular Alignment Algorithm Using Masked Contrastive Learning
results: 这个研究在六个常用的开源X射像数据集上进行了评估,结果显示MaCo比七种现有的方法在类别、分类和零shot阶段固定掌握中表现出色,demonstrating its great potential to promote a wide range of medical image analysis tasks。Abstract
Recently, multi-modal vision-language foundation models have gained significant attention in the medical field. While these models offer great opportunities, they still face a number of challenges, such as the requirement for fine-grained knowledge understanding in computer-aided diagnosis and capability of utilizing very limited or no task-specific labeled data in real-world clinical applications. In this study, we present MaCo, a novel multi-modal medical foundation model that explores masked contrastive learning to achieve granular alignment and zero-shot learning for a variety of medical imaging tasks. MaCo incorporates a correlation weighting mechanism to adjust the correlation between masked image patches and their corresponding reports, thereby enhancing the representation learning capabilities. We evaluate MaCo on six well-known open-source X-ray datasets, and the experimental results show it outperforms seven state-of-the-art approaches for classification, segmentation, and zero-shot phase grounding, demonstrating its great potential to promote a wide range of medical image analysis tasks.
摘要
近期,多模态视语言基础模型在医疗领域得到了广泛关注。虽然这些模型具有很大的潜力,但仍面临许多挑战,如计算机辅助诊断中细化知识理解的需求以及在实际临床应用中使用非常有限或无任务特定标注数据的能力。在本研究中,我们提出了MaCo,一种新的多模态医疗基础模型,利用掩码对比学习来实现精细对Alignment和零shot学习,以涵盖各种医学影像任务。MaCo通过调整掩码图像块和其相应的报告之间的相关性权重,进而提高了表征学习能力。我们在六个常见的开源X射线数据集上进行了实验,结果显示MaCo超过了七种现状最佳方法,包括分类、分割和零shot阶段定位,这表明它在各种医学影像分析任务中具有极大的潜力。
Adversarial Attacks Assessment of Salient Object Detection via Symbolic Learning
results: 研究发现,符号学习方法在面对恶意攻击和干扰时表现更好,与深度学习方法不同,它们在攻击下表现出了显著的性能下降。Abstract
Machine learning is at the center of mainstream technology and outperforms classical approaches to handcrafted feature design. Aside from its learning process for artificial feature extraction, it has an end-to-end paradigm from input to output, reaching outstandingly accurate results. However, security concerns about its robustness to malicious and imperceptible perturbations have drawn attention since its prediction can be changed entirely. Salient object detection is a research area where deep convolutional neural networks have proven effective but whose trustworthiness represents a significant issue requiring analysis and solutions to hackers' attacks. Brain programming is a kind of symbolic learning in the vein of good old-fashioned artificial intelligence. This work provides evidence that symbolic learning robustness is crucial in designing reliable visual attention systems since it can withstand even the most intense perturbations. We test this evolutionary computation methodology against several adversarial attacks and noise perturbations using standard databases and a real-world problem of a shorebird called the Snowy Plover portraying a visual attention task. We compare our methodology with five different deep learning approaches, proving that they do not match the symbolic paradigm regarding robustness. All neural networks suffer significant performance losses, while brain programming stands its ground and remains unaffected. Also, by studying the Snowy Plover, we remark on the importance of security in surveillance activities regarding wildlife protection and conservation.
摘要
machine learning 是现代科技中最重要的核心技术,其超越了经典的手动特征设计方法。除了学习过程中的人工特征提取外,它还具有从输入到输出的端到端模式,达到了极高的准确率。然而,由于其鲁棒性问题,它的预测结果可以被 entirely 改变。静观检测是一个研究领域,深度卷积神经网络在这个领域中表现出色,但是其可靠性问题吸引了广泛的关注。Brain programming 是一种符号学习,与传统的人工智能有着类似的思想。我们的工作提供了证据,表明符号学习的鲁棒性是设计可靠视觉注意力系统的关键。我们对多个敌意攻击和噪声扰动使用标准数据库和实际问题进行测试,并与五种深度学习方法进行比较。我们发现,深度学习方法在鲁棒性方面不能与符号学习相匹敌。所有神经网络都受到了重大的性能损失,而 brain programming 则保持不变。此外,通过研究雪亚鸥(Snowy Plover),我们强调了野生动物保护和保育活动中的安全问题的重要性。
Hierarchical Conditional Semi-Paired Image-to-Image Translation For Multi-Task Image Defect Correction On Shopping Websites
methods: 使用 Image-to-Image (I2I)翻译模型,具有高级别损害组和具体损害类型的注意机制,以干涯各种产品类型的损害图像。( using an Image-to-Image translation model with an attention mechanism to correct multiple defects of different product types)
results: 相比MONCE,our model reduces the Frechet Inception Distance(FID)by 24.6% on average,并在一个商业化的购物网站上降低了FID的平均值by 63.2% compared with WS-I2I。( compared with MONCE, our model reduces the Frechet Inception Distance by 24.6% on average, and reduces the average value of FID by 63.2% on a commercial shopping website)Abstract
On shopping websites, product images of low quality negatively affect customer experience. Although there are plenty of work in detecting images with different defects, few efforts have been dedicated to correct those defects at scale. A major challenge is that there are thousands of product types and each has specific defects, therefore building defect specific models is unscalable. In this paper, we propose a unified Image-to-Image (I2I) translation model to correct multiple defects across different product types. Our model leverages an attention mechanism to hierarchically incorporate high-level defect groups and specific defect types to guide the network to focus on defect-related image regions. Evaluated on eight public datasets, our model reduces the Frechet Inception Distance (FID) by 24.6% in average compared with MoNCE, the state-of-the-art I2I method. Unlike public data, another practical challenge on shopping websites is that some paired images are of low quality. Therefore we design our model to be semi-paired by combining the L1 loss of paired data with the cycle loss of unpaired data. Tested on a shopping website dataset to correct three image defects, our model reduces (FID) by 63.2% in average compared with WS-I2I, the state-of-the art semi-paired I2I method.
摘要
在购物网站上,产品图像质量低下影响用户体验。尽管有很多研究检测不同损害的图像,但几乎没有努力用于大规模修复这些损害。主要挑战是每种产品都有特定的损害类型,因此建立特定损害模型是不可能的。在这篇论文中,我们提出了一种通用的图像到图像(I2I)翻译模型,用于修复不同产品类型的多个损害。我们的模型利用注意力机制,将高级损害组和特定损害类型注入到网络中,以便指导网络专注于图像中的损害关键区域。经过八个公共数据集的评估,我们的模型与MoNCE,当前I2I方法的状态OF THE ARTS,相比下降24.6%的Frechet Inception Distance(FID)。与公共数据不同的另一个实际挑战是,一些购物网站上的对应图像质量较低。因此我们设计了我们的模型为半相对的,将L1损失与相对损失结合在一起。在一个购物网站数据集上测试,我们的模型可以降低(FID)的平均值为63.2%,相比WS-I2I,当前半相对I2I方法的状态OF THE ARTS。
methods: 本研究使用了深度神经网络模型,并提出了一种新的攻击方法 called DodgePersonation Attack,可以在FV系统中创造出可以欺骗人类识别的面部图像。此外,本研究还提出了一种新的攻击分类方法,可以帮助更好地理解和防范FV系统受到的攻击。
results: 本研究的结果显示,DodgePersonation Attack可以在一个well-known scenario中实现state-of-the-art的性能,可以创造出9张图像,覆盖43.82%的身份。此外,本研究还发现,使用9张图像可以覆盖57.27%到58.5%的身份,并且这些图像看起来都是完全一致的。Abstract
Face verification (FV) using deep neural network models has made tremendous progress in recent years, surpassing human accuracy and seeing deployment in various applications such as border control and smartphone unlocking. However, FV systems are vulnerable to Adversarial Attacks, which manipulate input images to deceive these systems in ways usually unnoticeable to humans. This paper provides an in-depth study of attacks on FV systems. We introduce the DodgePersonation Attack that formulates the creation of face images that impersonate a set of given identities while avoiding being identified as any of the identities in a separate, disjoint set. A taxonomy is proposed to provide a unified view of different types of Adversarial Attacks against FV systems, including Dodging Attacks, Impersonation Attacks, and Master Face Attacks. Finally, we propose the ''One Face to Rule Them All'' Attack which implements the DodgePersonation Attack with state-of-the-art performance on a well-known scenario (Master Face Attack) and which can also be used for the new scenarios introduced in this paper. While the state-of-the-art Master Face Attack can produce a set of 9 images to cover 43.82% of the identities in their test database, with 9 images our attack can cover 57.27% to 58.5% of these identifies while giving the attacker the choice of the identity to use to create the impersonation. Moreover, the 9 generated attack images appear identical to a casual observer.
摘要
面部验证(FV)使用深度神经网络模型已经在最近几年内做出了巨大的进步,超过了人类准确率并在不同的应用中部署,如边境控制和智能手机锁定。然而,FV系统受到了抗击攻击的威胁,这些攻击可以通过修改输入图像来让FV系统错误地认为图像是其他人的。本文提供了FV系统攻击的深入研究,我们介绍了创建一个名为“掩饰人格攻击”的新攻击方法,该方法可以在不同的人脸库中创建一个掩饰人格,以避免被识别为任何其他人的人。此外,我们还提出了一种攻击分类法,以提供对FV系统攻击的一般视角,包括掩饰攻击、冒充攻击和主人脸攻击。最后,我们提出了一种“一个人脸征服全部”的攻击,该攻击可以在一个已知的enario(主人脸攻击)中实现state-of-the-art性能,并且可以在新提出的scenario中使用。而现状的主人脸攻击可以生成9张图像,覆盖43.82%的人脸库,而我们的攻击可以生成9张图像,覆盖57.27%到58.5%的人脸库,同时给攻击者选择创建假装的人脸。此外,生成的9张攻击图像都看起来都是一般人类无法分辨的。
for: quantum computing, communication, and sensing
methods: combining Quantum Random Access Memory (QRAM) and quantum networks
results: significant benefits in efficiency, security, and precision for customers, with potential scientific and business opportunities in machine learning and big data industries.Here’s the text in Simplified Chinese:
for: 量子计算、量子通信和量子探测
methods: 结合量子随机访问存储器(QRAM)和量子网络
results: 为客户提供高效、安全、精度的 significative benefits,在机器学习和大数据行业中探讨可能的科学和商业机遇。Abstract
A quantum version of data centers might be significant in the quantum era. In this paper, we introduce Quantum Data Center (QDC), a quantum version of existing classical data centers, with a specific emphasis on combining Quantum Random Access Memory (QRAM) and quantum networks. We argue that QDC will provide significant benefits to customers in terms of efficiency, security, and precision, and will be helpful for quantum computing, communication, and sensing. We investigate potential scientific and business opportunities along this novel research direction through hardware realization and possible specific applications. We show the possible impacts of QDCs in business and science, especially the machine learning and big data industries.
摘要
一个量子版的数据中心可能在量子时代具有重要意义。在这篇论文中,我们介绍量子数据中心(QDC),量子版的现有古典数据中心,强调将量子随机访问存储(QRAM)和量子网络结合使用。我们认为QDC将为客户提供高效、安全和精度的好处,并将对量子计算、通信和探测做出重要贡献。我们通过硬件实现和可能的具体应用来研究这一新的研究方向的科学和商业机会。我们展示了QDC在业务和科学领域的可能的影响,特别是机器学习和大数据领域。
The Relational Bottleneck as an Inductive Bias for Efficient Abstraction
paper_authors: Taylor W. Webb, Steven M. Frankland, Awni Altabaa, Kamesh Krishnamurthy, Declan Campbell, Jacob Russin, Randall O’Reilly, John Lafferty, Jonathan D. Cohen
results: 研究表明,这种方法可以在数据效率的情况下induce抽象,并且可能是人类大脑中抽象概念的获得的机制。Abstract
A central challenge for cognitive science is to explain how abstract concepts are acquired from limited experience. This effort has often been framed in terms of a dichotomy between empiricist and nativist approaches, most recently embodied by debates concerning deep neural networks and symbolic cognitive models. Here, we highlight a recently emerging line of work that suggests a novel reconciliation of these approaches, by exploiting an inductive bias that we term the relational bottleneck. We review a family of models that employ this approach to induce abstractions in a data-efficient manner, emphasizing their potential as candidate models for the acquisition of abstract concepts in the human mind and brain.
摘要
中心挑战是让抽象概念从有限经验中获得。这一努力 часто被划分为经验主义和Native主义方法,最近在深度神经网络和符号认知模型之间展开了讨论。我们在这里强调一种新出现的工作,通过利用我们称为关系瓶颈的推理偏好来解决这一问题。我们评论了一家族模型,这些模型通过这种方法来从数据有效地获得抽象,强调它们在人类大脑和脑中抽象概念的获得中的潜在作用。
A Reinforcement Learning Approach for Robotic Unloading from Visual Observations
paper_authors: Vittorio Giammarino, Alberto Giammarino, Matthew Pearce
For: 本研究强调使用视觉输入自动解压包裹问题,以便机器人可以通过RGB-D图像来学习无需标注数据。* Methods: 我们提出了一种层次控制结构,其中高级决策模块和传统的运动控制结合在一起。高级模块通过深度征识学习(DRL)进行训练,并采用安全偏好机制和适应于这个任务的奖励函数。* Results: 我们的实验表明,这两个元素都对学习性能产生了关键作用。此外,为确保可重复性和未来研究的标准,我们提供了免费代码和 simulate。Abstract
In this work, we focus on a robotic unloading problem from visual observations, where robots are required to autonomously unload stacks of parcels using RGB-D images as their primary input source. While supervised and imitation learning have accomplished good results in these types of tasks, they heavily rely on labeled data, which are challenging to obtain in realistic scenarios. Our study aims to develop a sample efficient controller framework that can learn unloading tasks without the need for labeled data during the learning process. To tackle this challenge, we propose a hierarchical controller structure that combines a high-level decision-making module with classical motion control. The high-level module is trained using Deep Reinforcement Learning (DRL), wherein we incorporate a safety bias mechanism and design a reward function tailored to this task. Our experiments demonstrate that both these elements play a crucial role in achieving improved learning performance. Furthermore, to ensure reproducibility and establish a benchmark for future research, we provide free access to our code and simulation.
摘要
在这项工作中,我们关注了基于视觉观察的 роботизирован无需卸车问题,Robot需要通过RGB-D图像作为主要输入来自动卸车堆叠的包裹。而supervised学习和模仿学习在这类任务中已经达到了良好的结果,但它们强依赖于实际场景中困难获得的标注数据。我们的研究旨在开发一种样本效率高的控制框架,可以在学习过程中不需要标注数据。为此,我们提议一种层次控制结构, combining高级决策模块和传统的运动控制。高级模块通过深度循环学习(DRL)进行训练,并在这个过程中添加了安全偏好机制和适应到这个任务的奖励函数。我们的实验表明,这两个元素具有重要的作用,可以提高学习性能。此外,为确保可重现性和建立未来研究的标准,我们提供了免费的代码和 simulations。
CloudBrain-NMR: An Intelligent Cloud Computing Platform for NMR Spectroscopy Processing, Reconstruction and Analysis
results: 该平台可以快速地处理大量的NMR数据,并提供高精度的定量分析结果。此外,它还具有开放的API,可以让用户根据需要选择不同的处理流程和深度学习算法。Abstract
Nuclear Magnetic Resonance (NMR) spectroscopy has served as a powerful analytical tool for studying molecular structure and dynamics in chemistry and biology. However, the processing of raw data acquired from NMR spectrometers and subsequent quantitative analysis involves various specialized tools, which necessitates comprehensive knowledge in programming and NMR. Particularly, the emerging deep learning tools is hard to be widely used in NMR due to the sophisticated setup of computation. Thus, NMR processing is not an easy task for chemist and biologists. In this work, we present CloudBrain-NMR, an intelligent online cloud computing platform designed for NMR data reading, processing, reconstruction, and quantitative analysis. The platform is conveniently accessed through a web browser, eliminating the need for any program installation on the user side. CloudBrain-NMR uses parallel computing with graphics processing units and central processing units, resulting in significantly shortened computation time. Furthermore, it incorporates state-of-the-art deep learning-based algorithms offering comprehensive functionalities that allow users to complete the entire processing procedure without relying on additional software. This platform has empowered NMR applications with advanced artificial intelligence processing. CloudBrain-NMR is openly accessible for free usage at https://csrc.xmu.edu.cn/CloudBrain.html
摘要
核磁共振(NMR)分析是化学和生物研究中的一种强大工具,但是从NMR仪器上获取的原始数据处理和后续的量化分析需要各种专门的工具,这需要深入的编程和NMR知识。特别是在深度学习工具出现之前,NMR处理是非常困难的。为了解决这问题,我们提出了CloudBrain-NMR,一个基于云计算的智能在线平台,用于NMR数据的读取、处理、重构和量化分析。该平台通过浏览器访问,不需要用户安装任何软件。CloudBrain-NMR使用并行计算和图形处理器,实现了明显缩短计算时间。此外,它还包含了最新的深度学习算法,提供了全面的功能,让用户可以完成整个处理过程,不需要靠其他软件。这使得NMR应用程序得到了高级人工智能处理的 empowerment。CloudBrain-NMR是免费开放的,可以在https://csrc.xmu.edu.cn/CloudBrain.html中免费使用。
Hybrid Algorithm Selection and Hyperparameter Tuning on Distributed Machine Learning Resources: A Hierarchical Agent-based Approach
paper_authors: Ahmad Esmaeili, Julia T. Rayz, Eric T. Matson
for: This paper proposes a fully automatic and collaborative agent-based mechanism for selecting distributedly organized machine learning algorithms and simultaneously tuning their hyperparameters.
methods: The proposed method builds upon an existing agent-based hierarchical machine-learning platform and augments its query structure to support the aforementioned functionalities without being limited to specific learning, selection, and tuning mechanisms.
results: According to the results, our solution is totally correct and exhibits linear time and space complexity in relation to the size of available resources. The proposed method is also demonstrated to be effective in adapting and performing across a range of algorithmic options and datasets through experiments using a system comprised of 24 algorithms and 9 datasets.Abstract
Algorithm selection and hyperparameter tuning are critical steps in both academic and applied machine learning. On the other hand, these steps are becoming ever increasingly delicate due to the extensive rise in the number, diversity, and distributedness of machine learning resources. Multi-agent systems, when applied to the design of machine learning platforms, bring about several distinctive characteristics such as scalability, flexibility, and robustness, just to name a few. This paper proposes a fully automatic and collaborative agent-based mechanism for selecting distributedly organized machine learning algorithms and simultaneously tuning their hyperparameters. Our method builds upon an existing agent-based hierarchical machine-learning platform and augments its query structure to support the aforementioned functionalities without being limited to specific learning, selection, and tuning mechanisms. We have conducted theoretical assessments, formal verification, and analytical study to demonstrate the correctness, resource utilization, and computational efficiency of our technique. According to the results, our solution is totally correct and exhibits linear time and space complexity in relation to the size of available resources. To provide concrete examples of how the proposed methodologies can effectively adapt and perform across a range of algorithmic options and datasets, we have also conducted a series of experiments using a system comprised of 24 algorithms and 9 datasets.
摘要
algorithm 选择和超参数调整是学术应用机器学习中的关键步骤,然而这些步骤正在不断增加、多样化和分布化机器学习资源的情况下变得越来越细腻。在机器学习平台的设计中,多智能体系统带来了规模、灵活性和稳定性等特点。本文提出了一种完全自动和协作的智能体基于机制,用于分布式组织机器学习算法和同时调整其超参数。我们的方法基于现有的智能体层次机器学习平台,并将其查询结构改进以支持上述功能性能不受特定学习、选择和调整机制限制。我们已经进行了理论评估、正式验证和分析研究,以证明我们的技术是完全正确的,并且在资源大小的情况下具有线性时间和空间复杂度。为了让读者更好地理解我们的方法在不同算法和数据集上的应用和效果,我们还进行了一系列实验,使用了24种算法和9个数据集。
Rank2Tell: A Multimodal Driving Dataset for Joint Importance Ranking and Reasoning
results: 根据论文的描述,Rank2Tell 数据集和联合模型在视觉场景理解和相关领域中提供了一个有价值的资源,并在量化评估中达到了高水平的性能。Abstract
The widespread adoption of commercial autonomous vehicles (AVs) and advanced driver assistance systems (ADAS) may largely depend on their acceptance by society, for which their perceived trustworthiness and interpretability to riders are crucial. In general, this task is challenging because modern autonomous systems software relies heavily on black-box artificial intelligence models. Towards this goal, this paper introduces a novel dataset, Rank2Tell, a multi-modal ego-centric dataset for Ranking the importance level and Telling the reason for the importance. Using various close and open-ended visual question answering, the dataset provides dense annotations of various semantic, spatial, temporal, and relational attributes of various important objects in complex traffic scenarios. The dense annotations and unique attributes of the dataset make it a valuable resource for researchers working on visual scene understanding and related fields. Further, we introduce a joint model for joint importance level ranking and natural language captions generation to benchmark our dataset and demonstrate performance with quantitative evaluations.
摘要
广泛采用商业自动驾驶车(AV)和高级驾驶助手系统(ADAS)的普及可能受到社会的接受程度的限制,这个任务的难度在于现代自动驾驶系统软件借助黑盒人工智能模型。为达到这个目标,本文提出了一个新的数据集,即 Rank2Tell,这是一个多模态自我中心数据集,用于评估对象的重要性水平和说明其重要性的原因。该数据集通过多种close和开放式视觉问答来提供各种语义、空间、时间和关系特征的精密注释,使其成为研究视觉Scene理解和相关领域的价值资源。此外,我们还引入了一种共同模型,用于同时评估对象的重要性水平和生成自然语言描述,以 benchmark我们的数据集并进行评估性评价。
Do Generative Large Language Models need billions of parameters?
results: 这项研究提供了创新的工具和方法,可以创造更高效和有效的LLMs,为AI语言模型的可持续发展和普及做出了贡献。Abstract
This paper presents novel systems and methodologies for the development of efficient large language models (LLMs). It explores the trade-offs between model size, performance, and computational resources, with the aim of maximizing the efficiency of these AI systems. The research explores novel methods that allow different parts of the model to share parameters, reducing the total number of unique parameters required. This approach ensures that the model remains compact without sacrificing its ability to learn and represent complex language structures. This study provides valuable insights and tools for creating more efficient and effective LLMs, contributing to a more sustainable and accessible future for AI language modeling.
摘要
Translation notes:* "efficient" is translated as "高效" (gāo yè)* "large language models" is translated as "大型语言模型" (dà xíng yǔ yán módel)* "trade-offs" is translated as "负担" (zhāng dāng)* "compact" is translated as "减少" (jiǎn shǎo)* "sustainable" is translated as "可持续" (kě chéng xù)* "accessible" is translated as "可达" (kě dà)
HurriCast: An Automatic Framework Using Machine Learning and Statistical Modeling for Hurricane Forecasting
results: 实验表明,这种混合方法可以准确地模拟历史飓风行为,并提供详细的未来轨迹和强度预测,对风险管理策略提供了有价值的参考。Abstract
Hurricanes present major challenges in the U.S. due to their devastating impacts. Mitigating these risks is important, and the insurance industry is central in this effort, using intricate statistical models for risk assessment. However, these models often neglect key temporal and spatial hurricane patterns and are limited by data scarcity. This study introduces a refined approach combining the ARIMA model and K-MEANS to better capture hurricane trends, and an Autoencoder for enhanced hurricane simulations. Our experiments show that this hybrid methodology effectively simulate historical hurricane behaviors while providing detailed projections of potential future trajectories and intensities. Moreover, by leveraging a comprehensive yet selective dataset, our simulations enrich the current understanding of hurricane patterns and offer actionable insights for risk management strategies.
摘要
飓风在美国 pose 严重挑战,因为它们可能导致毁灭性的影响。为了降低这些风险,保险业是关键的,使用复杂的统计模型进行风险评估。然而,这些模型经常忽略风暴的时间和空间特征,并且由于数据缺乏,受限于。本研究提出了一种改进的方法,结合ARIMA模型和K-MEANS,以更好地捕捉风暴趋势,并使用自适应神经网络进行增强的风暴模拟。我们的实验表明,这种混合方法可以准确地模拟历史风暴行为,并提供详细的未来轨迹和强度预测。此外,通过利用全面而选择性的数据集,我们的模拟提高了现有风暴模式的理解,并提供了有价值的风险管理策略。
Hierarchical Multi-Task Learning Framework for Session-based Recommendations
results: 在两个Session-based recommendation数据集上,HierSRec比既有SBRS的next-item预测精度高,并且针对手动生成的候选项(例如4%的总ITEMS)进行可扩展的推理。Abstract
While session-based recommender systems (SBRSs) have shown superior recommendation performance, multi-task learning (MTL) has been adopted by SBRSs to enhance their prediction accuracy and generalizability further. Hierarchical MTL (H-MTL) sets a hierarchical structure between prediction tasks and feeds outputs from auxiliary tasks to main tasks. This hierarchy leads to richer input features for main tasks and higher interpretability of predictions, compared to existing MTL frameworks. However, the H-MTL framework has not been investigated in SBRSs yet. In this paper, we propose HierSRec which incorporates the H-MTL architecture into SBRSs. HierSRec encodes a given session with a metadata-aware Transformer and performs next-category prediction (i.e., auxiliary task) with the session encoding. Next, HierSRec conducts next-item prediction (i.e., main task) with the category prediction result and session encoding. For scalable inference, HierSRec creates a compact set of candidate items (e.g., 4% of total items) per test example using the category prediction. Experiments show that HierSRec outperforms existing SBRSs as per next-item prediction accuracy on two session-based recommendation datasets. The accuracy of HierSRec measured with the carefully-curated candidate items aligns with the accuracy of HierSRec calculated with all items, which validates the usefulness of our candidate generation scheme via H-MTL.
摘要
session-based recommender systems (SBRSs) 已经显示出了更高的推荐性能,但是多任务学习 (MTL) 已经被 SBRSs 采用以进一步提高其预测精度和泛化性。层次多任务学习 (H-MTL) 设置了一个层次结构 между预测任务和输出 auxiliary tasks 的输出。这个层次结构导致主任务的输入特征更加丰富,并且提高了预测结果的解释性,相比既有MTL框架。然而,H-MTL 框架在 SBRSs 中尚未被研究。在这篇论文中,我们提出了 HierSRec,它将 H-MTL 框架应用于 SBRSs。HierSRec 使用 metadata-aware Transformer 对给定的会话进行编码,然后使用会话编码进行下一个类型预测(即 auxiliary task)。接着,HierSRec 使用类型预测结果和会话编码进行下一个项目预测(即 main task)。为了可扩展的推理,HierSRec 创建了一个紧凑的候选项列表(例如,4% 的总项)每个测试示例,使用类型预测结果进行筛选。实验结果显示,HierSRec 在两个会话基于推荐数据集上的下一个项目预测精度上表现出色,与现有 SBRSs 相比。HierSRec 测试结果与我们精心准备的候选项列表相关,证明了我们的候选生成方案的有用性。
Minimum Bayes’ Risk Decoding for System Combination of Grammatical Error Correction Systems
results: 实验结果表明,使用提议的 MBR 解码方法可以提高 GEC 系统的性能,并且可以通过调整奖励度量来控制系统的精度、准确率和 F-score。Abstract
For sequence-to-sequence tasks it is challenging to combine individual system outputs. Further, there is also often a mismatch between the decoding criterion and the one used for assessment. Minimum Bayes' Risk (MBR) decoding can be used to combine system outputs in a manner that encourages better alignment with the final assessment criterion. This paper examines MBR decoding for Grammatical Error Correction (GEC) systems, where performance is usually evaluated in terms of edits and an associated F-score. Hence, we propose a novel MBR loss function directly linked to this form of criterion. Furthermore, an approach to expand the possible set of candidate sentences is described. This builds on a current max-voting combination scheme, as well as individual edit-level selection. Experiments on three popular GEC datasets and with state-of-the-art GEC systems demonstrate the efficacy of the proposed MBR approach. Additionally, the paper highlights how varying reward metrics within the MBR decoding framework can provide control over precision, recall, and the F-score in combined GEC systems.
摘要
For sequence-to-sequence tasks, it is challenging to combine individual system outputs. Furthermore, there is often a mismatch between the decoding criterion and the one used for assessment. Minimum Bayes' Risk (MBR) decoding can be used to combine system outputs in a manner that encourages better alignment with the final assessment criterion. This paper examines MBR decoding for Grammatical Error Correction (GEC) systems, where performance is usually evaluated in terms of edits and an associated F-score. Therefore, we propose a novel MBR loss function directly linked to this form of criterion. Additionally, an approach to expand the possible set of candidate sentences is described. This builds on a current max-voting combination scheme, as well as individual edit-level selection. Experiments on three popular GEC datasets and with state-of-the-art GEC systems demonstrate the efficacy of the proposed MBR approach. Moreover, the paper highlights how varying reward metrics within the MBR decoding framework can provide control over precision, recall, and the F-score in combined GEC systems.Note that the translation is in Simplified Chinese, which is the standard writing system used in mainland China. If you prefer Traditional Chinese, I can provide that as well.
Learning Disentangled Avatars with Hybrid 3D Representations
results: 本paper实现了人类模拟器的分解,并在不同应用中表现出色,如分解人体和服装、分解面孔和头发等。此外,paper还发现了可以将头发和服装转移到不同的身体形态上。Abstract
Tremendous efforts have been made to learn animatable and photorealistic human avatars. Towards this end, both explicit and implicit 3D representations are heavily studied for a holistic modeling and capture of the whole human (e.g., body, clothing, face and hair), but neither representation is an optimal choice in terms of representation efficacy since different parts of the human avatar have different modeling desiderata. For example, meshes are generally not suitable for modeling clothing and hair. Motivated by this, we present Disentangled Avatars~(DELTA), which models humans with hybrid explicit-implicit 3D representations. DELTA takes a monocular RGB video as input, and produces a human avatar with separate body and clothing/hair layers. Specifically, we demonstrate two important applications for DELTA. For the first one, we consider the disentanglement of the human body and clothing and in the second, we disentangle the face and hair. To do so, DELTA represents the body or face with an explicit mesh-based parametric 3D model and the clothing or hair with an implicit neural radiance field. To make this possible, we design an end-to-end differentiable renderer that integrates meshes into volumetric rendering, enabling DELTA to learn directly from monocular videos without any 3D supervision. Finally, we show that how these two applications can be easily combined to model full-body avatars, such that the hair, face, body and clothing can be fully disentangled yet jointly rendered. Such a disentanglement enables hair and clothing transfer to arbitrary body shapes. We empirically validate the effectiveness of DELTA's disentanglement by demonstrating its promising performance on disentangled reconstruction, virtual clothing try-on and hairstyle transfer. To facilitate future research, we also release an open-sourced pipeline for the study of hybrid human avatar modeling.
摘要
很大的努力已经投入到了人类动画和实际化人物模型的学习中。在这个领域,both explicit和implicit的3D表示都被广泛研究,以实现人类整体模型化和捕捉(例如,身体、衣服和头发),但 neither representation是优选的选择,因为不同的人物部分有不同的模型需求。例如,网格不适用于衣服和头发的模型。这种情况下,我们提出了Disentangled Avatars(DELTA),它使用了混合的explicit-implicit 3D表示来模型人类。DELTA通过一个灰度RGB视频输入,生成了一个人物模型,其中包括身体和衣服/头发层。具体来说,我们展示了两个重要的应用场景。在第一个应用场景中,我们考虑了人体和衣服的分离,在第二个应用场景中,我们分离了面孔和头发。为了实现这些应用场景,DELTA使用了一种由网格和神经辐射场组成的混合表示方法。为了实现这种方法,我们设计了一个端到端可微 differentiable 渲染器,该渲染器将网格 integrate into volumetric rendering,以便DELTA可以直接从灰度视频中学习,不需要任何3D监督。最后,我们表明了如何将这两个应用场景结合起来,以实现全身人物模型的分离和重新渲染。这种分离允许头发和衣服进行到任何身体形状的转移。我们通过实验证明了DELTA的分离性能的表现,包括分离重建、虚拟服装尝试和头发样式传输。为了促进未来的研究,我们还发布了一个开源的人类动画模型研究管道。
LEAP Hand: Low-Cost, Efficient, and Anthropomorphic Hand for Robot Learning
methods: 这 paper 使用了一种新的机械结构,使得手臂在不同的手势状态下仍能保持最大的dexterity。此外,paper 还使用了 Machine Learning 技术进行 manipulate 任务的学习。
results: 这 paper 的实验结果表明,LEAP Hand 可以在真实世界中完成多种抓取任务,包括视觉 теле操作和学习从无动视频数据。LEAP Hand 在所有实验中都表现出色,而且与最近的竞争对手 Allegro Hand 相比,它的成本为 1/8。Abstract
Dexterous manipulation has been a long-standing challenge in robotics. While machine learning techniques have shown some promise, results have largely been currently limited to simulation. This can be mostly attributed to the lack of suitable hardware. In this paper, we present LEAP Hand, a low-cost dexterous and anthropomorphic hand for machine learning research. In contrast to previous hands, LEAP Hand has a novel kinematic structure that allows maximal dexterity regardless of finger pose. LEAP Hand is low-cost and can be assembled in 4 hours at a cost of 2000 USD from readily available parts. It is capable of consistently exerting large torques over long durations of time. We show that LEAP Hand can be used to perform several manipulation tasks in the real world -- from visual teleoperation to learning from passive video data and sim2real. LEAP Hand significantly outperforms its closest competitor Allegro Hand in all our experiments while being 1/8th of the cost. We release detailed assembly instructions, the Sim2Real pipeline and a development platform with useful APIs on our website at https://leap-hand.github.io/
摘要
dexterous 操作已经是机器人领域的长期挑战。虽然机器学习技术已经显示了一定的承诺,但结果主要受到硬件的限制。在这篇论文中,我们介绍了LEAP手,一个低成本的手臂,用于机器学习研究。与之前的手臂不同,LEAP手具有新的骨骼结构,允许无论手指pose都能够达到最大的dexterity。LEAP手是低成本的,可以在4小时内为2000美元组装,使用可得到的部件。它可以在长时间内一直承受大的扭矩。我们表明LEAP手可以在真实世界中完成多种操作任务,从视觉操作到学习从无动视频数据和sim2real。LEAP手在所有实验中都能够superior于Allegro手,而且只有1/8的成本。我们在网站https://leap-hand.github.io/上发布了详细的组装指南,Sim2Real管道和开发平台,以及有用的API。
Unveiling the potential of large language models in generating semantic and cross-language clones
results: 研究发现,GPT-3模型在生成semantic和cross-language代码副本方面具有出色的表现,其中在semantic clones方面取得了62.14%的准确率和0.55 BLEU分数,在cross-language clones方面达到了91.25%的准确率。Abstract
Semantic and Cross-language code clone generation may be useful for code reuse, code comprehension, refactoring and benchmarking. OpenAI's GPT model has potential in such clone generation as GPT is used for text generation. When developers copy/paste codes from Stack Overflow (SO) or within a system, there might be inconsistent changes leading to unexpected behaviours. Similarly, if someone possesses a code snippet in a particular programming language but seeks equivalent functionality in a different language, a semantic cross-language code clone generation approach could provide valuable assistance. In this study, using SemanticCloneBench as a vehicle, we evaluated how well the GPT-3 model could help generate semantic and cross-language clone variants for a given fragment.We have comprised a diverse set of code fragments and assessed GPT-3s performance in generating code variants.Through extensive experimentation and analysis, where 9 judges spent 158 hours to validate, we investigate the model's ability to produce accurate and semantically correct variants. Our findings shed light on GPT-3's strengths in code generation, offering insights into the potential applications and challenges of using advanced language models in software development. Our quantitative analysis yields compelling results. In the realm of semantic clones, GPT-3 attains an impressive accuracy of 62.14% and 0.55 BLEU score, achieved through few-shot prompt engineering. Furthermore, the model shines in transcending linguistic confines, boasting an exceptional 91.25% accuracy in generating cross-language clones
摘要
semantic和跨语言代码倾Copy generation可能有用于代码重用、代码理解、重构和benchmarking。OpenAI的GPT模型有潜力在这种倾Copy generation中,因为GPT是用于文本生成。当开发者从Stack Overflow(SO)或系统中复制代码时,可能会出现不一致的更改,导致意外的行为。 Similarly,如果某人拥有一个代码段在特定编程语言中,但寻找相同的功能在不同语言中,semantic cross-language code clone generation方法可以提供有价值的帮助。在本研究中,使用SemanticCloneBench作为载体,我们评估了GPT-3模型在给定副本中是否可以生成Semantic和跨语言倾Copy变体。我们组织了一个多样化的代码副本,并评估GPT-3模型在生成代码变体方面的能力。经过广泛的实验和分析,9名判icator在158小时内验证了我们的结论,我们调查了模型在代码生成方面的能力。我们的数据分析得出了有力的结果。在semantic倾Copy领域,GPT-3达到了62.14%的精度和0.55 BLEU分数,通过几个极少的提示工程来实现。此外,模型在跨语言倾Copy方面表现出色,达到了91.25%的精度。
Exploring Large Language Models for Ontology Alignment
results: 初步发现,LLMs 可能会超越现有的 ontology alignment 系统 like BERTMap,但需要注意framwork和提示设计。Abstract
This work investigates the applicability of recent generative Large Language Models (LLMs), such as the GPT series and Flan-T5, to ontology alignment for identifying concept equivalence mappings across ontologies. To test the zero-shot performance of Flan-T5-XXL and GPT-3.5-turbo, we leverage challenging subsets from two equivalence matching datasets of the OAEI Bio-ML track, taking into account concept labels and structural contexts. Preliminary findings suggest that LLMs have the potential to outperform existing ontology alignment systems like BERTMap, given careful framework and prompt design.
摘要
这个研究探讨了最近的生成型大型自然语言模型(LLM),如GPT系列和Flan-T5,在ontology alignment中的可行性,以确定 Ontology 中的概念相似映射。为了测试 Flan-T5-XXL 和 GPT-3.5-turbo 的零学习性能,我们利用了 OAEI Bio-ML 跟踪中的两个等价匹配数据集,考虑概念标签和结构上下文。初步发现,LLM 有可能超越现有的ontology alignment系统BERTMap,提供精心设计的框架和提示。
results: 研究表明,这些机制可以使网络模型近似固定操作,如矩阵-向量乘法 $\phi(A,x) \rightarrow Ax$,并有应用于测试依赖关系或交互顺序在图模型中。Abstract
Can an $\mathbb{R}^n\rightarrow \mathbb{R}^n$ feedforward network learn matrix-vector multiplication? This study introduces two mechanisms - flexible masking to take matrix inputs, and a unique network pruning to respect the mask's dependency structure. Networks can approximate fixed operations such as matrix-vector multiplication $\phi(A,x) \rightarrow Ax$, motivating the mechanisms introduced with applications towards litmus-testing dependencies or interaction order in graph-based models.
摘要
可以不是$\mathbb{R}^n\to\mathbb{R}^n$的Feedforward网络学习矩阵-向量乘法吗?这个研究提出了两种机制——灵活的面 masking来处理矩阵输入,以及特殊的网络剔除来尊重面的依赖结构。网络可以近似固定操作,如矩阵-向量乘法$\phi(A,x)\to Ax$,这些机制的引入鼓励了在图模型中进行考验依赖关系或交互顺序。
Style2Fab: Functionality-Aware Segmentation for Fabricating Personalized 3D Models with Generative AI
results: 研究人员通过对1000个来自Thingiverse的3D模型进行质量分析,创建了一个功能分类体系,并使用这个体系进行分类,以及一个名为Style2Fab的系统,允许用户选择性地修改3D模型的艺术元素,而不会影响模型的原始功能。Abstract
With recent advances in Generative AI, it is becoming easier to automatically manipulate 3D models. However, current methods tend to apply edits to models globally, which risks compromising the intended functionality of the 3D model when fabricated in the physical world. For example, modifying functional segments in 3D models, such as the base of a vase, could break the original functionality of the model, thus causing the vase to fall over. We introduce a method for automatically segmenting 3D models into functional and aesthetic elements. This method allows users to selectively modify aesthetic segments of 3D models, without affecting the functional segments. To develop this method we first create a taxonomy of functionality in 3D models by qualitatively analyzing 1000 models sourced from a popular 3D printing repository, Thingiverse. With this taxonomy, we develop a semi-automatic classification method to decompose 3D models into functional and aesthetic elements. We propose a system called Style2Fab that allows users to selectively stylize 3D models without compromising their functionality. We evaluate the effectiveness of our classification method compared to human-annotated data, and demonstrate the utility of Style2Fab with a user study to show that functionality-aware segmentation helps preserve model functionality.
摘要
To develop this method, we first created a taxonomy of functionality in 3D models by analyzing 1000 models from a popular 3D printing repository, Thingiverse. We then developed a semi-automatic classification method to decompose 3D models into functional and aesthetic elements. We call this system Style2Fab, and it allows users to selectively stylize 3D models without compromising their functionality.We evaluated the effectiveness of our classification method compared to human-annotated data and demonstrated the utility of Style2Fab with a user study. Our results show that functionality-aware segmentation helps preserve the model's functionality, and users can use Style2Fab to selectively stylize 3D models without worrying about compromising their intended use.
Grounded Language Acquisition From Object and Action Imagery
paper_authors: James Robert Kubricht, Zhaoyuan Yang, Jianwei Qiu, Peter Henry Tu
for: 这 paper 的目的是研究如何使用深度学习方法来掌握视觉数据的Private语言表示。
methods: 该 paper 使用了 Referential game 环境和冲突学习环境来训练 Emergent Language(EL)Encoder/Decoder,并使用了神经机器翻译和随机森林分类来将symbolic表示转化为类别标签。
results: 该 paper 在object recognition和action recognition两个实验中使用了这些方法,并使用了Grad-CAM和t-SNE方法来解释symbols生成的含义。Abstract
Deep learning approaches to natural language processing have made great strides in recent years. While these models produce symbols that convey vast amounts of diverse knowledge, it is unclear how such symbols are grounded in data from the world. In this paper, we explore the development of a private language for visual data representation by training emergent language (EL) encoders/decoders in both i) a traditional referential game environment and ii) a contrastive learning environment utilizing a within-class matching training paradigm. An additional classification layer utilizing neural machine translation and random forest classification was used to transform symbolic representations (sequences of integer symbols) to class labels. These methods were applied in two experiments focusing on object recognition and action recognition. For object recognition, a set of sketches produced by human participants from real imagery was used (Sketchy dataset) and for action recognition, 2D trajectories were generated from 3D motion capture systems (MOVI dataset). In order to interpret the symbols produced for data in each experiment, gradient-weighted class activation mapping (Grad-CAM) methods were used to identify pixel regions indicating semantic features which contribute evidence towards symbols in learned languages. Additionally, a t-distributed stochastic neighbor embedding (t-SNE) method was used to investigate embeddings learned by CNN feature extractors.
摘要
深度学习方法在自然语言处理方面已经做出了很大的进步。这些模型生成的符号表达了各种多样化的知识,但是不清楚这些符号如何与世界数据相关联。在这篇论文中,我们探索了在私人语言表达中发展的私人语言(EL)编码器/解码器,并在两种不同的环境中训练这些模型:一种传统的引用游戏环境和一种对比学习环境。此外,我们还使用神经机器翻译和随机森林分类来转换符号表达(序列数字符号)为类别标签。这些方法在两个实验中应用,一个是对象识别实验(Sketchy dataset),另一个是动作识别实验(MOVI dataset)。为了解释在每个实验中生成的符号,我们使用梯度权重分布映射(Grad-CAM)方法来确定符号中含有哪些Semantic特征,以及这些特征对象的证据。此外,我们还使用了高度分布随机邻居嵌入(t-SNE)方法来调查由Convolutional Neural Networks(CNN)特征提取器学习的嵌入。
Learning Minimalistic Tsetlin Machine Clauses with Markov Boundary-Guided Pruning
results: 作者通过实验和理论分析,证明了该 scheme 可以充分利用上下文特定的独立性来找到 Markov boundary,并且可以提高 TM 的学习效率和准确性。Abstract
A set of variables is the Markov blanket of a random variable if it contains all the information needed for predicting the variable. If the blanket cannot be reduced without losing useful information, it is called a Markov boundary. Identifying the Markov boundary of a random variable is advantageous because all variables outside the boundary are superfluous. Hence, the Markov boundary provides an optimal feature set. However, learning the Markov boundary from data is challenging for two reasons. If one or more variables are removed from the Markov boundary, variables outside the boundary may start providing information. Conversely, variables within the boundary may stop providing information. The true role of each candidate variable is only manifesting when the Markov boundary has been identified. In this paper, we propose a new Tsetlin Machine (TM) feedback scheme that supplements Type I and Type II feedback. The scheme introduces a novel Finite State Automaton - a Context-Specific Independence Automaton. The automaton learns which features are outside the Markov boundary of the target, allowing them to be pruned from the TM during learning. We investigate the new scheme empirically, showing how it is capable of exploiting context-specific independence to find Markov boundaries. Further, we provide a theoretical analysis of convergence. Our approach thus connects the field of Bayesian networks (BN) with TMs, potentially opening up for synergies when it comes to inference and learning, including TM-produced Bayesian knowledge bases and TM-based Bayesian inference.
摘要
一个集合的变量是marks blanket的一个随机变量,如果它包含所有预测变量的信息,则称之为Markov bound。如果边界不能被缩小而失去有用的信息,则称之为Markov bound。标识随机变量的Markov bound是有利的,因为所有外部边界的变量都是 redundant。因此,Markov bound提供了一个优化的特征集。然而,从数据中学习Markov bound是困难的,因为如果一个或多个变量被从Markov bound中移除,外部边界上的变量可能会开始提供信息。相反,Markov bound内部的变量可能会停止提供信息。每个候选变量的真实角色只有在Markov bound已经被确定出来时才会表现出来。在这篇论文中,我们提出了一种新的Tsetlin Machine(TM)反馈方案,该方案附加了类型I和类型II反馈。方案使用了一个新的Finite State Automaton——Context-Specific Independence Automaton。机器学习以外的 automaton 可以学习随机变量的Markov bound,从而在TM学习过程中将其从TM中除除。我们对新方案进行了实验性研究,并证明了它可以利用上下文特定的独立性来找到Markov bound。我们还提供了一种理论分析的归一化。我们的方法因此将Bayesian networks(BN)和TM相连,可能会开拓新的可能性,包括TM生成的Bayesian知识库和TM基于Bayesian推理的推理。
AI4Food-NutritionFW: A Novel Framework for the Automatic Synthesis and Analysis of Eating Behaviours
paper_authors: Sergio Romero-Tapiador, Ruben Tolosana, Aythami Morales, Isabel Espinosa-Salinas, Gala Freixer, Julian Fierrez, Ruben Vera-Rodriguez, Enrique Carrillo de Santa Pau, Ana Ramírez de Molina, Javier Ortega-Garcia
results: 该论文通过自动评估饮食习惯中的健康指数,并实现了99.53%和99.60%的准确率和敏感度。I hope that helps! Let me know if you have any other questions.Abstract
Nowadays millions of images are shared on social media and web platforms. In particular, many of them are food images taken from a smartphone over time, providing information related to the individual's diet. On the other hand, eating behaviours are directly related to some of the most prevalent diseases in the world. Exploiting recent advances in image processing and Artificial Intelligence (AI), this scenario represents an excellent opportunity to: i) create new methods that analyse the individuals' health from what they eat, and ii) develop personalised recommendations to improve nutrition and diet under specific circumstances (e.g., obesity or COVID). Having tunable tools for creating food image datasets that facilitate research in both lines is very much needed. This paper proposes AI4Food-NutritionFW, a framework for the creation of food image datasets according to configurable eating behaviours. AI4Food-NutritionFW simulates a user-friendly and widespread scenario where images are taken using a smartphone. In addition to the framework, we also provide and describe a unique food image dataset that includes 4,800 different weekly eating behaviours from 15 different profiles and 1,200 subjects. Specifically, we consider profiles that comply with actual lifestyles from healthy eating behaviours (according to established knowledge), variable profiles (e.g., eating out, holidays), to unhealthy ones (e.g., excess of fast food or sweets). Finally, we automatically evaluate a healthy index of the subject's eating behaviours using multidimensional metrics based on guidelines for healthy diets proposed by international organisations, achieving promising results (99.53% and 99.60% accuracy and sensitivity, respectively). We also release to the research community a software implementation of our proposed AI4Food-NutritionFW and the mentioned food image dataset created with it.
摘要
现在,数百万个图像在社交媒体和网络平台上被分享。特别是,许多这些图像是由智能手机拍摄的食物图像,提供了关于个人的饮食信息。然而,饮食习惯直接关联了世界上许多最常见的疾病。利用最新的图像处理技术和人工智能(AI),这种情况表现出了优秀的机遇,可以:i) 创建新的方法,从饮食中获取个人健康信息,ii) 为特定情况(如肥胖或 COVID)提供个性化的饮食建议。有一个可调的工具集,用于创建饮食图像集,是研究这两个方面的非常需要。 本文提出了 AI4Food-NutritionFW 框架,用于创建饮食图像集,根据可配置的饮食习惯。 AI4Food-NutritionFW 模拟了一种用户友好、广泛的场景,在智能手机上拍摄图像。除框架外,我们还提供了一个唯一的饮食图像集,包含 4,800 个不同的每周饮食习惯,来自 15 个 profiles 和 1,200 个主题。特别是,我们考虑了遵循实际生活方式的健康饮食习惯(根据已知的知识)、变化 profiles(例如,吃外卖、假日),以及不健康的习惯(例如,过量快餐或糖果)。最后,我们自动评估主题的饮食习惯健康指数,使用多维度指标,基于国际组织提出的健康饮食指南,达到了非常有 promise 的结果(99.53% 和 99.60% 的准确率和敏感度,分别)。我们还向研究社区发布了我们所提出的 AI4Food-NutritionFW 和饮食图像集。
Transferability analysis of data-driven additive manufacturing knowledge: a case study between powder bed fusion and directed energy deposition
results: 研究表明,可以成功将LPBF处理技术中的数据驱动解决方案传递到DED处理技术中,并在不同数据表示、模型架构和模型参数层次上进行了成功传递。Abstract
Data-driven research in Additive Manufacturing (AM) has gained significant success in recent years. This has led to a plethora of scientific literature to emerge. The knowledge in these works consists of AM and Artificial Intelligence (AI) contexts that have not been mined and formalized in an integrated way. Moreover, no tools or guidelines exist to support data-driven knowledge transfer from one context to another. As a result, data-driven solutions using specific AI techniques are being developed and validated only for specific AM process technologies. There is a potential to exploit the inherent similarities across various AM technologies and adapt the existing solutions from one process or problem to another using AI, such as Transfer Learning. We propose a three-step knowledge transferability analysis framework in AM to support data-driven AM knowledge transfer. As a prerequisite to transferability analysis, AM knowledge is featurized into identified knowledge components. The framework consists of pre-transfer, transfer, and post-transfer steps to accomplish knowledge transfer. A case study is conducted between flagship metal AM processes. Laser Powder Bed Fusion (LPBF) is the source of knowledge motivated by its relative matureness in applying AI over Directed Energy Deposition (DED), which drives the need for knowledge transfer as the less explored target process. We show successful transfer at different levels of the data-driven solution, including data representation, model architecture, and model parameters. The pipeline of AM knowledge transfer can be automated in the future to allow efficient cross-context or cross-process knowledge exchange.
摘要
“数据驱动的研究在添加制造(AM)领域在最近几年内取得了重要成功。这导致了一大量的科学文献出现。这些文献中的知识包括AM和人工智能(AI)上下文的知识,它们没有被综合化和系统化地挖掘。此外,没有任何工具或指南来支持数据驱动知识的转移 между不同的上下文。因此,为了解决特定的AM过程技术中的问题,数据驱动解决方案使用特定的AI技术进行开发和验证。这有一定的潜在利用AM过程技术之间的共同特征,并将现有的解决方案从一个过程或问题中转移到另一个过程或问题中使用AI技术,例如传输学习。我们提议一个三步知识转移可行性分析框架在AM中支持数据驱动AM知识转移。在转移可行性分析之前,AM知识被特征化为识别出来的知识组件。该框架包括先转移、转移和后转移三个步骤,以完成知识转移。一个案例研究在标志性金属AM过程之间进行了转移。用激光粉末充电(LPBF)作为知识来源,因为它在应用AI方面更加成熟,而 Directed Energy Deposition(DED)更是一个未经探索的目标过程,这导致了知识转移的需求。我们在不同数据驱动解决方案的各级上成功进行了转移,包括数据表示、模型架构和模型参数。将来,AM知识转移管道可以被自动化,以实现跨上下文或跨过程的高效知识交换。”
Jersey Number Recognition using Keyframe Identification from Low-Resolution Broadcast Videos
paper_authors: Bavesh Balaji, Jerrin Bright, Harish Prakash, Yuhao Chen, David A Clausi, John Zelek
for: automatic jersey number detection in sports videos
methods: spatio-temporal network, multi-task loss function
results: significant increase in accuracy (37.81% and 37.70%)Abstract
Player identification is a crucial component in vision-driven soccer analytics, enabling various downstream tasks such as player assessment, in-game analysis, and broadcast production. However, automatically detecting jersey numbers from player tracklets in videos presents challenges due to motion blur, low resolution, distortions, and occlusions. Existing methods, utilizing Spatial Transformer Networks, CNNs, and Vision Transformers, have shown success in image data but struggle with real-world video data, where jersey numbers are not visible in most of the frames. Hence, identifying frames that contain the jersey number is a key sub-problem to tackle. To address these issues, we propose a robust keyframe identification module that extracts frames containing essential high-level information about the jersey number. A spatio-temporal network is then employed to model spatial and temporal context and predict the probabilities of jersey numbers in the video. Additionally, we adopt a multi-task loss function to predict the probability distribution of each digit separately. Extensive evaluations on the SoccerNet dataset demonstrate that incorporating our proposed keyframe identification module results in a significant 37.81% and 37.70% increase in the accuracies of 2 different test sets with domain gaps. These results highlight the effectiveness and importance of our approach in tackling the challenges of automatic jersey number detection in sports videos.
摘要
player identification是视觉驱动足球分析中的关键组件,允许多个下渠道任务,如玩家评估、游戏分析和直播生产。然而,自动从视频中检测篮球号码存在很多挑战,包括运动模糊、低分辨率、扭曲和遮挡。现有方法,使用空间变换网络、CNN和视觉变换器,在图像数据上表现出成功,但在真实世界视频数据上却表现不佳,因为篮球号码在大多数帧中不可见。因此,确定包含篮球号码的帧是关键的子问题。为解决这些问题,我们提议一种可靠的关键帧标识模块,该模块可以提取包含篮球号码的高级信息的帧。然后,我们采用了一种空间-时间网络,以模拟空间和时间上下文,并预测视频中篮球号码的概率。此外,我们采用了多任务损失函数,以预测每个数字的概率分布。我们在SoccerNet数据集进行了广泛的评估,结果表明,将我们提议的关锥帧标识模块integrated into our approach,可以提高视频中篮球号码自动检测的准确率,相对于不含该模块的情况,提高37.81%和37.70%。这些结果 highlights the effectiveness and importance of our approach in tackling the challenges of automatic jersey number detection in sports videos.
Enhancing Multi-modal Cooperation via Fine-grained Modality Valuation
results: 我们的方法可以有效地评估每个样本中每个模式的贡献,并实现了不同多模式模型中的显著提高。Abstract
One primary topic of multi-modal learning is to jointly incorporate heterogeneous information from different modalities. However, most models often suffer from unsatisfactory multi-modal cooperation, which could not jointly utilize all modalities well. Some methods are proposed to identify and enhance the worse learnt modality, but are often hard to provide the fine-grained observation of multi-modal cooperation at sample-level with theoretical support. Hence, it is essential to reasonably observe and improve the fine-grained cooperation between modalities, especially when facing realistic scenarios where the modality discrepancy could vary across different samples. To this end, we introduce a fine-grained modality valuation metric to evaluate the contribution of each modality at sample-level. Via modality valuation, we regretfully observe that the multi-modal model tends to rely on one specific modality, resulting in other modalities being low-contributing. We further analyze this issue and improve cooperation between modalities by enhancing the discriminative ability of low-contributing modalities in a targeted manner. Overall, our methods reasonably observe the fine-grained uni-modal contribution at sample-level and achieve considerable improvement on different multi-modal models.
摘要
(使用简化字符串)一个主要的多样性学习主题是将不同Modalities中的异质数据集合在一起。然而,大多数模型通常会受到不满意的多样性合作,无法有效地使用所有Modalities。一些方法可以识别和提高不好学习的Modalities,但是往往无法在样本水平提供细化的多样性合作观察。因此,我们需要合理地观察和改进多样性合作,尤其是在面临现实情况下,模态差异可能会随样本不同而变化。为此,我们引入细化的模态价值度量来评估每个模态的样本级贡献。通过模态价值评估,我们 regretfully 发现,多模态模型往往会依赖于一个具体的模态,导致其他模态成为低贡献的。我们进一步分析这一问题,并通过提高低贡献模态的推诊能力来改善多样性合作。总的来说,我们的方法可以合理地观察细化的单模态贡献,并在不同的多模态模型上实现显著改进。
methods: 文章通过一系列快速问题,探讨了“解释”和“相关的解释”的混淆性。文章拒绝了“解释”和“相关的解释”, argue that XAIxArt 是人类中心艺术的不安和害怕旧有作者和人类活动的回归。
results: 文章通过区分了 ornamentation 模型和 sense-making 模型来证明这一观点。Abstract
The position paper highlights the range of concerns that are engulfed in the injunction of explainable artificial intelligence in art (XAIxArt). Through a series of quick sub-questions, it points towards the ambiguities concerning 'explanation' and the postpositivist tradition of 'relevant explanation'. Rejecting both 'explanation' and 'relevant explanation', the paper takes a stance that XAIxArt is a symptom of insecurity of the anthropocentric notion of art and a nostalgic desire to return to outmoded notions of authorship and human agency. To justify this stance, the paper makes a distinction between an ornamentation model of explanation to a model of explanation as sense-making.
摘要
文章发表于XAIxArt的各种问题,包括'解释'和'有用的解释'的歧义,以及人类中心艺术的不安和宁静愿返回过时的作者和人类活动。文章根据解释模型和意义解释模型的区别,提出了这种姿势。Note: Please note that the translation is in Simplified Chinese, which is used in mainland China and Singapore, while Traditional Chinese is used in Taiwan, Hong Kong, and other parts of the world.
Unveiling Signle-Bit-Flip Attacks on DNN Executables
results: 发现深度学习模型执行器中存在广泛、严重(如单位位异常)和可传递的攻击表面,这些攻击表面不存在于高级深度学习框架中的模型 weights,可以让攻击者控制输出标签Abstract
Recent research has shown that bit-flip attacks (BFAs) can manipulate deep neural networks (DNNs) via DRAM Rowhammer exploitations. Existing attacks are primarily launched over high-level DNN frameworks like PyTorch and flip bits in model weight files. Nevertheless, DNNs are frequently compiled into low-level executables by deep learning (DL) compilers to fully leverage low-level hardware primitives. The compiled code is usually high-speed and manifests dramatically distinct execution paradigms from high-level DNN frameworks. In this paper, we launch the first systematic study on the attack surface of BFA specifically for DNN executables compiled by DL compilers. We design an automated search tool to identify vulnerable bits in DNN executables and identify practical attack vectors that exploit the model structure in DNN executables with BFAs (whereas prior works make likely strong assumptions to attack model weights). DNN executables appear more "opaque" than models in high-level DNN frameworks. Nevertheless, we find that DNN executables contain extensive, severe (e.g., single-bit flip), and transferrable attack surfaces that are not present in high-level DNN models and can be exploited to deplete full model intelligence and control output labels. Our finding calls for incorporating security mechanisms in future DNN compilation toolchains.
摘要
SCP: Scene Completion Pre-training for 3D Object Detection
results: 使用SCP方法可以使现有的状态体际3D探测器达到相同的性能,只需要20%的标注数据。Abstract
3D object detection using LiDAR point clouds is a fundamental task in the fields of computer vision, robotics, and autonomous driving. However, existing 3D detectors heavily rely on annotated datasets, which are both time-consuming and prone to errors during the process of labeling 3D bounding boxes. In this paper, we propose a Scene Completion Pre-training (SCP) method to enhance the performance of 3D object detectors with less labeled data. SCP offers three key advantages: (1) Improved initialization of the point cloud model. By completing the scene point clouds, SCP effectively captures the spatial and semantic relationships among objects within urban environments. (2) Elimination of the need for additional datasets. SCP serves as a valuable auxiliary network that does not impose any additional efforts or data requirements on the 3D detectors. (3) Reduction of the amount of labeled data for detection. With the help of SCP, the existing state-of-the-art 3D detectors can achieve comparable performance while only relying on 20% labeled data.
摘要
三维对象检测使用激光点云是计算机视觉、 робо控和自动驾驶等领域的基本任务。然而,现有的三维检测器均依赖于标注过的数据集,这些数据集的标注过程昂贵且容易出错。在这篇论文中,我们提出了场景完成预训练(SCP)方法,以提高三维对象检测器的性能,并且需要更少的标注数据。SCP具有以下三个优势:1. 提高点云模型的初始化。通过完善场景点云,SCP可以有效地捕捉城市环境中物体之间的空间和semantic关系。2. 消除需要更多数据集的需求。SCP作为一个有价值的辅助网络,不需要额外的努力或数据要求。3. 降低检测需要的标注数据量。通过SCP的帮助,现有的状态对检测器可以在20%标注数据的情况下实现相同的性能。
360$^\circ$ from a Single Camera: A Few-Shot Approach for LiDAR Segmentation
results: 该方法可以超过现有的标注效率方法的结果,并且在一些传统的完全监督分割网络之上还取得了更高的性能。Abstract
Deep learning applications on LiDAR data suffer from a strong domain gap when applied to different sensors or tasks. In order for these methods to obtain similar accuracy on different data in comparison to values reported on public benchmarks, a large scale annotated dataset is necessary. However, in practical applications labeled data is costly and time consuming to obtain. Such factors have triggered various research in label-efficient methods, but a large gap remains to their fully-supervised counterparts. Thus, we propose ImageTo360, an effective and streamlined few-shot approach to label-efficient LiDAR segmentation. Our method utilizes an image teacher network to generate semantic predictions for LiDAR data within a single camera view. The teacher is used to pretrain the LiDAR segmentation student network, prior to optional fine-tuning on 360$^\circ$ data. Our method is implemented in a modular manner on the point level and as such is generalizable to different architectures. We improve over the current state-of-the-art results for label-efficient methods and even surpass some traditional fully-supervised segmentation networks.
摘要
深度学习应用于激光数据受到不同感知器或任务的域隔差很强。为了使这些方法在不同数据上达到类似准确性,需要一个大规模的注意力标注数据集。然而,在实际应用中,标注数据昂贵和时间消耗。这些因素引发了各种研究label-efficient方法,但与完全监督方法之间仍有大的差距。因此,我们提出ImageTo360,一种高效的几极shot方法 для标签efficient LiDAR分割。我们的方法使用图像教师网络生成激光数据中的semantic预测,并在单个相机视图中使用这些预测来预训练LiDAR分割学生网络。我们的方法实现在点级别上,可以与不同的架构进行拓展。我们超越当前状态的域隔差标签方法,甚至超过了一些传统的完全监督分割网络。
A 3M-Hybrid Model for the Restoration of Unique Giant Murals: A Case Study on the Murals of Yongle Palace
paper_authors: Jing Yang, Nur Intan Raihana Ruhaiyem, Chichun Zhou
for: restore the Yongle Palace murals, which are valuable cultural heritage but have suffered damage
methods: propose a 3M-Hybrid model that leverages a pre-trained Vision Transformer model (VIT) and a multi-scale and multi-perspective strategy to address the challenges of domain bias and large defect restoration
results: improve SSIM and PSNR by 14.61% and 4.73%, respectively, compared to the best model among four representative CNN models, and achieve favorable results in the final restoration of giant murals.Here’s the full text in Simplified Chinese:
for: Restore the Yongle Palace murals, which are valuable cultural heritage but have suffered damage.
results: 与最佳四种表征 CNN 模型比较,提高 SSIM 和 PSNR 指标的提升率分别为14.61%和4.73%,并在大 défaut的最终 restore 问题上获得了良好的结果。Abstract
The Yongle Palace murals, as valuable cultural heritage, have suffered varying degrees of damage, making their restoration of significant importance. However, the giant size and unique data of Yongle Palace murals present challenges for existing deep-learning based restoration methods: 1) The distinctive style introduces domain bias in traditional transfer learning-based restoration methods, while the scarcity of mural data further limits the applicability of these methods. 2) Additionally, the giant size of these murals results in a wider range of defect types and sizes, necessitating models with greater adaptability. Consequently, there is a lack of focus on deep learning-based restoration methods for the unique giant murals of Yongle Palace. Here, a 3M-Hybrid model is proposed to address these challenges. Firstly, based on the characteristic that the mural data frequency is prominent in the distribution of low and high frequency features, high and low frequency features are separately abstracted for complementary learning. Furthermore, we integrate a pre-trained Vision Transformer model (VIT) into the CNN module, allowing us to leverage the benefits of a large model while mitigating domain bias. Secondly, we mitigate seam and structural distortion issues resulting from the restoration of large defects by employing a multi-scale and multi-perspective strategy, including data segmentation and fusion. Experimental results demonstrate the efficacy of our proposed model. In regular-sized mural restoration, it improves SSIM and PSNR by 14.61% and 4.73%, respectively, compared to the best model among four representative CNN models. Additionally, it achieves favorable results in the final restoration of giant murals.
摘要
永乐宫壁画,作为文化遗产,受到不同程度的损害,因此 restore 的重要性提高。然而,永乐宫壁画的巨大大小和独特数据带来了现有深度学习基于 restore 方法的挑战:1)宫壁画的特殊风格引入传统 transfer learning 基于 restore 方法中的领域偏见,而且宫壁画数据的罕见性更限制了这些方法的应用。2)此外,宫壁画的巨大大小导致了更多的缺陷类型和大小,需要更加适应性强的模型。因此,对于永乐宫壁画独特的巨大宫壁画,深度学习基于 restore 方法受到了不 enough 的关注。在这里,我们提出了一种3M-Hybrid模型,以解决这些挑战。首先,基于宫壁画数据频谱在低频和高频特征之间的分布,我们分别抽取了高频和低频特征进行 complementary learning。其次,我们将预训练的 Vision Transformer 模型(VIT)integrated 到 CNN 模块中,以利用大型模型的优势,同时避免领域偏见。其次,我们使用多比例和多视角策略来mitigate 修复大缺陷的问题,包括数据分割和融合。实验结果表明,我们提出的模型在 regular-sized 宫壁画修复中提高了 SSIM 和 PSNR 指标的值,相比最佳四种 CNN 模型,提高了14.61%和4.73%。此外,它在巨大宫壁画的最终修复中也获得了良好的结果。
Glancing Future for Simultaneous Machine Translation
results: 比 STRONG 基eline 高效,且适用于各种同域机器翻译方法Abstract
Simultaneous machine translation (SiMT) outputs translation while reading the source sentence. Unlike conventional sequence-to-sequence (seq2seq) training, existing SiMT methods adopt the prefix-to-prefix (prefix2prefix) training, where the model predicts target tokens based on partial source tokens. However, the prefix2prefix training diminishes the ability of the model to capture global information and introduces forced predictions due to the absence of essential source information. Consequently, it is crucial to bridge the gap between the prefix2prefix training and seq2seq training to enhance the translation capability of the SiMT model. In this paper, we propose a novel method that glances future in curriculum learning to achieve the transition from the seq2seq training to prefix2prefix training. Specifically, we gradually reduce the available source information from the whole sentence to the prefix corresponding to that latency. Our method is applicable to a wide range of SiMT methods and experiments demonstrate that our method outperforms strong baselines.
摘要
同时机器翻译(SiMT)输出翻译 mientras lee la oración de fuente. 与现有的序列到序列(seq2seq)训练不同,现有的 SiMT 方法采用了 prefix-to-prefix(prefix2prefix)训练,其中模型预测目标Token基于部分源Token。然而, prefix2prefix 训练减少了模型捕捉全局信息的能力,并导致强制预测因为缺少必要的源信息。因此,它是必要的 bridge seq2seq 训练和 prefix2prefix 训练,以提高 SiMT 模型的翻译能力。在这篇论文中,我们提出了一种新的方法,通过观察未来的劳动学习来实现这种过渡。具体来说,我们逐渐减少了可用的源信息从整个句子到对应的遅延。我们的方法适用于各种 SiMT 方法,并且实验表明,我们的方法超过了强大的基eline。
Robust-MBDL: A Robust Multi-branch Deep Learning Based Model for Remaining Useful Life Prediction and Operational Condition Identification of Rotating Machines
methods: 提posed system包括主要组件:1)LSTM自适应网络去噪振荡数据; 2)特征提取从去噪数据中生成时域、频域和时频基于特征; 3)一种新型和可靠的多支lines deep learning网络架构,利用多个特征
results: 对两个基准数据集XJTU-SY和PRONOSTIA进行了性能评估,结果表明我们提posed系统在RUL和CO预测方面准确率高,与当前最佳系统相比,具有实际应用潜力。Abstract
In this paper, a Robust Multi-branch Deep learning-based system for remaining useful life (RUL) prediction and condition operations (CO) identification of rotating machines is proposed. In particular, the proposed system comprises main components: (1) an LSTM-Autoencoder to denoise the vibration data; (2) a feature extraction to generate time-domain, frequency-domain, and time-frequency based features from the denoised data; (3) a novel and robust multi-branch deep learning network architecture to exploit the multiple features. The performance of our proposed system was evaluated and compared to the state-of-the-art systems on two benchmark datasets of XJTU-SY and PRONOSTIA. The experimental results prove that our proposed system outperforms the state-of-the-art systems and presents potential for real-life applications on bearing machines.
摘要
本文提出了一种基于深度学习的多支分支系统,用于预测旋转机器的剩余有用生命(RUL)和状况操作(CO)。特别是,提案的系统包括主要组成部分:1. LSTM自适应神经网络来排除振荡数据中的噪声;2. 特征提取来生成时域、频域和时频基于特征;3. 一种新的和可靠的多支分支深度学习网络架构,以利用多个特征。我们提出的系统的性能被评估并与现有系统进行比较,使用了两个XJTU-SY和PRONOSTIA的数据集。实验结果表明,我们的提案系统在RUL预测和CO识别方面表现出色,并有可能应用于真实的滚珍机器。
Measuring vagueness and subjectivity in texts: from symbolic to neural VAGO
paper_authors: Benjamin Icard, Vincent Claveau, Ghislain Atemezing, Paul Égré
for: 本研究旨在开发一种自动测量文本模糊和主观性的方法。
methods: 我们首先介绍了专家系统VAGO,然后在一个小 benchmark上证明了它在fact vs. opinion句子上的效果,然后在更大的法语新闻词汇库FreSaDa上进行了对比,并证明了讽刺文章中的主观标志更为常见。最后,我们基于BERT-like架构建立了一个神经网络版本VAGO,并通过LIME的解释工具来证明其对symbolic VAGO scores的增强和其他语言版本的生成的重要性。
results: 研究结果表明,神经网络版本VAGO在 FreSaDa 上的表现更好,并且可以增强 symbolic VAGO scores 的lexicons。此外,神经网络版本还可以生成其他语言版本,并且可以通过LIME的解释工具来了解它们的工作原理。Abstract
We present a hybrid approach to the automated measurement of vagueness and subjectivity in texts. We first introduce the expert system VAGO, we illustrate it on a small benchmark of fact vs. opinion sentences, and then test it on the larger French press corpus FreSaDa to confirm the higher prevalence of subjective markers in satirical vs. regular texts. We then build a neural clone of VAGO, based on a BERT-like architecture, trained on the symbolic VAGO scores obtained on FreSaDa. Using explainability tools (LIME), we show the interest of this neural version for the enrichment of the lexicons of the symbolic version, and for the production of versions in other languages.
摘要
我们提出了一种混合方法来自动量化文本中的uncertainty和主观性。我们首先介绍了专家系统VAGO,然后在一个小的对比实验中使用它对fact vs. opinion句子进行了示例,然后在更大的法国报纸词汇 corpus FreSaDa 上进行了测试,以确认在幽默 VS. 常规文本中的主观标记的更高频率。然后,我们建立了一个基于BERT-like架构的神经网络副本,并使用LIME Explainability工具来展示其在符号式 VAGO 分数中的利用性和在其他语言中生成版本的可能性。
JOADAA: joint online action detection and action anticipation
paper_authors: Mohammed Guermal, Francois Bremond, Rui Dai, Abid Ali
for: 这两个任务的缺失完整的知识集(过去、当前和未来)使得推断动作依赖关系困难,从而影响性能。
methods: 我们提议将这两个任务 fusion into a single uniform architecture,通过结合动作预测和在线动作检测,以捕捉未来信息的潜在相互关系。
results: 我们在三个挑战性 dataset(THUMOS’14、CHARADES和Multi-THUMOS)上验证了我们的提议模型(JOADAA),并 achieved SOTA results for both tasks。Abstract
Action anticipation involves forecasting future actions by connecting past events to future ones. However, this reasoning ignores the real-life hierarchy of events which is considered to be composed of three main parts: past, present, and future. We argue that considering these three main parts and their dependencies could improve performance. On the other hand, online action detection is the task of predicting actions in a streaming manner. In this case, one has access only to the past and present information. Therefore, in online action detection (OAD) the existing approaches miss semantics or future information which limits their performance. To sum up, for both of these tasks, the complete set of knowledge (past-present-future) is missing, which makes it challenging to infer action dependencies, therefore having low performances. To address this limitation, we propose to fuse both tasks into a single uniform architecture. By combining action anticipation and online action detection, our approach can cover the missing dependencies of future information in online action detection. This method referred to as JOADAA, presents a uniform model that jointly performs action anticipation and online action detection. We validate our proposed model on three challenging datasets: THUMOS'14, which is a sparsely annotated dataset with one action per time step, CHARADES, and Multi-THUMOS, two densely annotated datasets with more complex scenarios. JOADAA achieves SOTA results on these benchmarks for both tasks.
摘要
<> translate="no"Action anticipation involves forecasting future actions by connecting past events to future ones. However, this reasoning ignores the real-life hierarchy of events, which is composed of three main parts: past, present, and future. We argue that considering these three main parts and their dependencies could improve performance. On the other hand, online action detection is the task of predicting actions in a streaming manner. In this case, one has access only to the past and present information. Therefore, in online action detection (OAD), the existing approaches miss semantics or future information, which limits their performance. To sum up, for both of these tasks, the complete set of knowledge (past-present-future) is missing, which makes it challenging to infer action dependencies, therefore having low performances. To address this limitation, we propose to fuse both tasks into a single uniform architecture. By combining action anticipation and online action detection, our approach can cover the missing dependencies of future information in online action detection. This method, referred to as JOADAA, presents a uniform model that jointly performs action anticipation and online action detection. We validate our proposed model on three challenging datasets: THUMOS'14, which is a sparsely annotated dataset with one action per time step, CHARADES, and Multi-THUMOS, two densely annotated datasets with more complex scenarios. JOADAA achieves SOTA results on these benchmarks for both tasks.Note: I've kept the original text's sentence structure and vocabulary as much as possible, but some words and phrases may have been adjusted slightly to fit the Simplified Chinese grammar and idiomatic expressions.
LEyes: A Lightweight Framework for Deep Learning-Based Eye Tracking using Synthetic Eye Images
paper_authors: sean anthony byrne, virmarie maquiling, marcus nyström, enkelejda kasneci, diederick c. niehorster
For: This paper aims to address the problem of inadequate training datasets for gaze estimation techniques, which has hindered the deployment of deep learning models in real-world applications.* Methods: The proposed framework, called Light Eyes (LEyes), uses simple light distributions to model key image features required for video-based eye tracking, facilitating easy configuration for training neural networks across diverse gaze-estimation tasks.* Results: The authors demonstrate that models trained using LEyes outperform other state-of-the-art algorithms in terms of pupil and CR localization across well-known datasets, and a LEyes trained model outperforms the industry standard eye tracker using significantly more cost-effective hardware.Abstract
Deep learning has bolstered gaze estimation techniques, but real-world deployment has been impeded by inadequate training datasets. This problem is exacerbated by both hardware-induced variations in eye images and inherent biological differences across the recorded participants, leading to both feature and pixel-level variance that hinders the generalizability of models trained on specific datasets. While synthetic datasets can be a solution, their creation is both time and resource-intensive. To address this problem, we present a framework called Light Eyes or "LEyes" which, unlike conventional photorealistic methods, only models key image features required for video-based eye tracking using simple light distributions. LEyes facilitates easy configuration for training neural networks across diverse gaze-estimation tasks. We demonstrate that models trained using LEyes outperform other state-of-the-art algorithms in terms of pupil and CR localization across well-known datasets. In addition, a LEyes trained model outperforms the industry standard eye tracker using significantly more cost-effective hardware. Going forward, we are confident that LEyes will revolutionize synthetic data generation for gaze estimation models, and lead to significant improvements of the next generation video-based eye trackers.
摘要
Fidelity-Induced Interpretable Policy Extraction for Reinforcement Learning
results: 在 StarCraft II 复杂的控制环境中进行实验,FIPE 方法在交互性性和一致性两个方面都超过了基eline,同时易于理解。Abstract
Deep Reinforcement Learning (DRL) has achieved remarkable success in sequential decision-making problems. However, existing DRL agents make decisions in an opaque fashion, hindering the user from establishing trust and scrutinizing weaknesses of the agents. While recent research has developed Interpretable Policy Extraction (IPE) methods for explaining how an agent takes actions, their explanations are often inconsistent with the agent's behavior and thus, frequently fail to explain. To tackle this issue, we propose a novel method, Fidelity-Induced Policy Extraction (FIPE). Specifically, we start by analyzing the optimization mechanism of existing IPE methods, elaborating on the issue of ignoring consistency while increasing cumulative rewards. We then design a fidelity-induced mechanism by integrate a fidelity measurement into the reinforcement learning feedback. We conduct experiments in the complex control environment of StarCraft II, an arena typically avoided by current IPE methods. The experiment results demonstrate that FIPE outperforms the baselines in terms of interaction performance and consistency, meanwhile easy to understand.
摘要
Specifically, we start by analyzing the optimization mechanism of existing IPE methods, highlighting the issue of ignoring consistency while increasing cumulative rewards. We then design a fidelity-induced mechanism by integrating a fidelity measurement into the reinforcement learning feedback. We conduct experiments in the complex control environment of StarCraft II, an environment typically avoided by current IPE methods. The experiment results demonstrate that FIPE outperforms the baselines in terms of interaction performance and consistency, while being easy to understand.
A Machine Learning Framework to Deconstruct the Primary Drivers for Electricity Market Price Events
results: 研究结果表明,价格峰值事件的主要驱动因素包括可再生能源含量、天气因素和市场操作因素等。这些结果可以用于多个重要的市场设计、可再生发电和干预、运营和网络安全应用。Abstract
Power grids are moving towards 100% renewable energy source bulk power grids, and the overall dynamics of power system operations and electricity markets are changing. The electricity markets are not only dispatching resources economically but also taking into account various controllable actions like renewable curtailment, transmission congestion mitigation, and energy storage optimization to ensure grid reliability. As a result, price formations in electricity markets have become quite complex. Traditional root cause analysis and statistical approaches are rendered inapplicable to analyze and infer the main drivers behind price formation in the modern grid and markets with variable renewable energy (VRE). In this paper, we propose a machine learning-based analysis framework to deconstruct the primary drivers for price spike events in modern electricity markets with high renewable energy. The outcomes can be utilized for various critical aspects of market design, renewable dispatch and curtailment, operations, and cyber-security applications. The framework can be applied to any ISO or market data; however, in this paper, it is applied to open-source publicly available datasets from California Independent System Operator (CAISO) and ISO New England (ISO-NE).
摘要
《电力网络在向100%可再生能源源扩展方向上进行变革,电力市场的运营和供应链也在不断发生变化。电力市场不仅经济地派发资源,还考虑了多种可控行为,如可再生能源减少、传输拥堵缓解和能量存储优化,以确保网络可靠性。因此,电力市场价格的形成变得非常复杂。传统的根本原因分析和统计方法在现代网络和市场中变得无法应用,用于分析和推导现代电力市场价格主要驱动力的主要驱动力。在这篇论文中,我们提出一种基于机器学习的分析框架,以分解现代电力市场价格峰值事件的主要驱动力。这些结果可以用于各种关键应用,如市场设计、可再生发电和减少、运营和网络安全应用。这种框架可以应用于任何ISO或市场数据,但在这篇论文中,它被应用于公开可用的CAISO和ISO-NE数据集。》Note: Please note that the translation is in Simplified Chinese, which is used in mainland China and Singapore, while Traditional Chinese is used in Taiwan, Hong Kong, and Macau.
BatMan-CLR: Making Few-shots Meta-Learners Resilient Against Label Noise
results: 研究结果显示,当meta-training中存在 label noise时, gradient-based $N$-way $K$-shot learners 的准确率可以下降达42%。而通过使用 manifold (Man) 和 batch manifold (BatMan) 采样技术,可以减少 meta-testing 中 label noise 的影响,并限制meta-testing准确率下降在${2.5}$, ${9.4}$, ${1.1}$ percent points 之间。Abstract
The negative impact of label noise is well studied in classical supervised learning yet remains an open research question in meta-learning. Meta-learners aim to adapt to unseen learning tasks by learning a good initial model in meta-training and consecutively fine-tuning it according to new tasks during meta-testing. In this paper, we present the first extensive analysis of the impact of varying levels of label noise on the performance of state-of-the-art meta-learners, specifically gradient-based $N$-way $K$-shot learners. We show that the accuracy of Reptile, iMAML, and foMAML drops by up to 42% on the Omniglot and CifarFS datasets when meta-training is affected by label noise. To strengthen the resilience against label noise, we propose two sampling techniques, namely manifold (Man) and batch manifold (BatMan), which transform the noisy supervised learners into semi-supervised ones to increase the utility of noisy labels. We first construct manifold samples of $N$-way $2$-contrastive-shot tasks through augmentation, learning the embedding via a contrastive loss in meta-training, and then perform classification through zeroing on the embedding in meta-testing. We show that our approach can effectively mitigate the impact of meta-training label noise. Even with 60% wrong labels \batman and \man can limit the meta-testing accuracy drop to ${2.5}$, ${9.4}$, ${1.1}$ percent points, respectively, with existing meta-learners across the Omniglot, CifarFS, and MiniImagenet datasets.
摘要
研究者们已经广泛研究了经典的超级学习中的标签噪声的负面影响,但在元学习领域中,这个问题仍然是一个开放的研究问题。元学习者希望通过学习一个好的初始模型,并在新任务上进行细化调整来适应未看过的学习任务。在这篇论文中,我们提供了首次对元学习器的标签噪声的影响进行了广泛的分析。我们发现,当元训练被标签噪声影响时,Reptile、iMAML和foMAML的精度 Drop by up to 42% on the Omniglot and CifarFS datasets。为了增强元学习器对标签噪声的抗性,我们提议了两种采样技术: manifold(Man)和批量 manifold(BatMan)。这两种技术可以将噪声标注的超级学习器转换成 semi-supervised 学习器,以提高噪声标注的利用性。我们首先通过扩展来构建 $N$-way $2$-contrastive-shot任务的 manifold 样本,然后通过预训练一个嵌入向量,并在元测试中使用零化来进行分类。我们发现,我们的方法可以有效地减轻元训练标签噪声的影响。即使有60%的标签是错误的,batman 和 man 可以限制元测试精度下降到 ${2.5}$, ${9.4}$, ${1.1}$%点,分别。
Update Monte Carlo tree search (UMCTS) algorithm for heuristic global search of sizing optimization problems for truss structures
methods: reinforcement learning (RL) and Monte Carlo tree search (MCTS) with upper confidence bound (UCB)
results: efficient optimization algorithm with at least ten times faster computation time than branch and bound (BB) method, and stable better solutions than other conventional methods.Here’s the simplified Chinese text:
for: 这篇论文是针对桁架结构的最佳化问题进行研究。
methods: 使用强化学习(RL)和Monte Carlo tree search(MCTS)方法,并且加上最高信心界(UCB)。
results: 提出了一个高效的最佳化算法, computation time 至少比branch and bound(BB)方法快十倍,并且稳定地获得更好的解。Abstract
Sizing optimization of truss structures is a complex computational problem, and the reinforcement learning (RL) is suitable for dealing with multimodal problems without gradient computations. In this paper, a new efficient optimization algorithm called update Monte Carlo tree search (UMCTS) is developed to obtain the appropriate design for truss structures. UMCTS is an RL-based method that combines the novel update process and Monte Carlo tree search (MCTS) with the upper confidence bound (UCB). Update process means that in each round, the optimal cross-sectional area of each member is determined by search tree, and its initial state is the final state in the previous round. In the UMCTS algorithm, an accelerator for the number of selections for member area and iteration number is introduced to reduce the computation time. Moreover, for each state, the average reward is replaced by the best reward collected on the simulation process to determine the optimal solution. The proposed optimization method is examined on some benchmark problems of planar and spatial trusses with discrete sizing variables to demonstrate the efficiency and validity. It is shown that the computation time for the proposed approach is at least ten times faster than the branch and bound (BB) method. The numerical results indicate that the proposed method stably achieves better solution than other conventional methods.
摘要
��erton 优化算法的评估是一个复杂的计算问题,而人工智能学习(RL)是适用于多Modal问题的解决方案。在这篇论文中,一种新的高效优化算法called update Monte Carlo tree search(UMCTS)被开发出来,以获取适当的设计 dla truss 结构。 UMCTS 是一种基于 RL 的方法,它将 Monte Carlo tree search(MCTS)与Upper Confidence Bound(UCB)结合,并在每一轮中,通过搜索树来确定每个成员的最佳跨部分面积。在 UMCTS 算法中,一种加速器 для成员面积和迭代次数的数量被引入,以降低计算时间。此外,每个状态下的平均奖励被 replaced by 最佳在 Simulation 过程中收集的奖励,以确定最佳解决方案。提出的优化方法被应用于一些 benchmark 问题 of planar 和 spatial trusses with discrete sizing variables,以示出其高效性和有效性。结果表明,提出的方法的计算时间至少比 branch and bound(BB)方法快 ten times。 numerics 表明,提出的方法稳定地实现了 better solution than other conventional methods。
Learning Score-based Grasping Primitive for Human-assisting Dexterous Grasping
methods: 本研究提出了一种新的任务 called human-assisting dexterous grasping,旨在训练一个控制 робо手指的策略,以帮助用户 grasping 物品。
results: 实验结果表明,我们提出的方法在比基eline的情况下具有优势,highlighting 用户consciousness 和实际应用性。codes 和演示可以在 “https://sites.google.com/view/graspgf“ 中查看。Abstract
The use of anthropomorphic robotic hands for assisting individuals in situations where human hands may be unavailable or unsuitable has gained significant importance. In this paper, we propose a novel task called human-assisting dexterous grasping that aims to train a policy for controlling a robotic hand's fingers to assist users in grasping objects. Unlike conventional dexterous grasping, this task presents a more complex challenge as the policy needs to adapt to diverse user intentions, in addition to the object's geometry. We address this challenge by proposing an approach consisting of two sub-modules: a hand-object-conditional grasping primitive called Grasping Gradient Field~(GraspGF), and a history-conditional residual policy. GraspGF learns `how' to grasp by estimating the gradient from a success grasping example set, while the residual policy determines `when' and at what speed the grasping action should be executed based on the trajectory history. Experimental results demonstrate the superiority of our proposed method compared to baselines, highlighting the user-awareness and practicality in real-world applications. The codes and demonstrations can be viewed at "https://sites.google.com/view/graspgf".
摘要
人类辅助dexterous grasping任务在帮助用户抓取物品时具有重要 significanc。在这篇论文中,我们提出一种新任务,即人机合作dexterous grasping,旨在训练一个控制机器人手指的策略,以助用户抓取物品。与传统的dexterous grasping不同,这个任务呈现了更加复杂的挑战,因为策略需要适应用户的意图,以及物品的几何学特征。我们解决这个挑战的方法是通过两个子模块:一个手机-物品conditional grasping基本单元 called Grasping Gradient Field(GraspGF),以及一个历史条件的差异策略。GraspGF学习了如何抓取的“如何”,通过成功抓取示例集来估算抓取的梯度,而差异策略则确定了执行抓取动作的时间和速度,根据轨迹历史。实验结果表明我们提出的方法在基eline之上显著超越,highlighting用户意识和实际应用中的实用性。代码和演示可以在“https://sites.google.com/view/graspgf”上查看。
Automatically Estimating the Effort Required to Repay Self-Admitted Technical Debt
paper_authors: Yikun Li, Mohamed Soliman, Paris Avgeriou for: 本研究旨在提高技术债(Technical Debt)的优化和维护效率,特别是自我投诉技术债(Self-Admitted Technical Debt,SATD)的优化。methods: 本研究使用了一个大型的SATD数据集,包括341,740个SATD项来自2,568,728个提交,从1,060个Apache仓库中收集。然后,我们采用了BERT和TextCNN等机器学习方法来自动估算SATD的偿还努力。results: 我们发现不同类型的SATD偿还努力有不同的水平,代码/设计、需求、测试债需要更多的偿还努力,而文档债需要较少的偿还努力。此外,我们还总结了在SATD偿还过程中不同水平的偿还努力关键词。本研究的贡献可以帮助优化技术债的优化和维护效率,最终为软件开发和维护带来利益。Abstract
Technical debt refers to the consequences of sub-optimal decisions made during software development that prioritize short-term benefits over long-term maintainability. Self-Admitted Technical Debt (SATD) is a specific form of technical debt, explicitly documented by developers within software artifacts such as source code comments and commit messages. As SATD can hinder software development and maintenance, it is crucial to address and prioritize it effectively. However, current methodologies lack the ability to automatically estimate the repayment effort of SATD based on its textual descriptions. To address this limitation, we propose a novel approach for automatically estimating SATD repayment effort, utilizing a comprehensive dataset comprising 341,740 SATD items from 2,568,728 commits across 1,060 Apache repositories. Our findings show that different types of SATD require varying levels of repayment effort, with code/design, requirement, and test debt demanding greater effort compared to non-SATD items, while documentation debt requires less. We introduce and evaluate machine learning methodologies, particularly BERT and TextCNN, which outperforms classic machine learning methods and the naive baseline in estimating repayment effort. Additionally, we summarize keywords associated with varying levels of repayment effort that occur during SATD repayment. Our contributions aim to enhance the prioritization of SATD repayment effort and resource allocation efficiency, ultimately benefiting software development and maintainability.
摘要
Molecular Conformation Generation via Shifting Scores
results: 我们对分子数据集进行实验,并证明了我们的方法在现有方法的基础上具有优势。Abstract
Molecular conformation generation, a critical aspect of computational chemistry, involves producing the three-dimensional conformer geometry for a given molecule. Generating molecular conformation via diffusion requires learning to reverse a noising process. Diffusion on inter-atomic distances instead of conformation preserves SE(3)-equivalence and shows superior performance compared to alternative techniques, whereas related generative modelings are predominantly based upon heuristical assumptions. In response to this, we propose a novel molecular conformation generation approach driven by the observation that the disintegration of a molecule can be viewed as casting increasing force fields to its composing atoms, such that the distribution of the change of inter-atomic distance shifts from Gaussian to Maxwell-Boltzmann distribution. The corresponding generative modeling ensures a feasible inter-atomic distance geometry and exhibits time reversibility. Experimental results on molecular datasets demonstrate the advantages of the proposed shifting distribution compared to the state-of-the-art.
摘要
分子形态生成,计算化学中一项关键任务,涉及生成给定分子的三维结构均衡。通过扩散来生成分子形态,需要学习反向噪声过程。与传统技术不同,我们的方法基于分子裂解的观察,即将分子中的原子受到增加的力场影响,因此分子中间的距离分布从 Gaussian 转变为 Maxwell-Boltzmann 分布。这种生成模型保证了原子间距离的可行性,并且具有时间逆向性。对分子数据进行实验,我们发现了我们提出的分布Shift的优势,比 estado-of-the-art 更好。
results: 这个研究的结果显示,使用 DSLOT-NN 技术可以节省大量电力和能源,并且具有较短的循环时间和较高的 OPS 每瓦特。 compared with state-of-the-art Stripes 的性能指标。Abstract
We propose a Digit-Serial Left-tO-righT (DSLOT) arithmetic based processing technique called DSLOT-NN with aim to accelerate inference of the convolution operation in the deep neural networks (DNNs). The proposed work has the ability to assess and terminate the ineffective convolutions which results in massive power and energy savings. The processing engine is comprised of low-latency most-significant-digit-first (MSDF) (also called online) multipliers and adders that processes data from left-to-right, allowing the execution of subsequent operations in digit-pipelined manner. Use of online operators eliminates the need for the development of complex mechanism of identifying the negative activation, as the output with highest weight value is generated first, and the sign of the result can be identified as soon as first non-zero digit is generated. The precision of the online operators can be tuned at run-time, making them extremely useful in situations where accuracy can be compromised for power and energy savings. The proposed design has been implemented on Xilinx Virtex-7 FPGA and is compared with state-of-the-art Stripes on various performance metrics. The results show the proposed design presents power savings, has shorter cycle time, and approximately 50% higher OPS per watt.
摘要
Our processing engine consists of low-latency most-significant-digit-first (MSDF) multipliers and adders that process data from left to right, allowing for digit-pipelined execution. This eliminates the need for complex mechanisms to identify negative activation, as the output with the highest weight value is generated first, and the sign of the result can be identified as soon as the first non-zero digit is generated.The precision of our online operators can be tuned at runtime, making them highly versatile in situations where accuracy can be compromised for power and energy savings. We have implemented our design on Xilinx Virtex-7 FPGA and compared it with state-of-the-art Stripes on various performance metrics. Our results show that the proposed design achieves power savings, has a shorter cycle time, and offers approximately 50% higher operations per second per watt.
paper_authors: Anthony Cioppa, Silvio Giancola, Vladimir Somers, Floriane Magera, Xin Zhou, Hassan Mkhallati, Adrien Deliège, Jan Held, Carlos Hinojosa, Amir M. Mansourian, Pierre Miralles, Olivier Barnich, Christophe De Vleeschouwer, Alexandre Alahi, Bernard Ghanem, Marc Van Droogenbroeck, Abdullah Kamal, Adrien Maglo, Albert Clapés, Amr Abdelaziz, Artur Xarles, Astrid Orcesi, Atom Scott, Bin Liu, Byoungkwon Lim, Chen Chen, Fabian Deuser, Feng Yan, Fufu Yu, Gal Shitrit, Guanshuo Wang, Gyusik Choi, Hankyul Kim, Hao Guo, Hasby Fahrudin, Hidenari Koguchi, Håkan Ardö, Ibrahim Salah, Ido Yerushalmy, Iftikar Muhammad, Ikuma Uchida, Ishay Be’ery, Jaonary Rabarisoa, Jeongae Lee, Jiajun Fu, Jianqin Yin, Jinghang Xu, Jongho Nang, Julien Denize, Junjie Li, Junpei Zhang, Juntae Kim, Kamil Synowiec, Kenji Kobayashi, Kexin Zhang, Konrad Habel, Kota Nakajima, Licheng Jiao, Lin Ma, Lizhi Wang, Luping Wang, Menglong Li, Mengying Zhou, Mohamed Nasr, Mohamed Abdelwahed, Mykola Liashuha, Nikolay Falaleev, Norbert Oswald, Qiong Jia, Quoc-Cuong Pham, Ran Song, Romain Hérault, Rui Peng, Ruilong Chen, Ruixuan Liu, Ruslan Baikulov, Ryuto Fukushima, Sergio Escalera, Seungcheon Lee, Shimin Chen, Shouhong Ding, Taiga Someya, Thomas B. Moeslund, Tianjiao Li, Wei Shen, Wei Zhang, Wei Li, Wei Dai, Weixin Luo, Wending Zhao, Wenjie Zhang, Xinquan Yang, Yanbiao Ma, Yeeun Joo, Yingsen Zeng, Yiyang Gan, Yongqiang Zhu, Yujie Zhong, Zheng Ruan, Zhiheng Li, Zhijian Huang, Ziyu Meng for:* 这篇论文是为了描述2023年度的SoccerNet视频理解挑战(第三届),这些挑战包括七个视觉任务,分为三个主题。methods:* 这篇论文使用了多种视觉技术,包括动作检测、球体变化检测、笔记录和Camera calibration等。results:* 这篇论文描述了2023年度的SoccerNet视频理解挑战,包括七个视觉任务的结果,其中有三个任务是新添加的,一个任务得到了更多的数据和注释,另一个任务改为着眼点到终端方法。Abstract
The SoccerNet 2023 challenges were the third annual video understanding challenges organized by the SoccerNet team. For this third edition, the challenges were composed of seven vision-based tasks split into three main themes. The first theme, broadcast video understanding, is composed of three high-level tasks related to describing events occurring in the video broadcasts: (1) action spotting, focusing on retrieving all timestamps related to global actions in soccer, (2) ball action spotting, focusing on retrieving all timestamps related to the soccer ball change of state, and (3) dense video captioning, focusing on describing the broadcast with natural language and anchored timestamps. The second theme, field understanding, relates to the single task of (4) camera calibration, focusing on retrieving the intrinsic and extrinsic camera parameters from images. The third and last theme, player understanding, is composed of three low-level tasks related to extracting information about the players: (5) re-identification, focusing on retrieving the same players across multiple views, (6) multiple object tracking, focusing on tracking players and the ball through unedited video streams, and (7) jersey number recognition, focusing on recognizing the jersey number of players from tracklets. Compared to the previous editions of the SoccerNet challenges, tasks (2-3-7) are novel, including new annotations and data, task (4) was enhanced with more data and annotations, and task (6) now focuses on end-to-end approaches. More information on the tasks, challenges, and leaderboards are available on https://www.soccer-net.org. Baselines and development kits can be found on https://github.com/SoccerNet.
摘要
occerNet 2023 挑战是第三届视频理解挑战,由 SoccerNet 团队组织。这一第三届挑战包括七个视觉任务,分为三个主题。第一主题是广播视频理解,包括三个高级任务:(1)动作搜索,关注从广播视频中检索全局动作的时间戳;(2)球动作搜索,关注从广播视频中检索球的状态变化时间戳;(3)稠密视频描述,关注使用自然语言描述广播视频,并附加时间戳。第二主题是场地理解,单个任务为(4)摄像头协调,关注从图像中提取摄像头参数。第三主题是球员理解,包括三个低级任务:(5)重识别,关注在多视图中重新识别同一名球员;(6)多对tracking,关注在未编辑视频流中跟踪球员和球;(7)球衣号码识别,关注从跟踪片中识别球员的球衣号码。相比前两届 SoccerNet 挑战,任务(2-3-7)是新的,包括新的注释和数据,任务(4)增加了更多的数据和注释,任务(6)现在专注于终端方法。更多关于任务、挑战和排名的信息可以在 上获取。基础和开发集可以在 上找到。
Life-inspired Interoceptive Artificial Intelligence for Autonomous and Adaptive Agents
results: 本研究提出了一种新的感知方法,可以帮助建立自适应和自主的人工智能代理人,并结合了遗传学、增强学习和神经科学的新进展。Abstract
Building autonomous --- i.e., choosing goals based on one's needs -- and adaptive -- i.e., surviving in ever-changing environments -- agents has been a holy grail of artificial intelligence (AI). A living organism is a prime example of such an agent, offering important lessons about adaptive autonomy. Here, we focus on interoception, a process of monitoring one's internal environment to keep it within certain bounds, which underwrites the survival of an organism. To develop AI with interoception, we need to factorize the state variables representing internal environments from external environments and adopt life-inspired mathematical properties of internal environment states. This paper offers a new perspective on how interoception can help build autonomous and adaptive agents by integrating the legacy of cybernetics with recent advances in theories of life, reinforcement learning, and neuroscience.
摘要
建立自主---即根据自己需求选择目标---以及适应---即在不断变化的环境中生存---的人工智能(AI)是人工智能的圣杯。生物体是这种代理人的一个好例子,它们提供了关键的适应自主性教训。在这里,我们关注内部环境监测---即保持内部环境在某些范围内---这一过程,这对生物体的存亡起到了关键作用。为建立具有内部监测能力的AI,我们需要将内部环境状态变量分离于外部环境状态变量,并采用生命力学中的内部环境状态生命力学性质。这篇论文提供了一种新的视角,即通过结合人工智能的启蒙、生命力学、奖励学习和神经科学的进步,实现内部监测能力的AI。
Goal Space Abstraction in Hierarchical Reinforcement Learning via Reachability Analysis
results: 在导航任务上,通过逐渐学习表示和策略,实现了数据效率和可贸转性。Abstract
Open-ended learning benefits immensely from the use of symbolic methods for goal representation as they offer ways to structure knowledge for efficient and transferable learning. However, the existing Hierarchical Reinforcement Learning (HRL) approaches relying on symbolic reasoning are often limited as they require a manual goal representation. The challenge in autonomously discovering a symbolic goal representation is that it must preserve critical information, such as the environment dynamics. In this work, we propose a developmental mechanism for subgoal discovery via an emergent representation that abstracts (i.e., groups together) sets of environment states that have similar roles in the task. We create a HRL algorithm that gradually learns this representation along with the policies and evaluate it on navigation tasks to show the learned representation is interpretable and results in data efficiency.
摘要
开放式学习受益于使用象征方法表示目标,因为它们可以为高效和可传递的学习提供结构知识。然而,现有的层次强化学习(HRL)方法,它们通常需要手动设定目标表示,这会带来一定的限制。挑战在自动发现象征目标表示时是保留环境动力学信息。在这种工作中,我们提出一种发展机制,通过观察环境状态的集合,自动找出子目标。我们创建了一种基于HRL算法,逐渐学习这种表示,并评估其在导航任务中的效果,结果表明学习的表示是可解释的,并且具有数据效率。
Knowledge-Guided Short-Context Action Anticipation in Human-Centric Videos
results: 在 Breakfast 和 50Salads 两个标准 datasets 上,我们的方法与当前状态的方法相比,在长期行为预测中使用短视频段的情况下提高了9%。Abstract
This work focuses on anticipating long-term human actions, particularly using short video segments, which can speed up editing workflows through improved suggestions while fostering creativity by suggesting narratives. To this end, we imbue a transformer network with a symbolic knowledge graph for action anticipation in video segments by boosting certain aspects of the transformer's attention mechanism at run-time. Demonstrated on two benchmark datasets, Breakfast and 50Salads, our approach outperforms current state-of-the-art methods for long-term action anticipation using short video context by up to 9%.
摘要
Answering Subjective Induction Questions on Products by Summarizing Multi-sources Multi-viewpoints Knowledge
results: 由于这个新任务没有相关的评估标准集,因此 constructed a large-scale dataset named SupQA,包含48,352个样本和15种产品领域。评估结果表明了我们的方法的效果。Abstract
This paper proposes a new task in the field of Answering Subjective Induction Question on Products (SUBJPQA). The answer to this kind of question is non-unique, but can be interpreted from many perspectives. For example, the answer to 'whether the phone is heavy' has a variety of different viewpoints. A satisfied answer should be able to summarize these subjective opinions from multiple sources and provide objective knowledge, such as the weight of a phone. That is quite different from the traditional QA task, in which the answer to a factoid question is unique and can be found from a single data source. To address this new task, we propose a three-steps method. We first retrieve all answer-related clues from multiple knowledge sources on facts and opinions. The implicit commonsense facts are also collected to supplement the necessary but missing contexts. We then capture their relevance with the questions by interactive attention. Next, we design a reinforcement-based summarizer to aggregate all these knowledgeable clues. Based on a template-controlled decoder, we can output a comprehensive and multi-perspective answer. Due to the lack of a relevant evaluated benchmark set for the new task, we construct a large-scale dataset, named SupQA, consisting of 48,352 samples across 15 product domains. Evaluation results show the effectiveness of our approach.
摘要
Here's the translation in Simplified Chinese:这篇论文提出了一个新的任务,即答え问题 Answering Subjective Induction Questions on Products (SUBJPQA)。这类问题的答案不唯一,可以从多个角度解释。例如,问题 "手机是否重" 有多种不同的看法。一个满意的答案应该能够汇集多个来源的主观意见,并提供对象知识,如手机的重量。这与传统的 QA 任务不同,传统的答案是唯一的,可以从单个数据源中找到。为 Addressing this new task, the authors propose a three-step method. First, they retrieve all answer-related clues from multiple knowledge sources on facts and opinions. They also collect implicit commonsense facts to supplement necessary but missing contexts. Next, they capture the relevance of the clues with the questions using interactive attention. Finally, they design a reinforcement-based summarizer to aggregate all the knowledgeable clues. Based on a template-controlled decoder, they can output a comprehensive and multi-perspective answer. Due to the lack of a relevant evaluated benchmark set for the new task, the authors construct a large-scale dataset named SupQA, consisting of 48,352 samples across 15 product domains. The evaluation results show the effectiveness of their approach.
MatSciML: A Broad, Multi-Task Benchmark for Solid-State Materials Modeling
paper_authors: Kin Long Kelvin Lee, Carmelo Gonzales, Marcel Nassar, Matthew Spellings, Mikhail Galkin, Santiago Miret
For: 该论文目的是提出一个新的benchmark,用于模型化固体材料科学领域中的机器学习方法。* Methods: 该论文使用的方法包括使用开源数据集,包括大规模数据集如OpenCatalyst、OQMD、NOMAD、Carolina Materials Database和Materials Project等,以及 simulated energies、atomic forces、材料带隔和 Space group classification数据等。* Results: 该论文通过使用多个数据集进行联合预测,实现了对固体材料的多任务学习和多数据集学习。通过 MatSci ML benchmark,研究人员可以评估不同的 graf neural network 和 equivariant point cloud network 在多种学习场景中的性能。Abstract
We propose MatSci ML, a novel benchmark for modeling MATerials SCIence using Machine Learning (MatSci ML) methods focused on solid-state materials with periodic crystal structures. Applying machine learning methods to solid-state materials is a nascent field with substantial fragmentation largely driven by the great variety of datasets used to develop machine learning models. This fragmentation makes comparing the performance and generalizability of different methods difficult, thereby hindering overall research progress in the field. Building on top of open-source datasets, including large-scale datasets like the OpenCatalyst, OQMD, NOMAD, the Carolina Materials Database, and Materials Project, the MatSci ML benchmark provides a diverse set of materials systems and properties data for model training and evaluation, including simulated energies, atomic forces, material bandgaps, as well as classification data for crystal symmetries via space groups. The diversity of properties in MatSci ML makes the implementation and evaluation of multi-task learning algorithms for solid-state materials possible, while the diversity of datasets facilitates the development of new, more generalized algorithms and methods across multiple datasets. In the multi-dataset learning setting, MatSci ML enables researchers to combine observations from multiple datasets to perform joint prediction of common properties, such as energy and forces. Using MatSci ML, we evaluate the performance of different graph neural networks and equivariant point cloud networks on several benchmark tasks spanning single task, multitask, and multi-data learning scenarios. Our open-source code is available at https://github.com/IntelLabs/matsciml.
摘要
我们提出了MatSci ML,一个新的测试基准 для模型化材料科学使用机器学习方法(MatSci ML),专注于固体材料 periodic crystal structures 的模型。采用机器学习方法对固体材料是一个新兴领域,受到各种数据集的不同所致,这种分化使得对不同方法的比较和总体进展困难,从而阻碍了材料科学领域的研究进步。基于开源数据集,包括大规模数据集如OpenCatalyst、OQMD、NOMAD、Carolina Materials Database和Materials Project,MatSci ML 提供了多种材料系统和性能数据 для模型训练和评估,包括模拟能量、原子力、材料带隙、以及通过空间群来分类的晶体结构数据。MatSci ML 中的多种属性多样性使得实现和评估多任务学习算法 для固体材料 possible,而多种数据集的多样性促进了新的、更一般的算法和方法的开发。在多个数据集学习 Setting,MatSci ML 允许研究人员将多个数据集中的观察结果合并进行共同预测常见的属性,如能量和力。使用 MatSci ML,我们评估了不同的图 neural networks 和平衡点云网络在多个benchmark任务中的表现,涵盖单任务、多任务和多数据学习场景。我们的开源代码可以在https://github.com/IntelLabs/matsciml 中找到。
Combining deep learning and street view imagery to map smallholder crop types
results: 研究发现,在泰国,该方法可以创建一个国家范围内的全面的rice、manioc、maize和甘蔗类型地图,其准确率达93%。这种方法可以在全球各地,尤其是小农场地区,提供一种快速、便宜、高准确率的农作物类型地图生成方式。Abstract
Accurate crop type maps are an essential source of information for monitoring yield progress at scale, projecting global crop production, and planning effective policies. To date, however, crop type maps remain challenging to create in low and middle-income countries due to a lack of ground truth labels for training machine learning models. Field surveys are the gold standard in terms of accuracy but require an often-prohibitively large amount of time, money, and statistical capacity. In recent years, street-level imagery, such as Google Street View, KartaView, and Mapillary, has become available around the world. Such imagery contains rich information about crop types grown at particular locations and times. In this work, we develop an automated system to generate crop type ground references using deep learning and Google Street View imagery. The method efficiently curates a set of street view images containing crop fields, trains a model to predict crop type by utilizing weakly-labelled images from disparate out-of-domain sources, and combines predicted labels with remote sensing time series to create a wall-to-wall crop type map. We show that, in Thailand, the resulting country-wide map of rice, cassava, maize, and sugarcane achieves an accuracy of 93%. As the availability of roadside imagery expands, our pipeline provides a way to map crop types at scale around the globe, especially in underserved smallholder regions.
摘要
准确的作物类别地图是考古规模监测作物进步、全球作物产量预测和制定有效政策的重要来源。然而,在LOW和中等收入国家,创建作物类别地图仍然是一项挑战,因为缺乏地面 truth标签用于训练机器学习模型。 Field surveys 是精度最高的标准,但它们需要大量的时间、金钱和统计资源。在最近几年,街道级别的图像,如Google Street View、KartaView和Mapillary,在全球变得可用。这些图像包含特定地点和时间的作物种植的详细信息。在这个工作中,我们开发了一个自动化的系统,使用深度学习和Google Street View图像来生成作物类别地标。该方法可以效率地筛选出包含作物田的街道图像,使用弱 labels 的图像从不同的域外来源来训练模型,并将预测标签与远程感知时序列合并以创建墙到墙的作物类别地图。我们显示,在泰国,得到的国家范围内的rice、 Cassava、maize和sugarcane的地图达到93%的准确率。随着路边图像的可用性扩展,我们的管道可以在全球范围内地图作物类别,特别是在小holder地区。
Frequency-Aware Masked Autoencoders for Multimodal Pretraining on Biosignals
results: 对一系列的转移试验,我们获得了平均提高5.5%的分类精度,至于之前的状态艺术。此外,我们还证明了该方法在模式匹配情况下具有强大的实用性,包括预期不确定的模式掉包或替换。Abstract
Leveraging multimodal information from biosignals is vital for building a comprehensive representation of people's physical and mental states. However, multimodal biosignals often exhibit substantial distributional shifts between pretraining and inference datasets, stemming from changes in task specification or variations in modality compositions. To achieve effective pretraining in the presence of potential distributional shifts, we propose a frequency-aware masked autoencoder ($\texttt{bio}$FAME) that learns to parameterize the representation of biosignals in the frequency space. $\texttt{bio}$FAME incorporates a frequency-aware transformer, which leverages a fixed-size Fourier-based operator for global token mixing, independent of the length and sampling rate of inputs. To maintain the frequency components within each input channel, we further employ a frequency-maintain pretraining strategy that performs masked autoencoding in the latent space. The resulting architecture effectively utilizes multimodal information during pretraining, and can be seamlessly adapted to diverse tasks and modalities at test time, regardless of input size and order. We evaluated our approach on a diverse set of transfer experiments on unimodal time series, achieving an average of $\uparrow$5.5% improvement in classification accuracy over the previous state-of-the-art. Furthermore, we demonstrated that our architecture is robust in modality mismatch scenarios, including unpredicted modality dropout or substitution, proving its practical utility in real-world applications. Code will be available soon.
摘要
利用多Modal信息从生物信号是建立全面人体和心理状态的关键。然而,多Modal生物信号经常会在预训练和测试集数据中出现重大的分布变化,这可能是因为任务规定的变化或多Modal组合的变化。为了在可能存在的分布变化情况下进行有效的预训练,我们提议一种频率意识的隐藏式自动编码器($\texttt{bio}$FAME),该模型学习在频率空间中 parameterize 生物信号的表示。$\texttt{bio}$FAME 包含一种频率意识的转换器,该转换器通过固定大小的干扰函数来进行全token混合,不виси于输入的长度和采样率。为了保持每个输入通道中的频率组件,我们还采用了一种保持频率的预训练策略,该策略在幽latent空间进行隐藏式自动编码。这种架构可以有效利用多Modal信息进行预训练,并可以在测试时轻松适应多种任务和多Modal,无论输入的大小和顺序。我们对一组多Modal时间序列转换任务进行了多种转换试验,实现了平均提高5.5%的分类精度,胜过之前的状态艺。此外,我们还证明了我们的架构在模态匹配情况下具有实用性,包括预测不可预知的模态掉落或替换,证明了它在实际应用中的实用性。代码即将上传。
results: 实验结果显示,提出的算法可以实现高效的分类和特征选择,并且比较低的计算成本。Abstract
Sparse logistic regression aims to perform classification and feature selection simultaneously for high-dimensional data. Although many studies have been done to solve $\ell_1$-regularized logistic regression, there is no equivalently abundant literature about solving sparse logistic regression associated with nonconvex penalties. In this paper, we propose to solve $\ell_1$-regularized sparse logistic regression and some nonconvex penalties-regularized sparse logistic regression, when the nonconvex penalties satisfy some prerequisites, with similar optimization frameworks. In the proposed optimization frameworks, we utilize different line search criteria to guarantee good convergence performance for different regularization terms. Empirical experiments on binary classification tasks with real-world datasets demonstrate our proposed algorithms are capable of performing classification and feature selection effectively with a lower computational cost.
摘要
“稀疏逻辑回传数据类别和特征选择同时进行分类。处理 $\ell_1$-规定逻辑回传数据的研究已经很多,但是关于非对称责任逻辑回传数据的研究却没有相应的实践。本文提出了解决 $\ell_1$-规定稀疏逻辑回传数据和一些非对称责任逻辑回传数据的优化框架。在提议的优化框架中,我们利用不同的搜索条件来保证不同的规定Terms的优化表现良好。实验结果显示我们的提议算法在实际数据上能够有效地进行分类和特征选择,并且比较低的计算成本。”Note that the translation is in Simplified Chinese, which is one of the two standard forms of Chinese writing. The other form is Traditional Chinese.
A Survey of Hallucination in Large Foundation Models
results: 本文提供了关于幻觉在LFM中的全面检查和解决方案。Abstract
Hallucination in a foundation model (FM) refers to the generation of content that strays from factual reality or includes fabricated information. This survey paper provides an extensive overview of recent efforts that aim to identify, elucidate, and tackle the problem of hallucination, with a particular focus on ``Large'' Foundation Models (LFMs). The paper classifies various types of hallucination phenomena that are specific to LFMs and establishes evaluation criteria for assessing the extent of hallucination. It also examines existing strategies for mitigating hallucination in LFMs and discusses potential directions for future research in this area. Essentially, the paper offers a comprehensive examination of the challenges and solutions related to hallucination in LFMs.
摘要
幻想在基础模型(FM)中指的是生成不符事实或包含虚假信息的内容。这篇评论文章提供了对最近努力防止幻想的全面回顾,尤其是对大型基础模型(LFMs)的幻想问题。文章分类了LFMs中幻想现象的不同类型,并设置了评估幻想程度的评价标准。它还检查了现有的防止幻想策略,并讨论了未来研究的可能性。简言之,文章提供了幻想在LFMs中的挑战和解决方案的全面检查。
SAGE: Structured Attribute Value Generation for Billion-Scale Product Catalogs
paper_authors: Athanasios N. Nikolakopoulos, Swati Kaul, Siva Karthik Gade, Bella Dubrov, Umit Batur, Suleiman Ali Khan
for: 这篇论文旨在为电商平台中的产品 attribute 值预测做出提高。
methods: 该论文提出了一种新的 attribute-value 预测方法,基于 Seq2Seq 概率模型,可以适应不同语言、产品类型和目标 attribute。该方法可以预测 attribute 值,even when such values are mentioned implicitly using periphrastic language, or not-at-all-as is the case for common-sense defaults。
results: 实验表明,该方法可以有效地预测 attribute 值,并且比之前的方法更高效。此外,该方法还可以预测 attribute 值在零例教程下,从而减少需要训练的数据量。Abstract
We introduce SAGE; a Generative LLM for inferring attribute values for products across world-wide e-Commerce catalogs. We introduce a novel formulation of the attribute-value prediction problem as a Seq2Seq summarization task, across languages, product types and target attributes. Our novel modeling approach lifts the restriction of predicting attribute values within a pre-specified set of choices, as well as, the requirement that the sought attribute values need to be explicitly mentioned in the text. SAGE can infer attribute values even when such values are mentioned implicitly using periphrastic language, or not-at-all-as is the case for common-sense defaults. Additionally, SAGE is capable of predicting whether an attribute is inapplicable for the product at hand, or non-obtainable from the available information. SAGE is the first method able to tackle all aspects of the attribute-value-prediction task as they arise in practical settings in e-Commerce catalogs. A comprehensive set of experiments demonstrates the effectiveness of the proposed approach, as well as, its superiority against state-of-the-art competing alternatives. Moreover, our experiments highlight SAGE's ability to tackle the task of predicting attribute values in zero-shot setting; thereby, opening up opportunities for significantly reducing the overall number of labeled examples required for training.
摘要
我们介绍SAGE;一种生成式LLM用于在全球电子商务目录中预测产品的属性值。我们提出了一种新的属性值预测问题的表述方式,即将属性值预测问题转化为一个Seq2Seq摘要任务,跨语言、产品类型和目标属性。我们的新的模型方法解决了预测属性值时的限制,包括预测属性值必须在先Specified的选择范围内,以及文本中必须直接提到属性值。SAGE可以在文本中推断属性值,即使这些值是使用射igrated语言表达或者不直接提到。此外,SAGE还可以预测产品上的属性是否存在或者可以从可用信息中获取。SAGE是第一种能够在实际设置中解决所有属性值预测问题的方法。一系列实验证明了我们的方法的有效性和其对State-of-the-art的竞争方法的超越性。此外,我们的实验还 highlight了SAGE在零shot设置下预测属性值的能力,从而开启了减少训练数据的机会。
Stochastic LLMs do not Understand Language: Towards Symbolic, Explainable and Ontologically Based LLMs
results: 本论文的研究结果表明,LLMS 存在许多限制和不足,如不能准确地提供实际信息,因为它们所有的文本都被视为一样有价值的;此外,LLMS 的知识也会被归类为微特征(weights)中,无法归纳到有意义的概念中;此外,LLMS 在某些语言上也会出现偏差。Abstract
In our opinion the exuberance surrounding the relative success of data-driven large language models (LLMs) is slightly misguided and for several reasons (i) LLMs cannot be relied upon for factual information since for LLMs all ingested text (factual or non-factual) was created equal; (ii) due to their subsymbolic na-ture, whatever 'knowledge' these models acquire about language will always be buried in billions of microfeatures (weights), none of which is meaningful on its own; and (iii) LLMs will often fail to make the correct inferences in several linguistic contexts (e.g., nominal compounds, copredication, quantifier scope ambi-guities, intensional contexts. Since we believe the relative success of data-driven large language models (LLMs) is not a reflection on the symbolic vs. subsymbol-ic debate but a reflection on applying the successful strategy of a bottom-up reverse engineering of language at scale, we suggest in this paper applying the effective bottom-up strategy in a symbolic setting resulting in symbolic, explainable, and ontologically grounded language models.
摘要
我们认为大量数据驱动的大型自然语言模型(LLM)的热情是有所误导的,主要有以下几点原因:1. LLM不能依赖于事实信息,因为它们对所有的文本(事实或非事实)都是一样的;2.由于它们的非符号性质,LMM所获得的语言知识都将被归类为微特征( weights),这些微特征都是无法单独解释的;3. LLM在某些语言上将不能做出正确的推理,如名词复合、共Predication、量词范围不确定性、意思上的context。因为我们认为数据驱动大型自然语言模型(LLM)的成功不是符号vs非符号的问题,而是应用大规模的底层逆工程Strategy的成功,因此在这篇论文中,我们建议在符号环境中应用有效的底层逆工程策略,从而获得符号、可解释、基于ontology的语言模型。
ACT: Empowering Decision Transformer with Dynamic Programming via Advantage Conditioning
results: 测试结果表明,通过借鉴动态计划的能力,ACT能够在环境杂化下表现出效果很好,与基eline方法在各种标准准则上表现出色。此外,我们通过减少分析来检验ACT的不同设计方案的影响。Abstract
Decision Transformer (DT), which employs expressive sequence modeling techniques to perform action generation, has emerged as a promising approach to offline policy optimization. However, DT generates actions conditioned on a desired future return, which is known to bear some weaknesses such as the susceptibility to environmental stochasticity. To overcome DT's weaknesses, we propose to empower DT with dynamic programming. Our method comprises three steps. First, we employ in-sample value iteration to obtain approximated value functions, which involves dynamic programming over the MDP structure. Second, we evaluate action quality in context with estimated advantages. We introduce two types of advantage estimators, IAE and GAE, which are suitable for different tasks. Third, we train an Advantage-Conditioned Transformer (ACT) to generate actions conditioned on the estimated advantages. Finally, during testing, ACT generates actions conditioned on a desired advantage. Our evaluation results validate that, by leveraging the power of dynamic programming, ACT demonstrates effective trajectory stitching and robust action generation in spite of the environmental stochasticity, outperforming baseline methods across various benchmarks. Additionally, we conduct an in-depth analysis of ACT's various design choices through ablation studies.
摘要
Quality-Agnostic Deepfake Detection with Intra-model Collaborative Learning
methods: 我们使用一个名为 QAD 的quality-agnostic deepfake detection方法,通过观察通用错误预期函数的上限,将不同质量水平的图像间的中间表现dependency最大化。
results: 我们的 QAD 模型在七个受测 dataset 上进行了广泛的实验,与先前的 SOTA benchmark 相比,表现出色。Abstract
Deepfake has recently raised a plethora of societal concerns over its possible security threats and dissemination of fake information. Much research on deepfake detection has been undertaken. However, detecting low quality as well as simultaneously detecting different qualities of deepfakes still remains a grave challenge. Most SOTA approaches are limited by using a single specific model for detecting certain deepfake video quality type. When constructing multiple models with prior information about video quality, this kind of strategy incurs significant computational cost, as well as model and training data overhead. Further, it cannot be scalable and practical to deploy in real-world settings. In this work, we propose a universal intra-model collaborative learning framework to enable the effective and simultaneous detection of different quality of deepfakes. That is, our approach is the quality-agnostic deepfake detection method, dubbed QAD . In particular, by observing the upper bound of general error expectation, we maximize the dependency between intermediate representations of images from different quality levels via Hilbert-Schmidt Independence Criterion. In addition, an Adversarial Weight Perturbation module is carefully devised to enable the model to be more robust against image corruption while boosting the overall model's performance. Extensive experiments over seven popular deepfake datasets demonstrate the superiority of our QAD model over prior SOTA benchmarks.
摘要
为了解决这个问题,我们提出了一种通用内部协作学习框架,以实现不同质量深圳投影的同时检测。具体来说,我们通过观察总体错误预期的上限来 maximize 图像不同质量水平之间的依赖关系,使用希尔伯特-施密特独立性 критерион。此外,我们还妥善地设计了一个对抗权重偏移模块,以使模型更加抗 resize 而增强整体模型的表现。我们在七个流行的深圳投影数据集上进行了广泛的实验,并证明了我们的 QAD 模型在先前的 SOTA 标准之上表现出色。
Comparing Llama-2 and GPT-3 LLMs for HPC kernels generation
results: Llama-2模型在生成kernels时显示了竞争力或更高的准确率,而Copilot生成的代码更可靠但 menos优化,相反,Llama-2生成的代码虽然更不可靠但更高效当 corrrect。Abstract
We evaluate the use of the open-source Llama-2 model for generating well-known, high-performance computing kernels (e.g., AXPY, GEMV, GEMM) on different parallel programming models and languages (e.g., C++: OpenMP, OpenMP Offload, OpenACC, CUDA, HIP; Fortran: OpenMP, OpenMP Offload, OpenACC; Python: numpy, Numba, pyCUDA, cuPy; and Julia: Threads, CUDA.jl, AMDGPU.jl). We built upon our previous work that is based on the OpenAI Codex, which is a descendant of GPT-3, to generate similar kernels with simple prompts via GitHub Copilot. Our goal is to compare the accuracy of Llama-2 and our original GPT-3 baseline by using a similar metric. Llama-2 has a simplified model that shows competitive or even superior accuracy. We also report on the differences between these foundational large language models as generative AI continues to redefine human-computer interactions. Overall, Copilot generates codes that are more reliable but less optimized, whereas codes generated by Llama-2 are less reliable but more optimized when correct.
摘要
我们评估了基于开源的Llama-2模型来生成知名度高、性能优秀的计算器件(例如AXPY、GEMV、GEMM)在不同的并行编程模型和语言(例如C++:OpenMP、OpenMP Offload、OpenACC、CUDA、HIP;Fortran:OpenMP、OpenMP Offload、OpenACC;Python:numpy、Numba、pyCUDA、cuPy;Julia:Threads、CUDA.jl、AMDGPU.jl)上。我们基于我们之前的工作,这是基于OpenAI Codex,这是GPT-3的后代,通过GitHub Copilot来生成类似的kernels。我们的目标是将Llama-2和我们原始的GPT-3基线相比,使用相似的度量。Llama-2有简化的模型,并显示了竞争或更高的准确性。我们还报告了这些基础的大语言模型在人机交互中如何不断定义。总之,Copilot生成的代码更可靠但较少优化,而Llama-2生成的代码虽然较不可靠但当正确时更高效。
Strategic Behavior of Large Language Models: Game Structure vs. Contextual Framing
results: 研究发现,GPT-3.5受到上下文框架的影响很大,但是它对抽象的策略理解能力有限。GPT-4和LLaMa-2根据游戏结构和上下文进行策略调整,但LLaMa-2表现出更加细腻的游戏机制理解。这些结果 highlights current LLMs的限制和不同的聪明程度,警告不要在需要复杂策略理解的任务中不经过训练使用LLMs。Abstract
This paper investigates the strategic decision-making capabilities of three Large Language Models (LLMs): GPT-3.5, GPT-4, and LLaMa-2, within the framework of game theory. Utilizing four canonical two-player games -- Prisoner's Dilemma, Stag Hunt, Snowdrift, and Prisoner's Delight -- we explore how these models navigate social dilemmas, situations where players can either cooperate for a collective benefit or defect for individual gain. Crucially, we extend our analysis to examine the role of contextual framing, such as diplomatic relations or casual friendships, in shaping the models' decisions. Our findings reveal a complex landscape: while GPT-3.5 is highly sensitive to contextual framing, it shows limited ability to engage in abstract strategic reasoning. Both GPT-4 and LLaMa-2 adjust their strategies based on game structure and context, but LLaMa-2 exhibits a more nuanced understanding of the games' underlying mechanics. These results highlight the current limitations and varied proficiencies of LLMs in strategic decision-making, cautioning against their unqualified use in tasks requiring complex strategic reasoning.
摘要
The findings show that GPT-3.5 is highly sensitive to contextual framing, but has limited ability to engage in abstract strategic reasoning. Both GPT-4 and LLaMa-2 adjust their strategies based on game structure and context, but LLaMa-2 demonstrates a more nuanced understanding of the games' underlying mechanics. These results highlight the current limitations and varied proficiencies of LLMs in strategic decision-making, and caution against their unqualified use in tasks requiring complex strategic reasoning.
results: 根据实验结果,RT-LM可以减少响应时间和提高吞吐量,但是runtime开销很小。Abstract
Recent advancements in language models (LMs) have gained substantial attentions on their capability to generate human-like responses. Though exhibiting a promising future for various applications such as conversation AI, these LMs face deployment challenges on various devices due to their extreme computational cost and unpredictable inference latency. Such varied inference latency, identified as a consequence of uncertainty intrinsic to the nature of language, can lead to computational inefficiency and degrade the overall performance of LMs, especially under high-traffic workloads. Unfortunately, the bandwidth of these uncertainty sources is extensive, complicating the prediction of latency and the effects emanating from such uncertainties. To understand and mitigate the impact of uncertainty on real-time response-demanding systems, we take the first step to comprehend, quantify and optimize these uncertainty-induced latency performance variations in LMs. Specifically, we present RT-LM, an uncertainty-aware resource management ecosystem for real-time inference of LMs. RT-LM innovatively quantifies how specific input uncertainties, adversely affect latency, often leading to an increased output length. Exploiting these insights, we devise a lightweight yet effective method to dynamically correlate input text uncertainties with output length at runtime. Utilizing this quantification as a latency heuristic, we integrate the uncertainty information into a system-level scheduler which explores several uncertainty-induced optimization opportunities, including uncertainty-aware prioritization, dynamic consolidation, and strategic CPU offloading. Quantitative experiments across five state-of-the-art LMs on two hardware platforms demonstrates that RT-LM can significantly reduce the average response time and improve throughput while incurring a rather small runtime overhead.
摘要
RT-LM aims to comprehend, quantify, and optimize the uncertainty-induced latency performance variations in LMs. We present a lightweight yet effective method to dynamically correlate input text uncertainties with output length at runtime. By exploiting these insights, we integrate the uncertainty information into a system-level scheduler, which explores several uncertainty-induced optimization opportunities, including uncertainty-aware prioritization, dynamic consolidation, and strategic CPU offloading.Experiments on five state-of-the-art LMs on two hardware platforms show that RT-LM can significantly reduce the average response time and improve throughput while incurring a small runtime overhead. Our approach can effectively mitigate the impact of uncertainty on real-time response-demanding systems, enabling the widespread adoption of LMs in various applications such as conversation AI.
Text Encoders Lack Knowledge: Leveraging Generative LLMs for Domain-Specific Semantic Textual Similarity
results: 我们的结果显示,使用生成语言模型(LLM)在具有世界知识的STS任务上的性能比使用编码器基于模型更高,在医学、政治和体育等领域的新收集的STS挑战集上,生成语言模型与STS特定的提示策略相结合可以 дости到状态之巅的性能。Abstract
Amidst the sharp rise in the evaluation of large language models (LLMs) on various tasks, we find that semantic textual similarity (STS) has been under-explored. In this study, we show that STS can be cast as a text generation problem while maintaining strong performance on multiple STS benchmarks. Additionally, we show generative LLMs significantly outperform existing encoder-based STS models when characterizing the semantic similarity between two texts with complex semantic relationships dependent on world knowledge. We validate this claim by evaluating both generative LLMs and existing encoder-based STS models on three newly collected STS challenge sets which require world knowledge in the domains of Health, Politics, and Sports. All newly collected data is sourced from social media content posted after May 2023 to ensure the performance of closed-source models like ChatGPT cannot be credited to memorization. Our results show that, on average, generative LLMs outperform the best encoder-only baselines by an average of 22.3% on STS tasks requiring world knowledge. Our results suggest generative language models with STS-specific prompting strategies achieve state-of-the-art performance in complex, domain-specific STS tasks.
摘要
在大语言模型(LLM)评估的快速上升中, semantic textual similarity(STS)却被忽视了。在这项研究中,我们发现STS可以被视为文本生成问题,同时保持多个STS benchmark task的优秀表现。此外,我们发现生成型LLM在描述两个文本之间的Semantic关系时,表现出色,特别是在具有世界知识的复杂Semantic关系上。我们 validate这一点通过对生成型LLM和现有encoder-based STS模型在健康、政治和运动等领域的三个新收集的STS挑战集上进行评估。所有新收集的数据都来自社交媒体上的内容,其中大部分是在2023年5月之后发布的,以避免Memorization的问题。我们的结果表明,在需要世界知识的STS任务中,生成型LLM平均表现出色,高于最佳encoder-only baseline的22.3%。我们的结果表明,使用STS特定的提示策略的生成语言模型在复杂的领域特定STS任务中实现了状态的表现。
Overview of Memotion 3: Sentiment and Emotion Analysis of Codemixed Hinglish Memes
results: 本研究发现了50多个参与者注册参与共同任务,5个参与者提交了测试集的最终Submission。最佳最终F1分数为Task A为34.41,Task B为79.77,Task C为59.82。Abstract
Analyzing memes on the internet has emerged as a crucial endeavor due to the impact this multi-modal form of content wields in shaping online discourse. Memes have become a powerful tool for expressing emotions and sentiments, possibly even spreading hate and misinformation, through humor and sarcasm. In this paper, we present the overview of the Memotion 3 shared task, as part of the DeFactify 2 workshop at AAAI-23. The task released an annotated dataset of Hindi-English code-mixed memes based on their Sentiment (Task A), Emotion (Task B), and Emotion intensity (Task C). Each of these is defined as an individual task and the participants are ranked separately for each task. Over 50 teams registered for the shared task and 5 made final submissions to the test set of the Memotion 3 dataset. CLIP, BERT modifications, ViT etc. were the most popular models among the participants along with approaches such as Student-Teacher model, Fusion, and Ensembling. The best final F1 score for Task A is 34.41, Task B is 79.77 and Task C is 59.82.
摘要
互联网上的迷因分析已经成为一项重要的专业,因为这种多Modal的内容可以影响网络上的讨论。迷因已经成为一个强大的表达情感和意见的工具,可能 même 传播仇恨和误information,通过幽默和讽刺。在这篇文章中,我们介绍了Memotion 3共同任务的概观,这是DeFactify 2会议上的一部分。这个任务发布了统计数据集的印地语-英语混合迷因,并分为三个任务:情感(任务A)、情感(任务B)和情感强度(任务C)。每个任务都是一个独立的任务,参赛者会被分别排名。这个任务获得了超过50队的注册,并有5队提交了测试集的最终Submission。参赛者主要使用CLIP、BERT修改和ViT等模型,以及学生教师模型、融合和投票等方法。最佳的最终F1分 для任务A是34.41,任务B是79.77,任务C是59.82。
Leveraging Large Language Models and Weak Supervision for Social Media data annotation: an evaluation using COVID-19 self-reported vaccination tweets
results: 研究发现,GPT-4(2023年3月23日版本)在自动标注COVID-19疫苗相关 tweet 方面的表现与人类标注者相当,且可以快速、高效地进行自动标注。Abstract
The COVID-19 pandemic has presented significant challenges to the healthcare industry and society as a whole. With the rapid development of COVID-19 vaccines, social media platforms have become a popular medium for discussions on vaccine-related topics. Identifying vaccine-related tweets and analyzing them can provide valuable insights for public health research-ers and policymakers. However, manual annotation of a large number of tweets is time-consuming and expensive. In this study, we evaluate the usage of Large Language Models, in this case GPT-4 (March 23 version), and weak supervision, to identify COVID-19 vaccine-related tweets, with the purpose of comparing performance against human annotators. We leveraged a manu-ally curated gold-standard dataset and used GPT-4 to provide labels without any additional fine-tuning or instructing, in a single-shot mode (no additional prompting).
摘要
COVID-19 大流行带来了医疗业和社会全面的挑战。随着 COVID-19 疫苗的快速发展,社交媒体平台上有大量有关疫苗的讨论。可以通过分析这些微博来获得有价值的公共卫生研究人员和政策制定者的洞察。但是,手动标注大量微博是时间consuming 和昂贵的。在本研究中,我们评估了大型自然语言模型(GPT-4,2023年3月23日版)和弱级指导,用于标识 COVID-19 疫苗相关的微博,并与人工标注器进行比较。我们利用了手动精心挑选的金标准数据集,并使用 GPT-4 提供标签,无需任何额外的调整或指导,在单射模式下(没有额外的提示)。
Leveraging Large Language Models for Automated Dialogue Analysis
paper_authors: Sarah E. Finch, Ellie S. Paek, Jinho D. Choi
For: The paper aims to assess the ability of a state-of-the-art large language model (LLM) to detect nine categories of undesirable behaviors in real human-bot dialogues.* Methods: The paper uses a state-of-the-art LLM, ChatGPT-3.5, to perform dialogue behavior detection and compares its performance with specialized detection models.* Results: The paper finds that neither ChatGPT nor specialized models have yet achieved satisfactory results for this task, falling short of human performance. However, ChatGPT shows promising potential and often outperforms specialized detection models.Here are the three points in Simplified Chinese text:* For: 本研究用于评估一个状态rut-of-the-art的大语言模型(LLM)在真实人机对话中自动识别九种不良行为的能力。* Methods: 本研究使用状态rut-of-the-art的LLM,ChatGPT-3.5,进行对话行为检测,并与专门的检测模型进行比较。* Results: 本研究发现 neither ChatGPT nor specialized models have yet achieved satisfactory results for this task, falling short of human performance. However, ChatGPT shows promising potential and often outperforms specialized detection models.Abstract
Developing high-performing dialogue systems benefits from the automatic identification of undesirable behaviors in system responses. However, detecting such behaviors remains challenging, as it draws on a breadth of general knowledge and understanding of conversational practices. Although recent research has focused on building specialized classifiers for detecting specific dialogue behaviors, the behavior coverage is still incomplete and there is a lack of testing on real-world human-bot interactions. This paper investigates the ability of a state-of-the-art large language model (LLM), ChatGPT-3.5, to perform dialogue behavior detection for nine categories in real human-bot dialogues. We aim to assess whether ChatGPT can match specialized models and approximate human performance, thereby reducing the cost of behavior detection tasks. Our findings reveal that neither specialized models nor ChatGPT have yet achieved satisfactory results for this task, falling short of human performance. Nevertheless, ChatGPT shows promising potential and often outperforms specialized detection models. We conclude with an in-depth examination of the prevalent shortcomings of ChatGPT, offering guidance for future research to enhance LLM capabilities.
摘要
发展高性能对话系统受益于自动识别对话系统回应中不良行为的能力。然而,检测这些行为仍然是一项挑战,因为它需要对对话实践的广泛知识和理解。虽然最近的研究主要集中在建立特殊的对话行为分类器,但行为覆盖率仍然不完整,并且缺乏实际人机交互的测试。本文研究了一个现代大语言模型(LLM)ChatGPT-3.5在真实人机对话中的对话行为检测能力。我们希望评估ChatGPT是否能够与专门的模型相匹配,并且 approximates human performance,以降低行为检测任务的成本。我们发现 neither specialized models nor ChatGPT have yet achieved satisfactory results for this task, falling short of human performance。然而,ChatGPT表现了良好的潜力,并且经常超过专门的检测模型。我们结束于对ChatGPT缺陷的深入分析,并提供未来研究进一步增强LLM能力的指导。
Widely Interpretable Semantic Representation: Frameless Meaning Representation for Broader Applicability
paper_authors: Lydia Feng, Gregor Williamson, Han He, Jinho D. Choi
for: This paper presents a novel semantic representation, WISeR, to overcome challenges in Abstract Meaning Representation (AMR).
methods: The paper examines the numbered arguments of predicates in AMR and converts them to thematic roles, improving the inter-annotator agreement for beginner and experienced annotators.
results: The WISeR model exhibits higher accuracy than its AMR counterpart across the board, demonstrating that WISeR is easier for parsers to learn.Here’s the same information in Simplified Chinese text:
for: 这篇论文提出了一种新的Semantic Representation,WISeR,以解决Abstract Meaning Representation(AMR)中的挑战。
results: WISeR模型在所有板块中表现出了高度的准确性,证明WISeR更易 für parser 学习。Abstract
This paper presents a novel semantic representation, WISeR, that overcomes challenges for Abstract Meaning Representation (AMR). Despite its strengths, AMR is not easily applied to languages or domains without predefined semantic frames, and its use of numbered arguments results in semantic role labels, which are not directly interpretable and are semantically overloaded for parsers. We examine the numbered arguments of predicates in AMR and convert them to thematic roles that do not require reference to semantic frames. We create a new corpus of 1K English dialogue sentences annotated in both WISeR and AMR. WISeR shows stronger inter-annotator agreement for beginner and experienced annotators, with beginners becoming proficient in WISeR annotation more quickly. Finally, we train a state-of-the-art parser on the AMR 3.0 corpus and a WISeR corpus converted from AMR 3.0. The parser is evaluated on these corpora and our dialogue corpus. The WISeR model exhibits higher accuracy than its AMR counterpart across the board, demonstrating that WISeR is easier for parsers to learn.
摘要
We create a new corpus of 1,000 English dialogue sentences annotated in both WISeR and AMR. Our results show that WISeR has stronger inter-annotator agreement for both beginner and experienced annotators, with beginners becoming proficient in WISeR annotation more quickly. Additionally, we train a state-of-the-art parser on the AMR 3.0 corpus and a WISeR corpus converted from AMR 3.0. The parser is evaluated on these corpora and our dialogue corpus, and the WISeR model exhibits higher accuracy than its AMR counterpart across the board, demonstrating that WISeR is easier for parsers to learn.
Recovering from Privacy-Preserving Masking with Large Language Models
results: 实验结果表明,使用匿名标识符替换后,模型在下游自然语言处理任务中能够保持与原始数据无隐私保护的同等性能。Abstract
Model adaptation is crucial to handle the discrepancy between proxy training data and actual users data received. To effectively perform adaptation, textual data of users is typically stored on servers or their local devices, where downstream natural language processing (NLP) models can be directly trained using such in-domain data. However, this might raise privacy and security concerns due to the extra risks of exposing user information to adversaries. Replacing identifying information in textual data with a generic marker has been recently explored. In this work, we leverage large language models (LLMs) to suggest substitutes of masked tokens and have their effectiveness evaluated on downstream language modeling tasks. Specifically, we propose multiple pre-trained and fine-tuned LLM-based approaches and perform empirical studies on various datasets for the comparison of these methods. Experimental results show that models trained on the obfuscation corpora are able to achieve comparable performance with the ones trained on the original data without privacy-preserving token masking.
摘要
模型适应是处理代理训练数据和实际用户数据接收的差异的关键。为了有效地进行适应,用户的文本数据通常会被存储在服务器或本地设备上,以便下游自然语言处理(NLP)模型可以直接使用这些领域数据进行训练。然而,这可能会带来隐私和安全问题,因为披露用户信息会增加对敌人的风险。将用户信息从文本数据中替换为通用标识符已经被研究。在这种工作中,我们利用大语言模型(LLM)来建议替换的Marker Token,并对这些方法的效果进行了实验研究。specifically,我们提出了多种预训练和精度调整的LLM基于方法,并在不同的数据集上进行了实验比较这些方法的效果。实验结果表明,使用隐藏 corpora 进行训练的模型能够达到与没有隐藏 token 训练的模型相同的性能。
results: 该论文表明,使用CTS可以提高自动生成相关工作的准确性,并且可以避免非事实的幻想。Abstract
Automatic related work generation must ground their outputs to the content of the cited papers to avoid non-factual hallucinations, but due to the length of scientific documents, existing abstractive approaches have conditioned only on the cited paper \textit{abstracts}. We demonstrate that the abstract is not always the most appropriate input for citation generation and that models trained in this way learn to hallucinate. We propose to condition instead on the \textit{cited text span} (CTS) as an alternative to the abstract. Because manual CTS annotation is extremely time- and labor-intensive, we experiment with automatic, ROUGE-based labeling of candidate CTS sentences, achieving sufficiently strong performance to substitute for expensive human annotations, and we propose a human-in-the-loop, keyword-based CTS retrieval approach that makes generating citation texts grounded in the full text of cited papers both promising and practical.
摘要
自动生成相关工作必须将输出锚定到引用文献中的内容,以避免非实际的幻觉。但由于科学文献的长度,现有的抽象方法只做到了基于引用文献摘要进行conditioning。我们示示了摘要并不总是最适合的引用生成输入,并且模型在这种情况下会学习幻觉。我们提议在代之之前使用引用文献中的特定文本段(CTS)作为输入,因为手动标注CTS是非常时间和劳动密集的。我们对候选CTS句子使用ROUGE基于的自动标注方法进行实验,得到了充分的表现,以至于可以代替昂贵的人工标注。此外,我们也提出了人在循环的键盘基于CTS检索方法,使得生成引用文本与引用文献的全文相关。
Learning to Predict Concept Ordering for Common Sense Generation
results: 研究发现,BART-large模型在CommonGen训练数据中的概念顺序下 consistently 表现最佳,并且比较小的LM可以在这个任务上表现更好于大型GPT3-based LLMs。此外,人工标注的输入概念集顺序可以独立地提供最佳的 sentence 生成结果,并且超过了基于概念顺序的随机化策略。Abstract
Prior work has shown that the ordering in which concepts are shown to a commonsense generator plays an important role, affecting the quality of the generated sentence. However, it remains a challenge to determine the optimal ordering of a given set of concepts such that a natural sentence covering all the concepts could be generated from a pretrained generator. To understand the relationship between the ordering of the input concepts and the quality of the generated sentences, we conduct a systematic study considering multiple language models (LMs) and concept ordering strategies. We find that BART-large model consistently outperforms all other LMs considered in this study when fine-tuned using the ordering of concepts as they appear in CommonGen training data as measured using multiple evaluation metrics. Moreover, the larger GPT3-based large language models (LLMs) variants do not necessarily outperform much smaller LMs on this task, even when fine-tuned on task-specific training data. Interestingly, human annotators significantly reorder input concept sets when manually writing sentences covering those concepts, and this ordering provides the best sentence generations independently of the LM used for the generation, outperforming a probabilistic concept ordering baseline
摘要
We found that the BART-large model consistently outperforms all other LMs considered in this study when fine-tuned using the ordering of concepts as they appear in CommonGen training data, as measured by multiple evaluation metrics. Additionally, we found that the larger GPT3-based large language models (LLMs) variants do not necessarily outperform smaller LMs on this task, even when fine-tuned on task-specific training data.Interestingly, human annotators significantly reorder input concept sets when manually writing sentences covering those concepts, and this ordering provides the best sentence generations independently of the LM used for generation, outperforming a probabilistic concept ordering baseline.
results: 实验结果表明,该方法可以提高 LLMs 的推理能力,并且可以轻松地与其他语言模型、提示方法和集成技术结合使用Abstract
Reasoning presents a significant and challenging issue for Large Language Models (LLMs). The predominant focus of research has revolved around developing diverse prompting strategies to guide and structure the reasoning processes of LLMs. However, these approaches based on decoder-only causal language models often operate the input question in a single forward pass, potentially missing the rich, back-and-forth interactions inherent in human reasoning. Scant attention has been paid to a critical dimension, i.e., the input question itself embedded within the prompts. In response, we introduce a deceptively simple yet highly effective prompting strategy, termed question "re-reading". Drawing inspiration from human learning and problem-solving, re-reading entails revisiting the question information embedded within input prompts. This approach aligns seamlessly with the cognitive principle of reinforcement, enabling LLMs to extract deeper insights, identify intricate patterns, establish more nuanced connections, and ultimately enhance their reasoning capabilities across various tasks. Experiments conducted on a series of reasoning benchmarks serve to underscore the effectiveness and generality of our method. Moreover, our findings demonstrate that our approach seamlessly integrates with various language models, though-eliciting prompting methods, and ensemble techniques, further underscoring its versatility and compatibility in the realm of LLMs.
摘要
The first step is the hardest: Pitfalls of Representing and Tokenizing Temporal Data for Large Language Models
results: 这篇论文表明,许多语言模型在处理时间序数据时会错误地分词,并且提出了一些可能的解决方案,如使用轻量级嵌入层和多模态适配器。Abstract
Large Language Models (LLMs) have demonstrated remarkable generalization across diverse tasks, leading individuals to increasingly use them as personal assistants and universal computing engines. Nevertheless, a notable obstacle emerges when feeding numerical/temporal data into these models, such as data sourced from wearables or electronic health records. LLMs employ tokenizers in their input that break down text into smaller units. However, tokenizers are not designed to represent numerical values and might struggle to understand repetitive patterns and context, treating consecutive values as separate tokens and disregarding their temporal relationships. Here, we discuss recent works that employ LLMs for human-centric tasks such as in mobile health sensing and present a case study showing that popular LLMs tokenize temporal data incorrectly. To address that, we highlight potential solutions such as prompt tuning with lightweight embedding layers as well as multimodal adapters, that can help bridge this "modality gap". While the capability of language models to generalize to other modalities with minimal or no finetuning is exciting, this paper underscores the fact that their outputs cannot be meaningful if they stumble over input nuances.
摘要
In this paper, we discuss recent works that use LLMs for human-centric tasks such as mobile health sensing and present a case study showing that popular LLMs tokenize temporal data incorrectly. To address this issue, we highlight potential solutions such as prompt tuning with lightweight embedding layers and multimodal adapters, which can help bridge the "modality gap". While the capability of language models to generalize to other modalities with minimal or no finetuning is exciting, this paper emphasizes that their outputs cannot be meaningful if they stumble over input nuances.
Human Action Co-occurrence in Lifestyle Vlogs using Graph Link Prediction
results: 研究发现图表非常适合捕捉人体动作之间的关系,并且学习的图表表示法对该任务具有高效性和可靠性。同时,该研究还发现了一些新和相关的信息,这些信息可以在不同的数据领域中找到应用。Abstract
We introduce the task of automatic human action co-occurrence identification, i.e., determine whether two human actions can co-occur in the same interval of time. We create and make publicly available the ACE (Action Co-occurrencE) dataset, consisting of a large graph of ~12k co-occurring pairs of visual actions and their corresponding video clips. We describe graph link prediction models that leverage visual and textual information to automatically infer if two actions are co-occurring. We show that graphs are particularly well suited to capture relations between human actions, and the learned graph representations are effective for our task and capture novel and relevant information across different data domains. The ACE dataset and the code introduced in this paper are publicly available at https://github.com/MichiganNLP/vlog_action_co-occurrence.
摘要
我们介绍了自动人体动作协同识别任务,即判断两个人体动作是否可以在同一个时间间协同出现。我们创建了ACE(动作协同)数据集,包含约12k个相互协同的视觉动作对和其相应的视频片段。我们描述了基于视觉和文本信息的图链预测模型,可以自动推断两个动作是否协同出现。我们发现图是特别适合捕捉人体动作之间的关系,并且学习的图表示是我们任务中效果很高,并且在不同的数据领域中捕捉到了新和有关的信息。ACE数据集和我们在本篇文章中介绍的代码都公开可用于https://github.com/MichiganNLP/vlog_action_co-occurrence。
results: 通过人工评估和数据分析,发现该评估方法与人类判断有高度相关性,可以用于评估不同DTM的性能和指导未来研究。Abstract
There is a lack of quantitative measures to evaluate the progression of topics through time in dynamic topic models (DTMs). Filling this gap, we propose a novel evaluation measure for DTMs that analyzes the changes in the quality of each topic over time. Additionally, we propose an extension combining topic quality with the model's temporal consistency. We demonstrate the utility of the proposed measure by applying it to synthetic data and data from existing DTMs. We also conducted a human evaluation, which indicates that the proposed measure correlates well with human judgment. Our findings may help in identifying changing topics, evaluating different DTMs, and guiding future research in this area.
摘要
DTMs 缺乏时间序量化评价标准,为此,我们提出了一种新的评价标准,用于评估 DTMs 中话题的时间发展质量。此外,我们还提出了结合话题质量和模型时间一致性的扩展。我们在synthetic data和现有 DTMs 数据上应用了该标准,并进行了人类评价,结果显示了与人类判断的高度相关性。我们的发现可能有助于 indentifying changing topics, evaluating different DTMs, and guiding future research in this area.Note: "DTMs" stands for "dynamic topic models".
Improving and Evaluating the Detection of Fragmentation in News Recommendations with the Clustering of News Story Chains
for: This paper aims to provide an extensive investigation of various approaches for quantifying Fragmentation in news recommendations, with the goal of improving the accuracy of measuring the degree of fragmentation of information streams in news recommendations.
methods: The paper uses Natural Language Processing (NLP) techniques, specifically agglomerative hierarchical clustering coupled with SentenceBERT text representation, to identify distinct news events, stories, or timelines and measure Fragmentation.
results: The paper finds that the proposed approach of agglomerative hierarchical clustering coupled with SentenceBERT text representation is substantially better at detecting Fragmentation than earlier implementations, and provides valuable insights and recommendations for stakeholders concerning the measurement and interpretation of Fragmentation.Abstract
News recommender systems play an increasingly influential role in shaping information access within democratic societies. However, tailoring recommendations to users' specific interests can result in the divergence of information streams. Fragmented access to information poses challenges to the integrity of the public sphere, thereby influencing democracy and public discourse. The Fragmentation metric quantifies the degree of fragmentation of information streams in news recommendations. Accurate measurement of this metric requires the application of Natural Language Processing (NLP) to identify distinct news events, stories, or timelines. This paper presents an extensive investigation of various approaches for quantifying Fragmentation in news recommendations. These approaches are evaluated both intrinsically, by measuring performance on news story clustering, and extrinsically, by assessing the Fragmentation scores of different simulated news recommender scenarios. Our findings demonstrate that agglomerative hierarchical clustering coupled with SentenceBERT text representation is substantially better at detecting Fragmentation than earlier implementations. Additionally, the analysis of simulated scenarios yields valuable insights and recommendations for stakeholders concerning the measurement and interpretation of Fragmentation.
摘要
AKEM: Aligning Knowledge Base to Queries with Ensemble Model for Entity Recognition and Linking
results: 该方法可以高效地处理数据,并实现了 F1 分数为 0.535。Abstract
This paper presents a novel approach to address the Entity Recognition and Linking Challenge at NLPCC 2015. The task involves extracting named entity mentions from short search queries and linking them to entities within a reference Chinese knowledge base. To tackle this problem, we first expand the existing knowledge base and utilize external knowledge to identify candidate entities, thereby improving the recall rate. Next, we extract features from the candidate entities and utilize Support Vector Regression and Multiple Additive Regression Tree as scoring functions to filter the results. Additionally, we apply rules to further refine the results and enhance precision. Our method is computationally efficient and achieves an F1 score of 0.535.
摘要
这篇论文提出了一种新的方法来解决2015年NLPCC会议上的实体识别和连接挑战。任务是从短搜索查询中提取命名实体提及,并将其与中国知识库中的实体进行连接。为解决这个问题,我们首先扩展了现有的知识库,并利用外部知识来确定候选实体,从而提高了受检测率。接着,我们从候选实体中提取特征,并使用支持向量回归和多项加itive树分类函数来筛选结果。此外,我们还应用规则来进一步精细化结果,提高精度。我们的方法具有计算效率,并实现了F1分数0.535。
Overview of GUA-SPA at IberLEF 2023: Guarani-Spanish Code Switching Analysis
results: 三个队伍在评估阶段参与了评估,在总体来说取得了良好的结果 для任务1,但是对于任务2和3的结果则有所不同。Abstract
We present the first shared task for detecting and analyzing code-switching in Guarani and Spanish, GUA-SPA at IberLEF 2023. The challenge consisted of three tasks: identifying the language of a token, NER, and a novel task of classifying the way a Spanish span is used in the code-switched context. We annotated a corpus of 1500 texts extracted from news articles and tweets, around 25 thousand tokens, with the information for the tasks. Three teams took part in the evaluation phase, obtaining in general good results for Task 1, and more mixed results for Tasks 2 and 3.
摘要
我们现在介绍GUA-SPA的首次共同任务,即检测和分析库亚语和西班牙语的代码 switching。这个挑战包括三个任务:确定一个令素的语言,名实 recognize,以及一个新的任务,即在代码 switching 上下文中分类西班牙语句子的使用方式。我们为这个任务annotated一个新闻文章和微博中的1500篇文本,约25000个令素,以便提供任务的信息。三支队伍参与了评估阶段,在总体来说取得了良好的成绩, Task 1 的结果,而 Tasks 2 和 3 的结果则更为杂mix。
Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts
results: 该研究发现,大约半数的原本被视为安全的提示 benchmarks 可以通过 manipulate 提示来绕过已经部署的安全机制,包括概念 removals、负提示和安全指导。这些发现表明,不进行全面测试,就可能得出假的安全感,文本到图像模型可能会生成不安全或版权问题的图像。Abstract
Text-to-image diffusion models, e.g. Stable Diffusion (SD), lately have shown remarkable ability in high-quality content generation, and become one of the representatives for the recent wave of transformative AI. Nevertheless, such advance comes with an intensifying concern about the misuse of this generative technology, especially for producing copyrighted or NSFW (i.e. not safe for work) images. Although efforts have been made to filter inappropriate images/prompts or remove undesirable concepts/styles via model fine-tuning, the reliability of these safety mechanisms against diversified problematic prompts remains largely unexplored. In this work, we propose Prompting4Debugging (P4D) as a debugging and red-teaming tool that automatically finds problematic prompts for diffusion models to test the reliability of a deployed safety mechanism. We demonstrate the efficacy of our P4D tool in uncovering new vulnerabilities of SD models with safety mechanisms. Particularly, our result shows that around half of prompts in existing safe prompting benchmarks which were originally considered "safe" can actually be manipulated to bypass many deployed safety mechanisms, including concept removal, negative prompt, and safety guidance. Our findings suggest that, without comprehensive testing, the evaluations on limited safe prompting benchmarks can lead to a false sense of safety for text-to-image models.
摘要
文本到图像扩散模型,如稳定扩散(SD),最近显示了高质量内容生成的惊人能力,成为最近一波转化AI的代表之一。然而,这种进步也带来了对这种生成技术的滥用的担忧,特别是生成版权或不安全的图像(i.e.不适合工作)。虽然努力在滥用图像/提示或移除不适合的概念/风格方面进行模型细化,但这些安全机制的可靠性在多样化问题上仍然未得到探索。在这项工作中,我们提出了Prompting4Debugging(P4D)作为调试和红团工具,自动找到 diffusion 模型的异常提示,以测试已经部署的安全机制的可靠性。我们示出了P4D工具在SD模型上的有效性,并显示了大约半数的提示在现有的安全提示benchmark中被原本认为是安全的,但实际上可以通过许多已部署的安全机制进行滥用。我们的发现表明,不完全测试可能会导致对文本到图像模型的评估产生假象的安全性。
Annotating Data for Fine-Tuning a Neural Ranker? Current Active Learning Strategies are not Better than Random Selection
results: 研究发现,在不同的随机选择的训练数据 subsets 中精致化 PLM rankers 会具有很大的差异,这表明可以通过活动选择训练数据 subsets 来实现更高的效率。然而,这篇论文发现,现有的活动学习(AL)策略在PLM rankers的精致化中不能够实现更高的效率,并且与随机选择相比,AL策略需要更多的评估成本。Abstract
Search methods based on Pretrained Language Models (PLM) have demonstrated great effectiveness gains compared to statistical and early neural ranking models. However, fine-tuning PLM-based rankers requires a great amount of annotated training data. Annotating data involves a large manual effort and thus is expensive, especially in domain specific tasks. In this paper we investigate fine-tuning PLM-based rankers under limited training data and budget. We investigate two scenarios: fine-tuning a ranker from scratch, and domain adaptation starting with a ranker already fine-tuned on general data, and continuing fine-tuning on a target dataset. We observe a great variability in effectiveness when fine-tuning on different randomly selected subsets of training data. This suggests that it is possible to achieve effectiveness gains by actively selecting a subset of the training data that has the most positive effect on the rankers. This way, it would be possible to fine-tune effective PLM rankers at a reduced annotation budget. To investigate this, we adapt existing Active Learning (AL) strategies to the task of fine-tuning PLM rankers and investigate their effectiveness, also considering annotation and computational costs. Our extensive analysis shows that AL strategies do not significantly outperform random selection of training subsets in terms of effectiveness. We further find that gains provided by AL strategies come at the expense of more assessments (thus higher annotation costs) and AL strategies underperform random selection when comparing effectiveness given a fixed annotation cost. Our results highlight that ``optimal'' subsets of training data that provide high effectiveness at low annotation cost do exist, but current mainstream AL strategies applied to PLM rankers are not capable of identifying them.
摘要
基于预训言语模型(PLM)的搜索方法在效果上有显著提升,但是精细调整PLM-based ranker需要大量标注数据。标注数据需要大量人工劳动,因此成本高,特别是在域pecific任务中。在这篇论文中,我们研究了在有限的培训数据和预算下进行PLM-based ranker的精细调整。我们研究了两个情况:从scratch开始精细调整rankers,以及在泛化数据上精细调整rankers,然后在目标数据上继续精细调整。我们发现在不同随机选择的培训数据上进行精细调整时,效果很大。这表示可以通过活动选择培训数据来提高PLM rankers的效果,而不需要大量的标注预算。为了 investigates这一点,我们采用了现有的活动学习(AL)策略,并对其效果进行了广泛的分析。我们发现,与随机选择的培训数据相比,AL策略不能显著提高效果。此外,AL策略相比随机选择,需要更多的评估(即更高的标注成本),并且在给定标注成本下,AL策略下表现较差。我们的结果表明,“优化”的培训数据集,可以提高PLM rankers的效果,但现有的主流AL策略无法确定这些集。
AstroLLaMA: Towards Specialized Foundation Models in Astronomy
paper_authors: Tuan Dung Nguyen, Yuan-Sen Ting, Ioana Ciucă, Charlie O’Neill, Ze-Chang Sun, Maja Jabłońska, Sandor Kruk, Ernest Perkowski, Jack Miller, Jason Li, Josh Peek, Kartheik Iyer, Tomasz Różański, Pranav Khetarpal, Sharaf Zaman, David Brodrick, Sergio J. Rodríguez Méndez, Thang Bui, Alyssa Goodman, Alberto Accomazzi, Jill Naiman, Jesse Cranney, Kevin Schawinski, UniverseTBD
for: bridging the gap between large language models and highly specialized domains like scholarly astronomy
methods: fine-tuning a 7-billion-parameter model from LLaMA-2 using over 300,000 astronomy abstracts from arXiv, optimized for traditional causal language modeling
results: achieving a 30% lower perplexity than Llama-2, generating more insightful and scientifically relevant text completions and embedding extraction than state-of-the-art foundation models despite having significantly fewer parametersAbstract
Large language models excel in many human-language tasks but often falter in highly specialized domains like scholarly astronomy. To bridge this gap, we introduce AstroLLaMA, a 7-billion-parameter model fine-tuned from LLaMA-2 using over 300,000 astronomy abstracts from arXiv. Optimized for traditional causal language modeling, AstroLLaMA achieves a 30% lower perplexity than Llama-2, showing marked domain adaptation. Our model generates more insightful and scientifically relevant text completions and embedding extraction than state-of-the-arts foundation models despite having significantly fewer parameters. AstroLLaMA serves as a robust, domain-specific model with broad fine-tuning potential. Its public release aims to spur astronomy-focused research, including automatic paper summarization and conversational agent development.
摘要
大型语言模型在许多人类语言任务中表现出色,但在高度特殊化的学术天文领域中往往表现不佳。为了bridging这个差距,我们介绍AstroLLaMA,一个基于LLaMA-2的70亿个 Parameters模型,通过arXiv上的30万篇天文摘要文献进行精细调整。这个模型适用于传统的 causal 语言模型,与LLaMA-2相比,每个字的误差率下降了30%,表明了域 adaptation。我们的模型在对天文领域的文本完成和嵌入EXTRACTING方面表现更加具有意义和科学相关性,即使有较少的参数。AstroLLaMA是一个强大的专业领域模型,具有广泛的 fine-tuning 潜力。公开发布AstroLLaMA,以促进天文研究,包括自动摘要和对话代理开发。
Characterizing Latent Perspectives of Media Houses Towards Public Figures
results: 结果表明,使用该方法可以生成准确的人EntityCharacterization,并且比预先训练的模型更加准确。Abstract
Media houses reporting on public figures, often come with their own biases stemming from their respective worldviews. A characterization of these underlying patterns helps us in better understanding and interpreting news stories. For this, we need diverse or subjective summarizations, which may not be amenable for classifying into predefined class labels. This work proposes a zero-shot approach for non-extractive or generative characterizations of person entities from a corpus using GPT-2. We use well-articulated articles from several well-known news media houses as a corpus to build a sound argument for this approach. First, we fine-tune a GPT-2 pre-trained language model with a corpus where specific person entities are characterized. Second, we further fine-tune this with demonstrations of person entity characterizations, created from a corpus of programmatically constructed characterizations. This twice fine-tuned model is primed with manual prompts consisting of entity names that were not previously encountered in the second fine-tuning, to generate a simple sentence about the entity. The results were encouraging, when compared against actual characterizations from the corpus.
摘要
媒体机构报道公众人物,经常带有自己的偏见,源于自己的世界观。了解这些底层模式,能够帮助我们更好地理解和解释新闻故事。为此,我们需要多样化或主观概要,这些概要可能无法被归类为预定的类别。这项工作提出了一种零批处理方法,通过GPT-2进行非抽取式或生成性人EntityCharacterizations。我们使用了多种知名新闻媒体的报道,构建了一个具有听众力的论证。首先,我们精度地调整GPT-2预训练语言模型,使其与特定人EntityCharacterizations相关的 corpus进行了精度调整。然后,我们进一步精度调整这个模型,使其能够生成基于 manually constructed characterizations的示例。这两次精度调整的模型,通过手动提供实体名称,并不在第二次精度调整中出现过的实体名称,来生成简单的句子。结果非常鼓舞人,与实际 corpus 中的Characterizations相比。
results: 我们在两个 dataset 上进行了实验,得到了鲜明的成果。具体来说,在中文税onomy dataset上,我们的方法相比原始方法提高了准确率8.75%。此外,我们的方法还在中文税onomy dataset上比ChatGPT better。Abstract
Taxonomy expansion task is essential in organizing the ever-increasing volume of new concepts into existing taxonomies. Most existing methods focus exclusively on using textual semantics, leading to an inability to generalize to unseen terms and the "Prototypical Hypernym Problem." In this paper, we propose Visual Taxonomy Expansion (VTE), introducing visual features into the taxonomy expansion task. We propose a textual hypernymy learning task and a visual prototype learning task to cluster textual and visual semantics. In addition to the tasks on respective modalities, we introduce a hyper-proto constraint that integrates textual and visual semantics to produce fine-grained visual semantics. Our method is evaluated on two datasets, where we obtain compelling results. Specifically, on the Chinese taxonomy dataset, our method significantly improves accuracy by 8.75 %. Additionally, our approach performs better than ChatGPT on the Chinese taxonomy dataset.
摘要
《税onomy扩展任务是组织新的概念 volume 的关键,因为现有的方法仅专注于使用文本 semantics,导致无法扩展至未见到的概念和"Prototypical Hypernym Problem"。在本文中,我们提出了可视的税onomy扩展(VTE),将可视特征加入税onomy扩展任务中。我们提出了文本层次学习任务和可视标本学习任务,以排序文本和可视 semantics。此外,我们引入了文本和可视 semantics的超类征约制,以生成细部可视 semantics。我们的方法在两个数据集上进行评估,结果表明我们的方法在中文税onomy数据集上提高了精度 by 8.75%,并且比ChatGPT在中文税onomy数据集上表现更好。
Measuring Catastrophic Forgetting in Cross-Lingual Transfer Paradigms: Exploring Tuning Strategies
results: 研究结果显示,在两个不同的分类问题(词语攻击和产品评论)上,IT跨语言策略在目标语言上表现更好。此外,研究发现,在多种跨语言传递中,CLV策略在基础语言(英语)中的知识抑制比IT策略更强。Abstract
The cross-lingual transfer is a promising technique to solve tasks in less-resourced languages. In this empirical study, we compare two fine-tuning approaches combined with zero-shot and full-shot learning approaches for large language models in a cross-lingual setting. As fine-tuning strategies, we compare parameter-efficient adapter methods with fine-tuning of all parameters. As cross-lingual transfer strategies, we compare the intermediate-training (\textit{IT}) that uses each language sequentially and cross-lingual validation (\textit{CLV}) that uses a target language already in the validation phase of fine-tuning. We assess the success of transfer and the extent of catastrophic forgetting in a source language due to cross-lingual transfer, i.e., how much previously acquired knowledge is lost when we learn new information in a different language. The results on two different classification problems, hate speech detection and product reviews, each containing datasets in several languages, show that the \textit{IT} cross-lingual strategy outperforms \textit{CLV} for the target language. Our findings indicate that, in the majority of cases, the \textit{CLV} strategy demonstrates superior retention of knowledge in the base language (English) compared to the \textit{IT} strategy, when evaluating catastrophic forgetting in multiple cross-lingual transfers.
摘要
cross-lingual transfer是一种有前途的技术,可以解决少语言资源的任务。在这个实验研究中,我们比较了两种精细调整方法,与零架构学习和全架构学习方法结合使用大语言模型在跨语言设置下进行比较。作为精细调整策略,我们比较了参数有效的适配器方法和所有参数的 fine-tuning。作为跨语言传递策略,我们比较了中间训练(IT),使用每种语言的顺序训练,以及跨语言验证(CLV),在练习阶段对 targets 语言进行验证。我们评估了跨语言传递的成功和源语言中的恶性遗弃现象,即在学习新语言时,之前学习的知识多少会丢失。我们在两个不同的分类问题,即词汇攻击和产品评论,每个问题都包含多种语言的数据集,得到的结果表明,对于目标语言,IT 跨语言策略表现出色。我们的发现表明,在大多数情况下,CLV 跨语言策略在多个跨语言传递中表现出更好的知识保留性,对于基础语言(英语)进行评估。
BHASA: A Holistic Southeast Asian Linguistic and Cultural Evaluation Suite for Large Language Models
results: 这个论文的初步实验表明,使用GPT-4作为参考点,这些南东亚语言的大语言模型在语言技能、文化表达和敏感性方面都存在缺陷。Abstract
The rapid development of Large Language Models (LLMs) and the emergence of novel abilities with scale have necessitated the construction of holistic, diverse and challenging benchmarks such as HELM and BIG-bench. However, at the moment, most of these benchmarks focus only on performance in English and evaluations that include Southeast Asian (SEA) languages are few in number. We therefore propose BHASA, a holistic linguistic and cultural evaluation suite for LLMs in SEA languages. It comprises three components: (1) a NLP benchmark covering eight tasks across Natural Language Understanding (NLU), Generation (NLG) and Reasoning (NLR) tasks, (2) LINDSEA, a linguistic diagnostic toolkit that spans the gamut of linguistic phenomena including syntax, semantics and pragmatics, and (3) a cultural diagnostics dataset that probes for both cultural representation and sensitivity. For this preliminary effort, we implement the NLP benchmark only for Indonesian, Vietnamese, Thai and Tamil, and we only include Indonesian and Tamil for LINDSEA and the cultural diagnostics dataset. As GPT-4 is purportedly one of the best-performing multilingual LLMs at the moment, we use it as a yardstick to gauge the capabilities of LLMs in the context of SEA languages. Our initial experiments on GPT-4 with BHASA find it lacking in various aspects of linguistic capabilities, cultural representation and sensitivity in the targeted SEA languages. BHASA is a work in progress and will continue to be improved and expanded in the future. The repository for this paper can be found at: https://github.com/aisingapore/BHASA
摘要
<> translate_language: zh-CNThe rapid development of Large Language Models (LLMs) and the emergence of novel abilities with scale have necessitated the construction of holistic, diverse and challenging benchmarks such as HELM and BIG-bench. However, at the moment, most of these benchmarks focus only on performance in English and evaluations that include Southeast Asian (SEA) languages are few in number. We therefore propose BHASA, a holistic linguistic and cultural evaluation suite for LLMs in SEA languages. It comprises three components: (1) a NLP benchmark covering eight tasks across Natural Language Understanding (NLU), Generation (NLG) and Reasoning (NLR) tasks, (2) LINDSEA, a linguistic diagnostic toolkit that spans the gamut of linguistic phenomena including syntax, semantics and pragmatics, and (3) a cultural diagnostics dataset that probes for both cultural representation and sensitivity. For this preliminary effort, we implement the NLP benchmark only for Indonesian, Vietnamese, Thai and Tamil, and we only include Indonesian and Tamil for LINDSEA and the cultural diagnostics dataset. As GPT-4 is purportedly one of the best-performing multilingual LLMs at the moment, we use it as a yardstick to gauge the capabilities of LLMs in the context of SEA languages. Our initial experiments on GPT-4 with BHASA find it lacking in various aspects of linguistic capabilities, cultural representation and sensitivity in the targeted SEA languages. BHASA is a work in progress and will continue to be improved and expanded in the future. The repository for this paper can be found at: https://github.com/aisingapore/BHASANote: The translation is in Simplified Chinese, which is the standard writing system used in mainland China and Singapore.
RAP-Gen: Retrieval-Augmented Patch Generation with CodeT5 for Automatic Program Repair
paper_authors: Weishi Wang, Yue Wang, Shafiq Joty, Steven C. H. Hoi for: 本研究的目的是提高自动程序修复(APR)的效能,以减少开发人员的手动调试努力并提高软件可靠性。methods: 本研究使用了深度学习(DL)基于的方法,通过在数据驱动方式下自动化程序修复过程。另外,我们还使用了一种混合的修补检索器,以便在不同的语言环境下进行lexical和semantic匹配。results: 我们的实验结果表明,RAP-Gen可以在三个benchmark上显著超越之前的状态态的方法,例如在818个Defects4J bug中修复15个更多的bug。Abstract
Automatic program repair (APR) is crucial to reduce manual debugging efforts for developers and improve software reliability. While conventional search-based techniques typically rely on heuristic rules or a redundancy assumption to mine fix patterns, recent years have witnessed the surge of deep learning (DL) based approaches to automate the program repair process in a data-driven manner. However, their performance is often limited by a fixed set of parameters to model the highly complex search space of APR. To ease such burden on the parametric models, in this work, we propose a novel Retrieval-Augmented Patch Generation framework (RAP-Gen) by explicitly leveraging relevant fix patterns retrieved from a codebase of previous bug-fix pairs. Specifically, we build a hybrid patch retriever to account for both lexical and semantic matching based on the raw source code in a language-agnostic manner, which does not rely on any code-specific features. In addition, we adapt a code-aware language model CodeT5 as our foundation model to facilitate both patch retrieval and generation tasks in a unified manner. We adopt a stage-wise approach where the patch retriever first retrieves a relevant external bug-fix pair to augment the buggy input for the CodeT5 patch generator, which synthesizes a ranked list of repair patch candidates. Notably, RAP-Gen is a generic APR framework that can flexibly integrate different patch retrievers and generators to repair various types of bugs. We thoroughly evaluate RAP-Gen on three benchmarks in two programming languages, including the TFix benchmark in JavaScript, and Code Refinement and Defects4J benchmarks in Java, where the bug localization information may or may not be provided. Experimental results show that RAP-Gen significantly outperforms previous state-of-the-art approaches on all benchmarks, e.g., repairing 15 more bugs on 818 Defects4J bugs.
摘要
自动化程序修复(APR)是软件可靠性的关键因素,可以减少开发人员的手动调试努力并提高软件的可靠性。传统的搜索基本技术通常采用规则或减少假设来挖掘修复模式,而 recent years 有所见到 Deep Learning(DL) 基于的方法来自动化程序修复过程。但是,它们的性能通常受到一组固定参数来模型高度复杂的修复空间的限制。为了减轻这种固定参数的负担,在这种工作中,我们提出了一种 novel Retrieval-Augmented Patch Generation 框架(RAP-Gen),通过显式地利用 Codebase 中的修复模式来提高修复效果。 Specifically,我们构建了一种 hybrid 修复搜索器,可以根据源代码的语言无关方式进行 both lexical 和 semantic 匹配,而不需要任何代码特定的特征。此外,我们采用 Code-aware 语言模型 CodeT5 作为基础模型,以便在一个简单的方式下进行修复搜索和生成任务。我们采用分阶段的方法,首先由修复搜索器 retrieved 一个相关的外部修复对,然后将其与 CodeT5 修复生成器进行结合,以生成一个排名列表中的修复补丁候选者。需要注意的是,RAP-Gen 是一种通用的 APR 框架,可以适应不同的修复任务和语言。我们对 TFix benchmark 、Code Refinement 和 Defects4J benchmark 进行了严格的测试,结果表明,RAP-Gen 在所有benchmark上显著超过了前一个状态的方法,例如,对 818 Defects4J bug 进行修复。
How does representation impact in-context learning: A exploration on a synthetic task
paper_authors: Jingwen Fu, Tao Yang, Yuwang Wang, Yan Lu, Nanning Zheng
for: investigate the mechanism of in-context learning in Transformer
methods: construct a novel synthetic task, use two probes to evaluate in-weights and in-context components
results: demonstrate the entanglement between in-context learning and representation learning, and the importance of in-weights component for in-context learningAbstract
In-context learning, i.e., learning from in-context samples, is an impressive ability of Transformer. However, the mechanism driving the in-context learning is not yet fully understood. In this study, we aim to investigate from an underexplored perspective of representation learning. The representation is more complex for in-context learning senario, where the representation can be impacted by both model weights and in-context samples. We refer the above two conceptually aspects of representation as in-weight component and in-context component, respectively. To study how the two components affect in-context learning capabilities, we construct a novel synthetic task, making it possible to device two probes, in-weights probe and in-context probe, to evaluate the two components, respectively. We demonstrate that the goodness of in-context component is highly related to the in-context learning performance, which indicates the entanglement between in-context learning and representation learning. Furthermore, we find that a good in-weights component can actually benefit the learning of the in-context component, indicating that in-weights learning should be the foundation of in-context learning. To further understand the the in-context learning mechanism and importance of the in-weights component, we proof by construction that a simple Transformer, which uses pattern matching and copy-past mechanism to perform in-context learning, can match the in-context learning performance with more complex, best tuned Transformer under the perfect in-weights component assumption. In short, those discoveries from representation learning perspective shed light on new approaches to improve the in-context capacity.
摘要
受Context学习,即通过在Context中的样本学习,是Transformer的一项惊人能力。然而,这种学习机制仍未完全理解。在这项研究中,我们尝试从 representation learning 的一个未经探索的角度来研究。在这种情况下,表示更加复杂,因为表示可以受到模型参数和Context中的样本影响。我们将这两个概念性方面的表示称为内重Component和Context Component,分别。为了研究这两个组件如何影响Context learning能力,我们构建了一个新的 sintetic任务,使得可以设置两个探针,即内重探针和Context探针,来评估这两个组件。我们发现,Context component 的质量与 Context learning 性能之间存在很高的相关性,这表明Context learning 和 representation learning 之间存在紧密的关系。此外,我们发现一个好的内重Component 可以实际提高Context component 的学习效果,这表明内重学习应该是Context learning 的基础。为了更深入地理解Context learning 机制和内重Component 的重要性,我们证明了一个简单的Transformer模型,通过模式匹配和复制机制来实现Context learning,可以与best tuned Transformer 模型匹配Context learning性能,假设内重Component 完美。总之,这些发现从 representation learning 的角度提供了新的方法来提高Context capacity。
Narrowing the Gap between Supervised and Unsupervised Sentence Representation Learning with Large Language Model
paper_authors: Mingxin Li, Richong Zhang, Zhijie Nie, Yongyi Mao
for: 本研究的目的是解释supervised和unsupervised Contrastive learning of Sentence Embeddings (CSE)在训练过程中的性能差距,以及如何减少这个差距。
methods: 本研究使用了empirical experiments和metric called Fitting Difficulty Increment (FDI)来解释和解决性能差距问题。
results: 研究发现,性能差距的主要原因是训练数据集和评估数据集的适应度差异,并提出了一种基于LLM生成数据的方法来减少性能差距。Abstract
Sentence Representation Learning (SRL) is a fundamental task in Natural Language Processing (NLP), with Contrastive learning of Sentence Embeddings (CSE) as the mainstream technique due to its superior performance. An intriguing phenomenon in CSE is the significant performance gap between supervised and unsupervised methods, even when their sentence encoder and loss function are the same. Previous works attribute this performance gap to differences in two representation properties (alignment and uniformity). However, alignment and uniformity only measure the results, which means they cannot answer "What happens during the training process that leads to the performance gap?" and "How can the performance gap be narrowed?". In this paper, we conduct empirical experiments to answer these "What" and "How" questions. We first answer the "What" question by thoroughly comparing the behavior of supervised and unsupervised CSE during their respective training processes. From the comparison, We observe a significant difference in fitting difficulty. Thus, we introduce a metric, called Fitting Difficulty Increment (FDI), to measure the fitting difficulty gap between the evaluation dataset and the held-out training dataset, and use the metric to answer the "What" question. Then, based on the insights gained from the "What" question, we tackle the "How" question by increasing the fitting difficulty of the training dataset. We achieve this by leveraging the In-Context Learning (ICL) capability of the Large Language Model (LLM) to generate data that simulates complex patterns. By utilizing the hierarchical patterns in the LLM-generated data, we effectively narrow the gap between supervised and unsupervised CSE.
摘要
我们首先回答"What"问题,对监督和无监督CSE在训练过程中的行为进行了仔细比较。从比较中,我们发现监督CSE在训练过程中的适应 difficulty 和无监督CSE相比较大。因此,我们引入一个指标,叫做适应难度增量 (FDI),用于度量监督和无监督CSE在评估集和封锁训练集之间的适应难度差距。然后,根据FDI指标,我们回答"What"问题。接着,基于获得的"What"问题的回答,我们解决"How"问题。我们通过利用大语言模型 (LLM) 的启发学习 (ICL) 能力,生成数据,模拟复杂的模式。通过利用 LLB 生成的数据中的层次模式,我们有效地缩小了监督和无监督CSE之间的性能差距。
Content Reduction, Surprisal and Information Density Estimation for Long Documents
results: 研究发现了不同领域的长文档信息密度系统性差异,并且对自动医疗代码生成从长医疗记录中表现效果良好。Abstract
Many computational linguistic methods have been proposed to study the information content of languages. We consider two interesting research questions: 1) how is information distributed over long documents, and 2) how does content reduction, such as token selection and text summarization, affect the information density in long documents. We present four criteria for information density estimation for long documents, including surprisal, entropy, uniform information density, and lexical density. Among those criteria, the first three adopt the measures from information theory. We propose an attention-based word selection method for clinical notes and study machine summarization for multiple-domain documents. Our findings reveal the systematic difference in information density of long text in various domains. Empirical results on automated medical coding from long clinical notes show the effectiveness of the attention-based word selection method.
摘要
多种计算语言学方法已经被提出来研究语言信息内容。我们考虑了两个有趣的研究问题:1)在长文档中如何分布信息,2)如何采用内容减少方法(如选择 Token 和文本摘要)影响长文档中的信息密度。我们提出了四个信息密度估计标准,包括悬念度、熵、一致信息密度和词汇密度。其中前三个采用信息理论的度量。我们提议使用注意力基于词选择方法来处理医疗记录,并对多个领域文档进行机器摘要。我们的发现显示了不同领域的长文档信息密度存在系统性的差异。对自动医疗代码生成从长医疗记录的实验结果表明了注意力基于词选择方法的效果。
Kid-Whisper: Towards Bridging the Performance Gap in Automatic Speech Recognition for Children VS. Adults
results: 研究显示,通过更有效的数据处理,可以将Word Error Rate(WER)在MyST测试集下降至9.11%(Whisper-Small)和8.61%(Whisper-Medium),并且这种改进可以普适应用于未经看过的数据集。此外,研究还揭示了儿童语音识别系统的一些重要挑战。Abstract
Recent advancements in Automatic Speech Recognition (ASR) systems, exemplified by Whisper, have demonstrated the potential of these systems to approach human-level performance given sufficient data. However, this progress doesn't readily extend to ASR for children due to the limited availability of suitable child-specific databases and the distinct characteristics of children's speech. A recent study investigated leveraging the My Science Tutor (MyST) children's speech corpus to enhance Whisper's performance in recognizing children's speech. They were able to demonstrate some improvement on a limited testset. This paper builds on these findings by enhancing the utility of the MyST dataset through more efficient data preprocessing. We reduce the Word Error Rate (WER) on the MyST testset 13.93% to 9.11% with Whisper-Small and from 13.23% to 8.61% with Whisper-Medium and show that this improvement can be generalized to unseen datasets. We also highlight important challenges towards improving children's ASR performance. The results showcase the viable and efficient integration of Whisper for effective children's speech recognition.
摘要
Improving Robustness of Neural Inverse Text Normalization via Data-Augmentation, Semi-Supervised Learning, and Post-Aligning Method
methods: 这 paper 提出了一种直接训练方法, 使用 ASR 生成的 spoken 或 written 文本,并通过 ASR 语言上下文模拟和 semi-supervised learning 方法增强。 此外,paper 还引入了一种后置对aligning 方法来管理不可预测的错误,以提高 ITN 的可靠性。
results: experiments 表明,paper 提出的方法在多种 ASR 场景中显著提高了 ITN 性能。Abstract
Inverse text normalization (ITN) is crucial for converting spoken-form into written-form, especially in the context of automatic speech recognition (ASR). While most downstream tasks of ASR rely on written-form, ASR systems often output spoken-form, highlighting the necessity for robust ITN in product-level ASR-based applications. Although neural ITN methods have shown promise, they still encounter performance challenges, particularly when dealing with ASR-generated spoken text. These challenges arise from the out-of-domain problem between training data and ASR-generated text. To address this, we propose a direct training approach that utilizes ASR-generated written or spoken text, with pairs augmented through ASR linguistic context emulation and a semi-supervised learning method enhanced by a large language model, respectively. Additionally, we introduce a post-aligning method to manage unpredictable errors, thereby enhancing the reliability of ITN. Our experiments show that our proposed methods remarkably improved ITN performance in various ASR scenarios.
摘要
倒计时normalization (ITN) 是对话式文本转换到书面文本的关键技术,尤其在自动语音识别 (ASR) 的上下文中。大多数 ASR 下游任务需要书面文本,但 ASR 系统通常输出说话式文本,因此需要Robust ITN 在产品级 ASR 应用中。虽然神经 ITN 方法有 shown 搅拌,但它们在处理 ASR 生成的说话文本时仍然遇到性能挑战。这些挑战来自于 ASR 生成的文本与训练数据之间的域外问题。为 Addressing 这个问题,我们提议一种直接训练方法,利用 ASR 生成的书面或说话文本,并通过 ASR 语言上下文模拟和大型语言模型增强的半监督学习方法,分别对待不同的 ASR enario。此外,我们还引入了一种后对aligning 方法,以管理不可预测的错误,从而提高 ITN 的可靠性。我们的实验表明,我们的提议方法在多种 ASR 情况下有remarkably 改善 ITN 性能。
Performance of ChatGPT-3.5 and GPT-4 on the United States Medical Licensing Examination With and Without Distractions
results: 结果表明,在多选问题上,chatGPT 3.5 的回答精度降低了,从 72.1% 降低到 68.9%,而在开题问题上,降低到 61.5%,比对应正确答案的比例为 44.3%。相比之下,chatGPT 4 在两类问题上的回答精度均高于 3.5 版本,不受小话信息的影响。Abstract
As Large Language Models (LLMs) are predictive models building their response based on the words in the prompts, there is a risk that small talk and irrelevant information may alter the response and the suggestion given. Therefore, this study aims to investigate the impact of medical data mixed with small talk on the accuracy of medical advice provided by ChatGPT. USMLE step 3 questions were used as a model for relevant medical data. We use both multiple choice and open ended questions. We gathered small talk sentences from human participants using the Mechanical Turk platform. Both sets of USLME questions were arranged in a pattern where each sentence from the original questions was followed by a small talk sentence. ChatGPT 3.5 and 4 were asked to answer both sets of questions with and without the small talk sentences. A board-certified physician analyzed the answers by ChatGPT and compared them to the formal correct answer. The analysis results demonstrate that the ability of ChatGPT-3.5 to answer correctly was impaired when small talk was added to medical data for multiple-choice questions (72.1\% vs. 68.9\%) and open questions (61.5\% vs. 44.3\%; p=0.01), respectively. In contrast, small talk phrases did not impair ChatGPT-4 ability in both types of questions (83.6\% and 66.2\%, respectively). According to these results, ChatGPT-4 seems more accurate than the earlier 3.5 version, and it appears that small talk does not impair its capability to provide medical recommendations. Our results are an important first step in understanding the potential and limitations of utilizing ChatGPT and other LLMs for physician-patient interactions, which include casual conversations.
摘要
LLMS(大语言模型)是基于提示语言的预测模型,因此可能存在小说和无关信息影响其回答和建议的风险。这项研究旨在研究将医疗数据混合到小说中对ChatGPT提供的医疗建议精度的影响。我们使用USMLE步骤3题目作为相关医疗数据模型。我们使用多选和开放题目两种类型。我们从人工智能 Turk 平台获得了小说句子。我们将USMLE题目分配成一种模式,其中每个原始句子后接一个小说句子。ChatGPT 3.5和4被要求回答这两个集合的问题,包括和小说句子。一位资深的医生分析了ChatGPT的答案并与正确答案进行比较。分析结果显示,当小说句子添加到医疗数据时,ChatGPT-3.5 的回答正确率下降(72.1% vs. 68.9%)和开放题目中的回答正确率下降(61.5% vs. 44.3%)。相比之下,小说句子不会对ChatGPT-4 的回答造成影响(83.6%和66.2%)。根据这些结果,ChatGPT-4 显示更加准确,而小说不会对其医疗建议能力产生影响。这些结果是我们理解LLMS在实际医疗互动中的潜在和局限性的重要一步。
Circuit Breaking: Removing Model Behaviors with Targeted Ablation
paper_authors: Maximilian Li, Xander Davies, Max Nadeau
For: 降低 GPT-2 语言生成中的偏见行为* Methods: 范例小数据集中找到关键 causal 通路,并删除这些通路以关键排除偏见行为* Results: 删除 12 条 causal 通路可以严重降低偏见语言生成,并对其他输入的性能几乎没有影响Abstract
Language models often exhibit behaviors that improve performance on a pre-training objective but harm performance on downstream tasks. We propose a novel approach to removing undesirable behaviors by ablating a small number of causal pathways between model components, with the intention of disabling the computational circuit responsible for the bad behavior. Given a small dataset of inputs where the model behaves poorly, we learn to ablate a small number of important causal pathways. In the setting of reducing GPT-2 toxic language generation, we find ablating just 12 of the 11.6K causal edges mitigates toxic generation with minimal degradation of performance on other inputs.
摘要
机器学习模型经常展现出提高预训练目标性能的行为,但是却害下游任务性能。我们提出了一种新的方法,通过缺省少量的 causal 路径间的缺省来消除不良行为。通过一小量的输入数据,我们学习缺省少量的重要 causal 路径。在减少 GPT-2 毒性语言生成中,我们发现缺省12个 causal 边对毒性语言生成产生了很好的效果,而不会对其他输入产生很大的影响。
Evaluating the Ebb and Flow: An In-depth Analysis of Question-Answering Trends across Diverse Platforms
paper_authors: Rima Hazra, Agnik Saha, Somnath Banerjee, Animesh Mukherjee
for: This paper aims to investigate the factors that contribute to the speed of responses on Community Question Answering (CQA) platforms.
methods: The authors analyze six highly popular CQA platforms and identify correlations between the time taken to yield the first response to a question and various variables, including metadata and patterns of user interaction. They also employ conventional machine learning models to predict which queries will receive prompt responses.
results: The study finds a correlation between the time taken to yield the first response and several variables, including the formulation of the questions and the level of interaction among users. The authors also demonstrate the feasibility of using machine learning models to predict prompt responses.Abstract
Community Question Answering (CQA) platforms steadily gain popularity as they provide users with fast responses to their queries. The swiftness of these responses is contingent on a mixture of query-specific and user-related elements. This paper scrutinizes these contributing factors within the context of six highly popular CQA platforms, identified through their standout answering speed. Our investigation reveals a correlation between the time taken to yield the first response to a question and several variables: the metadata, the formulation of the questions, and the level of interaction among users. Additionally, by employing conventional machine learning models to analyze these metadata and patterns of user interaction, we endeavor to predict which queries will receive their initial responses promptly.
摘要
社区问答平台(CQA)的流行程度逐渐增长,因为它们为用户提供了快速的答案。这种快速答案的速度受到多种问题特定和用户相关的因素的影响。这篇论文在六个非常受欢迎的CQA平台上 investigate这些贡献因素,并通过使用传统的机器学习模型分析这些元数据和用户交互的模式,尝试预测哪些问题会收到快速的初始答案。Here's a word-for-word translation:社区问答平台(CQA)的流行程度逐渐增长,因为它们为用户提供了快速的答案。这种快速答案的速度受到多种问题特定和用户相关的因素的影响。这篇论文在六个非常受欢迎的CQA平台上 investigate这些贡献因素,并通过使用传统的机器学习模型分析这些元数据和用户交互的模式,尝试预测哪些问题会收到快速的初始答案。
The Moral Machine Experiment on Large Language Models
results: 研究发现,尽管 LLM 和人类的偏好在一些方面相似,但 PaLM 2 和 Llama 2 等模型尤其存在明显的偏差。此外,尽管qualitative上的偏好相似,但 LLM 可能会偏向更加坚定的决策,与人类的偏好相比。这些发现可能有助于我们更好地理解 LLM 的伦理框架,并对道路自动驾驶的发展产生影响。Abstract
As large language models (LLMs) become more deeply integrated into various sectors, understanding how they make moral judgments has become crucial, particularly in the realm of autonomous driving. This study utilized the Moral Machine framework to investigate the ethical decision-making tendencies of prominent LLMs, including GPT-3.5, GPT-4, PaLM 2, and Llama 2, comparing their responses to human preferences. While LLMs' and humans' preferences such as prioritizing humans over pets and favoring saving more lives are broadly aligned, PaLM 2 and Llama 2, especially, evidence distinct deviations. Additionally, despite the qualitative similarities between the LLM and human preferences, there are significant quantitative disparities, suggesting that LLMs might lean toward more uncompromising decisions, compared to the milder inclinations of humans. These insights elucidate the ethical frameworks of LLMs and their potential implications for autonomous driving.
摘要
large language models (LLMs) 在不同领域深入整合后,理解它们如何作出道德判断成为了重要的焦点,尤其在自动驾驶领域。这个研究使用道德机器框架进行 investigated the ethical decision-making tendencies of prominent LLMs, including GPT-3.5, GPT-4, PaLM 2, and Llama 2, and compared their responses to human preferences。 although LLMs' and humans' preferences such as prioritizing humans over pets and favoring saving more lives are broadly aligned,PaLM 2 and Llama 2, especially, evidence distinct deviations。 In addition, despite the qualitative similarities between the LLM and human preferences, there are significant quantitative disparities, suggesting that LLMs might lean toward more uncompromising decisions, compared to the milder inclinations of humans。 These insights elucidate the ethical frameworks of LLMs and their potential implications for autonomous driving。
Balanced and Explainable Social Media Analysis for Public Health with Large Language Models
results: 根据实验结果,提出的 ALEX 方法在 Social Media Mining for Health 2023 (SMM4H) 竞赛中的三个任务中得到了杰出的表现,在两个任务中得到了第一名。Abstract
As social media becomes increasingly popular, more and more public health activities emerge, which is worth noting for pandemic monitoring and government decision-making. Current techniques for public health analysis involve popular models such as BERT and large language models (LLMs). Although recent progress in LLMs has shown a strong ability to comprehend knowledge by being fine-tuned on specific domain datasets, the costs of training an in-domain LLM for every specific public health task are especially expensive. Furthermore, such kinds of in-domain datasets from social media are generally highly imbalanced, which will hinder the efficiency of LLMs tuning. To tackle these challenges, the data imbalance issue can be overcome by sophisticated data augmentation methods for social media datasets. In addition, the ability of the LLMs can be effectively utilised by prompting the model properly. In light of the above discussion, in this paper, a novel ALEX framework is proposed for social media analysis on public health. Specifically, an augmentation pipeline is developed to resolve the data imbalance issue. Furthermore, an LLMs explanation mechanism is proposed by prompting an LLM with the predicted results from BERT models. Extensive experiments conducted on three tasks at the Social Media Mining for Health 2023 (SMM4H) competition with the first ranking in two tasks demonstrate the superior performance of the proposed ALEX method. Our code has been released in https://github.com/YanJiangJerry/ALEX.
摘要
为了适应社交媒体日益普及,更多的公共健康活动在发展,这对抗疫病监测和政府决策都是值得注意的。当前的公共健康分析技术主要基于受欢迎的模型BERT和大型自然语言模型(LLM)。虽然最近的LLM进步显示在特定领域数据集上精细调整后具有强大的知识把握能力,但是训练专门领域LLM的成本尤其高昂。此外,这些社交媒体数据集通常具有很高的不均衡性,这会降低LLM的调整效率。为了解决这些挑战,本文提出了一种novel ALEX框架,用于社交媒体分析。特别是,我们开发了一个数据增强管线,以解决数据不均衡问题。此外,我们还提出了一种LLM的解释机制,通过向LLM提供BERT模型预测结果进行引导。经过广泛的实验,我们在2023年社交媒体矿山健康大赛(SMM4H)中的三个任务中获得了第一名。我们的代码已经在https://github.com/YanJiangJerry/ALEX上发布。
Language Models as Black-Box Optimizers for Vision-Language Models
paper_authors: Samuel Yu, Shihong Liu, Zhiqiu Lin, Deepak Pathak, Deva Ramanan for: 这个研究旨在开发一种基于自然语言提示的视觉语言模型(VLM)微调方法,以避免需要存取模型参数、特征嵌入或输出寄存器。methods: 我们提出了一种使用对话式大语言模型(LLM)作为黑盒优化器,通过自动“山丘攀登”程序,让 LLM 根据文本反馈来调整提示,以实现最佳提示的搜寻。results: 在一个挑战性的1架学习设置下,我们的简单方法比 white-box 连续提示方法 CoOp 高出1.5%的平均准确率 across 11个数据集,包括 ImageNet。我们的方法还超过 OpenAI 手动制作的提示和其他黑盒方法 like iterative APE,并且发现文本提示生成的过程不仅更加可读性,而且可以在不同的 CLIP 架构上传递。Abstract
Vision-language models (VLMs) pre-trained on web-scale datasets have demonstrated remarkable capabilities across a variety of vision and multimodal tasks. Currently, fine-tuning methods for VLMs mainly operate in a white-box setting, requiring access to model parameters for backpropagation. However, many VLMs rely on proprietary data and are not open-source, which restricts the use of white-box approaches for fine-tuning. Given that popular private large language models (LLMs) like ChatGPT still offer a language-based user interface, we aim to develop a novel fine-tuning approach for VLMs through natural language prompts, thereby avoiding the need to access model parameters, feature embeddings, or output logits. In this setup, we propose employing chat-based LLMs as black-box optimizers to search for the best text prompt on the illustrative task of few-shot image classification using CLIP. Specifically, we adopt an automatic "hill-climbing" procedure that converges on an effective prompt by evaluating the accuracy of current prompts and asking LLMs to refine them based on textual feedback, all within a conversational process without human-in-the-loop. In a challenging 1-shot learning setup, our simple approach surpasses the white-box continuous prompting method CoOp by an average of 1.5% across 11 datasets including ImageNet. Our approach also outperforms OpenAI's manually crafted prompts and is more efficient than other black-box methods like iterative APE. Additionally, we highlight the advantage of conversational feedback incorporating both positive and negative prompts, suggesting that LLMs can utilize the implicit "gradient" direction in textual feedback for a more efficient search. Lastly, we find that the text prompts generated through our strategy are not only more interpretable but also transfer well across different CLIP architectures in a black-box manner.
摘要
现代视觉语言模型(VLM)在大规模网络数据上进行预训练后,在视觉和多模态任务上表现出了惊人的能力。然而,现有的细化方法主要 operate在白盒子 Setting中,需要访问模型参数进行反射。然而,许多VLM rely on proprietary data,这限制了使用白盒子approach for fine-tuning。在这种情况下,我们提出了一种新的细化方法,通过自然语言提示来避免访问模型参数、特征嵌入和输出征。在这种设置中,我们提议使用流行私人大型语言模型(LLM)like ChatGPT作为黑盒子优化器,通过自然语言反馈来搜索最佳提示。具体来说,我们采用了一种自动“山丘攀 climbing”过程,通过评估当前提示的准确率,请LLM进行提示的修改,以达到最佳提示。在一个挑战性的1shot learning设置下,我们的简单方法比白盒子连续提示方法CoOp高平均1.5% across 11 datasets,包括ImageNet。我们的方法还超过OpenAI manually 制作的提示,并且更高效 чем其他黑盒子方法,如迭代APE。此外,我们发现通过 incorporating both positive和negative提示,LLMs可以利用文本反馈中的隐式“梯度”方向进行更加高效的搜索。最后,我们发现通过我们的策略生成的文本提示不仅更加可读性高,还可以在黑盒子方式下跨不同的 CLIP 架构传输。
Do PLMs Know and Understand Ontological Knowledge?
results: 研究结果表明PLMs可以记忆一定的ontological knowledge,并且可以使用这些知识进行逻辑推理。然而,PLMs的记忆和逻辑推理性能都不完善, indicating that PLMs的ontological knowledge是部分的和不够深入的。Abstract
Ontological knowledge, which comprises classes and properties and their relationships, is integral to world knowledge. It is significant to explore whether Pretrained Language Models (PLMs) know and understand such knowledge. However, existing PLM-probing studies focus mainly on factual knowledge, lacking a systematic probing of ontological knowledge. In this paper, we focus on probing whether PLMs store ontological knowledge and have a semantic understanding of the knowledge rather than rote memorization of the surface form. To probe whether PLMs know ontological knowledge, we investigate how well PLMs memorize: (1) types of entities; (2) hierarchical relationships among classes and properties, e.g., Person is a subclass of Animal and Member of Sports Team is a subproperty of Member of ; (3) domain and range constraints of properties, e.g., the subject of Member of Sports Team should be a Person and the object should be a Sports Team. To further probe whether PLMs truly understand ontological knowledge beyond memorization, we comprehensively study whether they can reliably perform logical reasoning with given knowledge according to ontological entailment rules. Our probing results show that PLMs can memorize certain ontological knowledge and utilize implicit knowledge in reasoning. However, both the memorizing and reasoning performances are less than perfect, indicating incomplete knowledge and understanding.
摘要
ontological knowledge,包括类和属性之间的关系,对世界知识是基础性的。但是,现有的 PLM 探测研究主要集中在事实知识上,缺乏系统性的探测ontological knowledge。本文将关注 PLM 是否具备ontological knowledge,并且是否具备semantic理解这种知识,而不是只是表面上的记忆。为了探测 PLM 是否知道ontological knowledge,我们调查 PLM 是否能够记忆以下三种内容:1. 类型的实体,例如 Person 是 Animal 的 subclass。2. 类和属性之间的层次关系,例如 Member of Sports Team 是 Member of 的 subproperty。3. 属性的域和范围约束,例如 Member of Sports Team 的主题应该是 Person,而 objet 应该是 Sports Team。为了更加全面地探测 PLM 是否真正理解ontological knowledge,我们进行了系统性的逻辑推理测试,根据 ontological entailment 规则。我们的探测结果表明,PLMs 可以记忆一些ontological knowledge,并且在推理过程中可以利用隐式知识。但是,记忆和推理的性能都不完美,表明 PLMs 的知识和理解仍有限制。
results: 对比原始GNN层,布雷格曼GNN层能够更好地 mitigate 过度简化问题,并且在多层情况下仍然保持良好的学习精度。Abstract
Numerous recent research on graph neural networks (GNNs) has focused on formulating GNN architectures as an optimization problem with the smoothness assumption. However, in node classification tasks, the smoothing effect induced by GNNs tends to assimilate representations and over-homogenize labels of connected nodes, leading to adverse effects such as over-smoothing and misclassification. In this paper, we propose a novel bilevel optimization framework for GNNs inspired by the notion of Bregman distance. We demonstrate that the GNN layer proposed accordingly can effectively mitigate the over-smoothing issue by introducing a mechanism reminiscent of the "skip connection". We validate our theoretical results through comprehensive empirical studies in which Bregman-enhanced GNNs outperform their original counterparts in both homophilic and heterophilic graphs. Furthermore, our experiments also show that Bregman GNNs can produce more robust learning accuracy even when the number of layers is high, suggesting the effectiveness of the proposed method in alleviating the over-smoothing issue.
摘要
Recent research on graph neural networks (GNNs) has focused on formulating GNN architectures as optimization problems with the smoothness assumption. However, in node classification tasks, the smoothing effect induced by GNNs tends to assimilate representations and over-homogenize labels of connected nodes, leading to adverse effects such as over-smoothing and misclassification. In this paper, we propose a novel bilevel optimization framework for GNNs inspired by the notion of Bregman distance. We demonstrate that the GNN layer proposed accordingly can effectively mitigate the over-smoothing issue by introducing a mechanism reminiscent of the "skip connection". We validate our theoretical results through comprehensive empirical studies in which Bregman-enhanced GNNs outperform their original counterparts in both homophilic and heterophilic graphs. Furthermore, our experiments also show that Bregman GNNs can produce more robust learning accuracy even when the number of layers is high, suggesting the effectiveness of the proposed method in alleviating the over-smoothing issue.Here is the translation in Traditional Chinese:近期研究Graph Neural Networks (GNNs) 的方法都集中在设计GNN架构为优化问题中的匀数假设。然而,在节点分类任务中,GNNs 对节点的数据汇合和调整导致节点的表现变得太平等,从而导致过滤和错分类的问题。在这篇论文中,我们提出了一个新的两级优化框架 для GNNs, inspirited by Bregman distance的想法。我们显示了这种GNN层可以有效地减少过滤的问题,通过引入一种"skip connection"的机制。我们透过实验 validate 我们的理论结果,并证明了Bregman-enhanced GNNs 在同样的节点分类任务中表现更好,并且在不同的节点分布情况下也能够获得更好的性能。
Audio-Based Classification of Respiratory Diseases using Advanced Signal Processing and Machine Learning for Assistive Diagnosis Support
results: 研究获得了87%的准确率来分别识别健康和疾病个体,并且还使用了六种分类模型来诊断呼吸系统疾病,例如肺炎和呼吸道疾病(COPD)。此外,这个研究还提出了一个年龄和体重指数(BMI)估计模型,以及一个性别分类模型,全都基于呼吸音数据。Abstract
In global healthcare, respiratory diseases are a leading cause of mortality, underscoring the need for rapid and accurate diagnostics. To advance rapid screening techniques via auscultation, our research focuses on employing one of the largest publicly available medical database of respiratory sounds to train multiple machine learning models able to classify different health conditions. Our method combines Empirical Mode Decomposition (EMD) and spectral analysis to extract physiologically relevant biosignals from acoustic data, closely tied to cardiovascular and respiratory patterns, making our approach apart in its departure from conventional audio feature extraction practices. We use Power Spectral Density analysis and filtering techniques to select Intrinsic Mode Functions (IMFs) strongly correlated with underlying physiological phenomena. These biosignals undergo a comprehensive feature extraction process for predictive modeling. Initially, we deploy a binary classification model that demonstrates a balanced accuracy of 87% in distinguishing between healthy and diseased individuals. Subsequently, we employ a six-class classification model that achieves a balanced accuracy of 72% in diagnosing specific respiratory conditions like pneumonia and chronic obstructive pulmonary disease (COPD). For the first time, we also introduce regression models that estimate age and body mass index (BMI) based solely on acoustic data, as well as a model for gender classification. Our findings underscore the potential of this approach to significantly enhance assistive and remote diagnostic capabilities.
摘要
首先,我们部署了一个二分类模型,其在健康和疾病个体之间具有87%的平衡准确率。然后,我们使用六类分类模型,其在诊断特定的呼吸疾病,如肺炎和慢性呼吸疾病(COPD)时, achieve a balanced accuracy of 72%。此外,我们还引入了年龄和体重指数(BMI)基于呼吸数据solely的回归模型,以及一个性别分类模型。我们的发现表明,这种方法可以备受提高辅助和远程诊断能力。
Adapt and Diffuse: Sample-adaptive Reconstruction via Latent Diffusion Models
results: 本研究的实验结果表明,采用估算受损度的严重程度来适应不同样本的受损程度可以提高逆Problem的解决效率和计算效率,并且与现有的扩展推理方法相比,提高了解决效率和计算效率。Abstract
Inverse problems arise in a multitude of applications, where the goal is to recover a clean signal from noisy and possibly (non)linear observations. The difficulty of a reconstruction problem depends on multiple factors, such as the structure of the ground truth signal, the severity of the degradation, the implicit bias of the reconstruction model and the complex interactions between the above factors. This results in natural sample-by-sample variation in the difficulty of a reconstruction task, which is often overlooked by contemporary techniques. Recently, diffusion-based inverse problem solvers have established new state-of-the-art in various reconstruction tasks. However, they have the drawback of being computationally prohibitive. Our key observation in this paper is that most existing solvers lack the ability to adapt their compute power to the difficulty of the reconstruction task, resulting in long inference times, subpar performance and wasteful resource allocation. We propose a novel method that we call severity encoding, to estimate the degradation severity of noisy, degraded signals in the latent space of an autoencoder. We show that the estimated severity has strong correlation with the true corruption level and can give useful hints at the difficulty of reconstruction problems on a sample-by-sample basis. Furthermore, we propose a reconstruction method based on latent diffusion models that leverages the predicted degradation severities to fine-tune the reverse diffusion sampling trajectory and thus achieve sample-adaptive inference times. We utilize latent diffusion posterior sampling to maintain data consistency with observations. We perform experiments on both linear and nonlinear inverse problems and demonstrate that our technique achieves performance comparable to state-of-the-art diffusion-based techniques, with significant improvements in computational efficiency.
摘要
<>将文本翻译成简化中文。<>逆 проблеme 在多种应用中出现,其目标是从噪声和可能非线性的观测数据中回归到清晰的信号。逆回归问题的difficulty取决于多种因素,如真实的地面信号结构、观测数据的严重程度、逆回归模型的隐式偏见以及这些因素之间的复杂互动。这导致每个样本的逆回归任务的difficulty具有自然的样本差异,通常被当前技术所忽略。在这篇论文中,我们提出了一种新的方法,称为严重编码,以估计噪声损坏的严重程度。我们发现,这个估计的严重程度与真实的损坏水平具有强相关性,并且可以为逆回归任务的样本差异提供有用的提示。此外,我们提出了基于扩散模型的逆回归方法,利用预测的损坏严重程度来细化反扩散抽样 trajectory,以实现样本适应的计算效率。我们使用扩散 posterior 抽样来保持数据的一致性与观测数据。我们在线性和非线性逆回归问题上进行了实验,并证明了我们的技术与当前扩散基于技术相比,可以实现类似的性能,同时具有显著的计算效率提升。
PCN: A Deep Learning Approach to Jet Tagging Utilizing Novel Graph Construction Methods and Chebyshev Graph Convolutions
results: 这个论文的实验结果表明,PCN可以提高jet标记率,并且比现有的标记器更高。这个研究开启了将图基的表示方法和ChebConv层应用于高能物理实验的可能性。Abstract
Jet tagging is a classification problem in high-energy physics experiments that aims to identify the collimated sprays of subatomic particles, jets, from particle collisions and tag them to their emitter particle. Advances in jet tagging present opportunities for searches of new physics beyond the Standard Model. Current approaches use deep learning to uncover hidden patterns in complex collision data. However, the representation of jets as inputs to a deep learning model have been varied, and often, informative features are withheld from models. In this study, we propose a graph-based representation of a jet that encodes the most information possible. To learn best from this representation, we design Particle Chebyshev Network (PCN), a graph neural network (GNN) using Chebyshev graph convolutions (ChebConv). ChebConv has been demonstrated as an effective alternative to classical graph convolutions in GNNs and has yet to be explored in jet tagging. PCN achieves a substantial improvement in accuracy over existing taggers and opens the door to future studies into graph-based representations of jets and ChebConv layers in high-energy physics experiments. Code is available at https://github.com/YVSemlani/PCN-Jet-Tagging.
摘要
高能物理实验中的喷气标记是一种分类问题,旨在从素子反应中检测和标识喷气,这些喷气是由素子产生的。随着喷气标记的进步,开放了新物理学之外的搜索。现有方法使用深度学习来探索复杂的喷气数据中的隐藏模式。然而,喷气被输入到深度学习模型中的表示方法各异,经常会排除有用的特征。在这个研究中,我们提议一种图格基的喷气表示方法,该方法可以最大化喷气中的信息。为了利用这种表示方法,我们设计了Particle Chebyshev Network(PCN),这是一种使用Chebychev图 convolution(ChebConv)的图神经网络(GNN)。ChebConv已经证明是传统图 convolution的有效替代方案,尚未在喷气标记中使用。PCN实现了与现有标记器相比的显著改善,打开了将来研究图基的喷气表示方法和ChebConv层在高能物理实验中的大门。代码可以在https://github.com/YVSemlani/PCN-Jet-Tagging上获取。
Sleep Stage Classification Using a Pre-trained Deep Learning Model
results: 这个模型在一个公开available的数据集“Sleep-EDF20”上取得了86.97%的准确性,比其他已知的模型更高。其中在阶段N1上的准确性为56.4%,也比其他模型更高。这些发现表明这个模型有 potential to achieve better results for the treatment of sleep disorders.Abstract
One of the common human diseases is sleep disorders. The classification of sleep stages plays a fundamental role in diagnosing sleep disorders, monitoring treatment effectiveness, and understanding the relationship between sleep stages and various health conditions. A precise and efficient classification of these stages can significantly enhance our understanding of sleep-related phenomena and ultimately lead to improved health outcomes and disease treatment. Models others propose are often time-consuming and lack sufficient accuracy, especially in stage N1. The main objective of this research is to present a machine-learning model called "EEGMobile". This model utilizes pre-trained models and learns from electroencephalogram (EEG) spectrograms of brain signals. The model achieved an accuracy of 86.97% on a publicly available dataset named "Sleep-EDF20", outperforming other models proposed by different researchers. Moreover, it recorded an accuracy of 56.4% in stage N1, which is better than other models. These findings demonstrate that this model has the potential to achieve better results for the treatment of this disease.
摘要
一种常见的人类疾病是睡眠障碍。睡眠阶段的分类扮演着基本的角色在诊断睡眠障碍、监测治疗效果和理解各种健康状况之间的关系。一个精准和高效的分类方法可以有效提高我们对睡眠相关现象的理解,从而导致改善健康结果和疾病治疗。其他研究人员提出的模型经常占用时间和缺乏准确性,尤其是N1阶段。本研究的主要目标是提出一种名为"EEGMobile"的机器学习模型,该模型利用预训练模型和电энцефаogram(EEG)spectrogram的脑信号学习。该模型在公共数据集"Sleep-EDF20"上 achievied an accuracy of 86.97%, outperforming other models proposed by different researchers. Furthermore, it recorded an accuracy of 56.4% in stage N1, which is better than other models. These findings demonstrate that this model has the potential to achieve better results for the treatment of this disease.
$G$-Mapper: Learning a Cover in the Mapper Construction
results: 实验表明,该算法可以生成高质量的覆盖,使得Mapper图能够 retain the essence of the datasets。Abstract
The Mapper algorithm is a visualization technique in topological data analysis (TDA) that outputs a graph reflecting the structure of a given dataset. The Mapper algorithm requires tuning several parameters in order to generate a "nice" Mapper graph. The paper focuses on selecting the cover parameter. We present an algorithm that optimizes the cover of a Mapper graph by splitting a cover repeatedly according to a statistical test for normality. Our algorithm is based on $G$-means clustering which searches for the optimal number of clusters in $k$-means by conducting iteratively the Anderson-Darling test. Our splitting procedure employs a Gaussian mixture model in order to choose carefully the cover based on the distribution of a given data. Experiments for synthetic and real-world datasets demonstrate that our algorithm generates covers so that the Mapper graphs retain the essence of the datasets.
摘要
“Mapper”算法是一种数据可视化技术,用于批处理数据分析(TDA)中的数据结构。“Mapper”算法需要调整一些参数,以生成一个“好看”的图表。本文主要关注选择“覆盖”参数。我们提出了一种基于$G$-mean clustering的算法,通过每次进行 Anderson-Darling 测试来选择最佳分区。我们的分割过程使用 Gaussian mixture model,以便选择基于数据的分布来选择覆盖。我们的实验表明,我们的算法可以生成一个保留数据的本质的 Mapper 图。
Epistemic Modeling Uncertainty of Rapid Neural Network Ensembles for Adaptive Learning
methods: 这个方法使用多元数据来源,包括不同的随机初始化模型。 ensemble of 模型 realizations 用于评估因为缺乏训练数据而导致的模型建构不确定性。
results: 这个研究发现,使用rapid neural network paradigm可以实现快速的模型训练,而无需损失预测精度。 在多个分析例子中,以及一个实际的 hypersonic vehicle 飞行 Parameters 研究中,提出了一种内置模拟器的新型类型。Abstract
Emulator embedded neural networks, which are a type of physics informed neural network, leverage multi-fidelity data sources for efficient design exploration of aerospace engineering systems. Multiple realizations of the neural network models are trained with different random initializations. The ensemble of model realizations is used to assess epistemic modeling uncertainty caused due to lack of training samples. This uncertainty estimation is crucial information for successful goal-oriented adaptive learning in an aerospace system design exploration. However, the costs of training the ensemble models often become prohibitive and pose a computational challenge, especially when the models are not trained in parallel during adaptive learning. In this work, a new type of emulator embedded neural network is presented using the rapid neural network paradigm. Unlike the conventional neural network training that optimizes the weights and biases of all the network layers by using gradient-based backpropagation, rapid neural network training adjusts only the last layer connection weights by applying a linear regression technique. It is found that the proposed emulator embedded neural network trains near-instantaneously, typically without loss of prediction accuracy. The proposed method is demonstrated on multiple analytical examples, as well as an aerospace flight parameter study of a generic hypersonic vehicle.
摘要
<>转换文本到简化中文。<>模拟器内置神经网络,它是physics informed neural network的一种类型,利用多种数据来源进行航空工程系统设计的有效探索。多个神经网络模型实现被不同的随机初始化训练。ensemble的模型实现用于评估因缺乏训练样本而引起的epistemic模型不确定性。这种不确定性评估是成功目标适应学arning的关键信息。然而,训练ensemble模型的成本经常成为计算挑战,尤其当模型在适应学习过程中不在平行进行训练。在这种情况下,一种新的模拟器内置神经网络方法被提出,使用rapid neural network paradigm。与传统神经网络训练不同,这种训练方法只是将最后层连接权重调整,通过应用线性回归技术。发现,提议的模拟器内置神经网络在几乎实时内训练,通常无损失预测精度。这种方法在多个分析例子中进行了证明,以及一个涉及到一个通用 hypersonic 飞行器的航空飞行参数研究。
A Sequentially Fair Mechanism for Multiple Sensitive Attributes
paper_authors: François Hu, Philipp Ratz, Arthur Charpentier for: 这个论文的目的是为了减少敏感变量和相应分数之间的关系,以解决多重敏感特征情况下的内部平衡问题。methods: 这个论文使用了多重条件 Wasserstein 中心来扩展了传统的强人口平衡定义,以便在多重敏感特征情况下进行平衡。这种方法提供了一个关键的关系解释,允许进行精确的不公平预测器。results: 这个论文的实验结果显示,这种方法可以有效地减少多重敏感特征情况下的不公平情况。此外,这种方法还可以让敏感特征之间的联乘关系获得明确的解释。Abstract
In the standard use case of Algorithmic Fairness, the goal is to eliminate the relationship between a sensitive variable and a corresponding score. Throughout recent years, the scientific community has developed a host of definitions and tools to solve this task, which work well in many practical applications. However, the applicability and effectivity of these tools and definitions becomes less straightfoward in the case of multiple sensitive attributes. To tackle this issue, we propose a sequential framework, which allows to progressively achieve fairness across a set of sensitive features. We accomplish this by leveraging multi-marginal Wasserstein barycenters, which extends the standard notion of Strong Demographic Parity to the case with multiple sensitive characteristics. This method also provides a closed-form solution for the optimal, sequentially fair predictor, permitting a clear interpretation of inter-sensitive feature correlations. Our approach seamlessly extends to approximate fairness, enveloping a framework accommodating the trade-off between risk and unfairness. This extension permits a targeted prioritization of fairness improvements for a specific attribute within a set of sensitive attributes, allowing for a case specific adaptation. A data-driven estimation procedure for the derived solution is developed, and comprehensive numerical experiments are conducted on both synthetic and real datasets. Our empirical findings decisively underscore the practical efficacy of our post-processing approach in fostering fair decision-making.
摘要
通常的用例中的算法公平 goal 是消除敏感变量和相应的分数之间的关系。在过去的几年中,科学社区已经开发出了许多定义和工具来解决这个问题,这些工具在许多实际应用中表现良好。然而,在多个敏感特征的情况下,这些工具和定义的可用性和效果变得更加复杂。为了解决这个问题,我们提出了一个顺序框架,可以逐步实现公平性 across a set of sensitive features。我们实现这一点通过利用多个敏感特征的 Wasserstein 多重中心,该扩展了标准的强人口平衡定义到多个敏感特征的情况。这种方法还提供了一个关闭式解的优化、公平预测器,允许明确地理解敏感特征之间的相互关系。我们的方法自然地扩展到近似公平性,包括一个折衔策略,可以根据特定敏感特征内的减少不公平性进行目标化优化。这种扩展允许特定敏感特征的公平改进,使得可以实现案例特定的适应。我们发展了一种数据驱动的估计过程,并在synthetic和实际数据集上进行了广泛的数值实验。我们的实验证明了我们的后处理方法在不公平决策中具有实际的有效性。
On the Contraction Coefficient of the Schrödinger Bridge for Stochastic Linear Systems
methods: 使用Contractive fixed point recursions方法来数学解Schrödinger bridge问题,这种方法可以看作是动态版本的Sinkhorn迭代,并且在某些假设下可以 garantizar线性减少。
results: 研究Schrödinger系统的前提估计,并提供新的几何和控制理论 интерпретаciones。此外,还指出了可以通过预处理终点支持集来提高worst-case contraction coefficient的计算。Abstract
Schr\"{o}dinger bridge is a stochastic optimal control problem to steer a given initial state density to another, subject to controlled diffusion and deadline constraints. A popular method to numerically solve the Schr\"{o}dinger bridge problems, in both classical and in the linear system settings, is via contractive fixed point recursions. These recursions can be seen as dynamic versions of the well-known Sinkhorn iterations, and under mild assumptions, they solve the so-called Schr\"{o}dinger systems with guaranteed linear convergence. In this work, we study a priori estimates for the contraction coefficients associated with the convergence of respective Schr\"{o}dinger systems. We provide new geometric and control-theoretic interpretations for the same. Building on these newfound interpretations, we point out the possibility of improved computation for the worst-case contraction coefficients of linear SBPs by preconditioning the endpoint support sets.
摘要
Schrödinger 桥是一个 Stochastic Optimal Control 问题,旨在从一个初始状态密度到另一个,并且受到控制的扩散和时间限制。一种广泛使用的方法来数学解决 Schrödinger 桥问题是通过收缩的定点反射,这些反射可以看作是动态版本的 Sinkhorn 迭代。在我们的研究中,我们研究了 Schrödinger 系统的前期估计,并提供了新的几何和控制理论解释。我们还指出了可以通过预处理终点支持集来提高 worst-case 收缩系数的计算。
The Grand Illusion: The Myth of Software Portability and Implications for ML Progress
results: 我们的结果显示,主流机器学习软件框架在不同硬件类型上可能会产生更多于40%的关键功能损失,并且即使功能可以移植,其性能下降也可能是极大的,使得性能不可接受。Abstract
Pushing the boundaries of machine learning often requires exploring different hardware and software combinations. However, the freedom to experiment across different tooling stacks can be at odds with the drive for efficiency, which has produced increasingly specialized AI hardware and incentivized consolidation around a narrow set of ML frameworks. Exploratory research can be restricted if software and hardware are co-evolving, making it even harder to stray away from mainstream ideas that work well with popular tooling stacks. While this friction increasingly impacts the rate of innovation in machine learning, to our knowledge the lack of portability in tooling has not been quantified. In this work, we ask: How portable are popular ML software frameworks? We conduct a large-scale study of the portability of mainstream ML frameworks across different hardware types. Our findings paint an uncomfortable picture -- frameworks can lose more than 40% of their key functions when ported to other hardware. Worse, even when functions are portable, the slowdown in their performance can be extreme and render performance untenable. Collectively, our results reveal how costly straying from a narrow set of hardware-software combinations can be - and suggest that specialization of hardware impedes innovation in machine learning research.
摘要
推动机器学学科的前沿研究经常需要尝试不同的硬件和软件组合。然而,在效率的推动下,AI硬件的特циали化和主流ML框架的吸引力使得尝试新的想法和工具栈的探索变得更加困难。探索性研究可能会受到软硬件的共演化限制,使得尝试离开主流想法和工具栈更加困难。这种阻力在机器学学科的创新速度中产生了一定的影响,但到目前为止,工具栈的无法移植的问题尚未得到量化的解决。在这项工作中,我们问:流行的ML软件框架是否可以具有可移植性?我们进行了大规模的流行ML框架在不同硬件类型上的可移植性研究。我们的结果表现出一个不适的情况:框架在其他硬件上转移时可能会产生超过40%的关键功能损失。更糟糕的是,即使功能是可移植的,其性能下降可能会非常大,使其性能不可接受。总的来说,我们的结果表明特циали化的硬件对机器学学科的创新带来了成本,并建议特циали化的硬件阻碍了机器学学科的进步。
Unsupervised Learning of Nanoindentation Data to Infer Microstructural Details of Complex Materials
results: 研究结果显示,采用不supervised学习技术可以有效地分析机械性能数据,并确定了机械阶段的数量。此外,通过cross-validation方法,研究人员还能够评估数据的充足性,并建议数据的充足量为可靠预测所需。Abstract
In this study, Cu-Cr composites were studied by nanoindentation. Arrays of indents were placed over large areas of the samples resulting in datasets consisting of several hundred measurements of Young's modulus and hardness at varying indentation depths. The unsupervised learning technique, Gaussian mixture model, was employed to analyze the data, which helped to determine the number of "mechanical phases" and the respective mechanical properties. Additionally, a cross-validation approach was introduced to infer whether the data quantity was adequate and to suggest the amount of data required for reliable predictions -- one of the often encountered but difficult to resolve issues in machine learning of materials science problems.
摘要
在这项研究中,氧化铜-镍复合材料被使用nano indent方法进行研究。数组式的 indent 被置于样品表面上,从而生成了包含多个百度测量 Young's modulus 和硬度的数据集。我们使用了无监督学习技术 Gaussian mixture model 来分析数据,以确定机械相的数量和相应的机械性质。此外,我们还提出了一种cross-validation方法,以判断数据量是否充分,并建议数据的充分量是否可靠的预测。这是机器学习材料科学问题中经常遇到,但很难解决的问题。
Reasoning with Latent Diffusion in Offline Reinforcement Learning
results: 这篇论文的实验结果显示,使用该方法可以在D4RL测试集上达到最佳性能,特别是在长期、罕见奖励任务中表现出色。Abstract
Offline reinforcement learning (RL) holds promise as a means to learn high-reward policies from a static dataset, without the need for further environment interactions. However, a key challenge in offline RL lies in effectively stitching portions of suboptimal trajectories from the static dataset while avoiding extrapolation errors arising due to a lack of support in the dataset. Existing approaches use conservative methods that are tricky to tune and struggle with multi-modal data (as we show) or rely on noisy Monte Carlo return-to-go samples for reward conditioning. In this work, we propose a novel approach that leverages the expressiveness of latent diffusion to model in-support trajectory sequences as compressed latent skills. This facilitates learning a Q-function while avoiding extrapolation error via batch-constraining. The latent space is also expressive and gracefully copes with multi-modal data. We show that the learned temporally-abstract latent space encodes richer task-specific information for offline RL tasks as compared to raw state-actions. This improves credit assignment and facilitates faster reward propagation during Q-learning. Our method demonstrates state-of-the-art performance on the D4RL benchmarks, particularly excelling in long-horizon, sparse-reward tasks.
摘要
<>translate "Offline reinforcement learning (RL) holds promise as a means to learn high-reward policies from a static dataset, without the need for further environment interactions. However, a key challenge in offline RL lies in effectively stitching portions of suboptimal trajectories from the static dataset while avoiding extrapolation errors arising due to a lack of support in the dataset. Existing approaches use conservative methods that are tricky to tune and struggle with multi-modal data (as we show) or rely on noisy Monte Carlo return-to-go samples for reward conditioning. In this work, we propose a novel approach that leverages the expressiveness of latent diffusion to model in-support trajectory sequences as compressed latent skills. This facilitates learning a Q-function while avoiding extrapolation error via batch-constraining. The latent space is also expressive and gracefully copes with multi-modal data. We show that the learned temporally-abstract latent space encodes richer task-specific information for offline RL tasks as compared to raw state-actions. This improves credit assignment and facilitates faster reward propagation during Q-learning. Our method demonstrates state-of-the-art performance on the D4RL benchmarks, particularly excelling in long-horizon, sparse-reward tasks." into Simplified Chinese.下面是文本的中文翻译:<>offline reinforcement learning (RL) 承袭了从静态数据集中学习高奖策略的承袭,无需进一步与环境进行交互。然而,offline RL 中的一个关键挑战在于有效地将静态数据集中的部分亚优轨迹串连起来,而不导致因数据集缺失支持而产生的推断错误。现有的方法通常使用保守的方法,困难调整并在多模态数据上异常表现(我们显示),或者基于噪声 Monte Carlo 返回样本进行奖励条件。在这种情况下,我们提出了一种新的方法,利用干扰扩散的表达能力来模型在支持轨迹序列上的压缩 latent 技巧。这使得学习 Q-函数时避免推断错误,并通过批量约束来实现。干扰扩散空间也具有表达能力,可以gracefully 处理多模态数据。我们显示,学习的时间抽象 latent 空间中包含更加任务特定的信息,比raw state-action 更加丰富,这使得奖励赋权更加准确,并促进了更快的奖励传递。我们的方法在 D4RL 标准准测试上达到了领先的性能,特别是在长远、稀热奖 Task 上。
Optimal and Fair Encouragement Policy Evaluation and Learning
results: 这篇论文的结果表明,在不可预知的人们是否遵循治疗建议的情况下,可以使用 constrained 优化方法来设计优化的治疗方案,以便实现最大化 causal 结果。Abstract
In consequential domains, it is often impossible to compel individuals to take treatment, so that optimal policy rules are merely suggestions in the presence of human non-adherence to treatment recommendations. In these same domains, there may be heterogeneity both in who responds in taking-up treatment, and heterogeneity in treatment efficacy. While optimal treatment rules can maximize causal outcomes across the population, access parity constraints or other fairness considerations can be relevant in the case of encouragement. For example, in social services, a persistent puzzle is the gap in take-up of beneficial services among those who may benefit from them the most. When in addition the decision-maker has distributional preferences over both access and average outcomes, the optimal decision rule changes. We study causal identification, statistical variance-reduced estimation, and robust estimation of optimal treatment rules, including under potential violations of positivity. We consider fairness constraints such as demographic parity in treatment take-up, and other constraints, via constrained optimization. Our framework can be extended to handle algorithmic recommendations under an often-reasonable covariate-conditional exclusion restriction, using our robustness checks for lack of positivity in the recommendation. We develop a two-stage algorithm for solving over parametrized policy classes under general constraints to obtain variance-sensitive regret bounds. We illustrate the methods in two case studies based on data from randomized encouragement to enroll in insurance and from pretrial supervised release with electronic monitoring.
摘要
在重要领域中,常常无法强制个人接受治疗,因此优化的政策规则仅仅是建议在人类不遵循治疗建议的情况下。在这些同时,可能存在人群响应不同,以及治疗效果的差异。优化的治疗规则可以最大化人口级别的 causal 结果,但是访问平等约束或其他公平考虑可能在鼓励中发挥作用。例如,在社会服务中,一个持续的谜题是养成很好的服务的投入率不均匀分布。当决策者有访问和平均结果的分布预期时,优化的决策规则会改变。我们研究 causal 验证、统计减少的估计和robust estimation of 优化的治疗规则,包括在可能的非正式性下进行验证。我们考虑公平约束,如人群响应不同和其他约束,通过受限优化来实现。我们的框架可以扩展到处理算法建议在conditionally reasonable covariate-conditional exclusion restriction下进行验证,使用我们的robustness检查来检测非正式性。我们开发了一种两个参数的算法,用于在总体约束下解决参数化政策类型的问题,以获得变量敏感的 regret bound。我们在两个案例研究中应用了这些方法,基于随机鼓励保险和预 trial supervised release with electronic monitoring。
for: investigate the local convergence characteristics of Model-agnostic Meta-learning (MAML) in linear system quadratic optimal control (LQR)
methods: uses MAML and its variations, with theoretical guarantees provided for the local convergence of the algorithm
results: presents simple numerical results to demonstrate the convergence properties of MAML in LQR tasksHere’s the format you requested:
for: <what are the paper written for?>
methods: <what methods the paper use?>
results: <what results the paper get?>I hope that helps!Abstract
The main objective of this research paper is to investigate the local convergence characteristics of Model-agnostic Meta-learning (MAML) when applied to linear system quadratic optimal control (LQR). MAML and its variations have become popular techniques for quickly adapting to new tasks by leveraging previous learning knowledge in areas like regression, classification, and reinforcement learning. However, its theoretical guarantees remain unknown due to non-convexity and its structure, making it even more challenging to ensure stability in the dynamic system setting. This study focuses on exploring MAML in the LQR setting, providing its local convergence guarantees while maintaining the stability of the dynamical system. The paper also presents simple numerical results to demonstrate the convergence properties of MAML in LQR tasks.
摘要
主要目标 OF 这个研究论文是 investigate Model-agnostic Meta-learning(MAML)在线性系统弗 quadratic 优化控制(LQR)中的本地叉�强特性。 MAML 和其变种在领域如回归、分类和 reinforcement learning 中利用 previous learning knowledge 快速适应新任务,但其理论保证仍然不清楚,因为非拟合性和结构,使得在动态系统设置下稳定性更加挑战。 这个研究集中关注 MAML 在 LQR 设置中的本地叉�强特性,提供了本地叉�强保证,同时保证动态系统的稳定性。 论文还展示了简单的数据结果,以证明 MAML 在 LQR 任务中的 converges 性质。Here's the text with some additional information about the translation:I used Google Translate to translate the text into Simplified Chinese. However, please note that the translation may not be perfect and may require some adjustments to ensure accuracy. Additionally, the translation may not capture the exact nuances and idiomatic expressions of the original text, so some phrasing and wording may be different from the original.
Explainable Graph Neural Network for Alzheimer’s Disease And Related Dementias Risk Prediction
paper_authors: Xinyue Hu, Zenan Sun, Yi Nian, Yifang Dang, Fang Li, Jingna Feng, Evan Yu, Cui Tao for:* 这个研究旨在提高了阿尔ツ海默症和相关失智症(ADRD)的风险预测,使用机器学习和养成数据。methods:* 这个研究使用了变换正则化编码器-解码器图ael neural network(VGNN)来估算ADRD的可能性。results:* VGNN比Random Forest和Light Gradient Boost Machine基线模型高出10%的 receiver operating characteristic 面积。In simplified Chinese:for:* 这个研究旨在提高了阿尔ツ海默症和相关失智症(ADRD)的风险预测,使用机器学习和养成数据。methods:* 这个研究使用了变换正则化编码器-解码器图ael neural network(VGNN)来估算ADRD的可能性。results:* VGNN比Random Forest和Light Gradient Boost Machine基线模型高出10%的 receiver operating characteristic 面积。Abstract
Alzheimer's disease and related dementias (ADRD) ranks as the sixth leading cause of death in the US, underlining the importance of accurate ADRD risk prediction. While recent advancement in ADRD risk prediction have primarily relied on imaging analysis, yet not all patients undergo medical imaging before an ADRD diagnosis. Merging machine learning with claims data can reveal additional risk factors and uncover interconnections among diverse medical codes. Our goal is to utilize Graph Neural Networks (GNNs) with claims data for ADRD risk prediction. Addressing the lack of human-interpretable reasons behind these predictions, we introduce an innovative method to evaluate relationship importance and its influence on ADRD risk prediction, ensuring comprehensive interpretation. We employed Variationally Regularized Encoder-decoder Graph Neural Network (VGNN) for estimating ADRD likelihood. We created three scenarios to assess the model's efficiency, using Random Forest and Light Gradient Boost Machine as baselines. We further used our relation importance method to clarify the key relationships for ADRD risk prediction. VGNN surpassed other baseline models by 10% in the area under the receiver operating characteristic. The integration of the GNN model and relation importance interpretation could potentially play an essential role in providing valuable insight into factors that may contribute to or delay ADRD progression. Employing a GNN approach with claims data enhances ADRD risk prediction and provides insights into the impact of interconnected medical code relationships. This methodology not only enables ADRD risk modeling but also shows potential for other image analysis predictions using claims data.
摘要
供应链疾病(ADRD)是美国第六大死因之一,因此精准预测ADRD风险的重要性。Recent Advances in ADRD风险预测主要基于成像分析,但不 все患者会在ADRD诊断之前进行成像检查。通过将机器学习与保险数据结合,可以揭示更多的风险因素和不同的医疗代码之间的联系。我们的目标是使用图ael Neural Networks(GNNs)与保险数据进行ADRD风险预测。为了解释预测结果中的人类可读性原因,我们引入了关系重要性评估方法。我们使用了Variational Regularized Encoder-decoder Graph Neural Network(VGNN)来估计ADRD可能性。我们创建了三个场景来评估模型的效率,使用Random Forest和Light Gradient Boost Machine作为基线。我们还使用我们的关系重要性评估方法来解释ADRD风险预测中的关键关系。VGNN比基线模型提高了10%的接收操作特征曲线地区 beneath。将GNN模型和关系重要性评估结合可能会在提供ADRD进程中的价值预测和解释方面发挥关键作用。使用GNN方法和保险数据可以提高ADRD风险预测,同时提供成像分析预测中的可能性。这种方法不仅可以用于ADRD风险预测,还可以应用于其他成像分析预测中。
Electron Energy Regression in the CMS High-Granularity Calorimeter Prototype
results: 该论文通过使用机器学习方法,成功地重建 incident electrons 的能量,并公开发布了这些数据,以便促进相关领域的研究和应用。Abstract
We present a new publicly available dataset that contains simulated data of a novel calorimeter to be installed at the CERN Large Hadron Collider. This detector will have more than six-million channels with each channel capable of position, ionisation and precision time measurement. Reconstructing these events in an efficient way poses an immense challenge which is being addressed with the latest machine learning techniques. As part of this development a large prototype with 12,000 channels was built and a beam of high-energy electrons incident on it. Using machine learning methods we have reconstructed the energy of incident electrons from the energies of three-dimensional hits, which is known to some precision. By releasing this data publicly we hope to encourage experts in the application of machine learning to develop efficient and accurate image reconstruction of these electrons.
摘要
我们现在公布了一个新的公共可用数据集,该数据集包含 simulate 的加速器中的一种新的冷却器,该冷却器将有超过六百万个通道,每个通道都可以测量位置、离子化和精度时间。重建这些事件的方式具有极大的挑战,我们正在使用最新的机器学习技术来解决这个问题。在这个开发过程中,我们建立了一个大型原型,该原型有12,000个通道,并使用高能电子束照射它。使用机器学习方法,我们已经从三维hit的能量中重建了入射电子的能量,这已知道一定的精度。我们通过公布这些数据希望能吸引专家在机器学习应用中发展高效和准确的图像重建技术。
Promises of Deep Kernel Learning for Control Synthesis
paper_authors: Robert Reed, Luca Laurenti, Morteza Lahijanian
for: 学习和控制复杂动力系统
methods: 使用深度kernel学习 combines with Gaussian Processes, 并使用抽象基础框架进行控制synthesis
results: 对各种准确性要求的 benchmark 进行了实验,显示了与现有竞争方法相比,控制synthesis with DKL 可以具有显著性能优势。Abstract
Deep Kernel Learning (DKL) combines the representational power of neural networks with the uncertainty quantification of Gaussian Processes. Hence, it is potentially a promising tool to learn and control complex dynamical systems. In this work, we develop a scalable abstraction-based framework that enables the use of DKL for control synthesis of stochastic dynamical systems against complex specifications. Specifically, we consider temporal logic specifications and create an end-to-end framework that uses DKL to learn an unknown system from data and formally abstracts the DKL model into an Interval Markov Decision Process (IMDP) to perform control synthesis with correctness guarantees. Furthermore, we identify a deep architecture that enables accurate learning and efficient abstraction computation. The effectiveness of our approach is illustrated on various benchmarks, including a 5-D nonlinear stochastic system, showing how control synthesis with DKL can substantially outperform state-of-the-art competitive methods.
摘要
深度kernel学习(DKL)结合神经网络的表达能力和高斯过程的不确定性评估,因此它可能是控制复杂动态系统的有望工具。在这项工作中,我们开发了可扩展的抽象基础框架,使得使用DKL进行动态系统的控制合成 against complex specs 可能。特别是,我们考虑了时间逻辑特性规定,并创建了一个终端框架,使用DKL来从数据中学习未知系统,并正确地抽象DKL模型为一个Interval Markov Decision Process(IMDP)来进行控制合成。此外,我们确定了一种深度架构,使得准确地学习和高效地抽象计算。我们的方法的有效性被证明在多个标准准例上,包括一个5D非线性随机系统,显示了使用DKL进行控制合成可以大幅超越现有竞争方法。
MELAGE: A purely python based Neuroimaging software (Neonatal)
results: MELAGE软件可以快速处理和分析医疗图像,并且具有多种功能,如动态3D视化、准确的测量和交互式图像分割。这个软件在医学成像领域中具有广泛的应用前景和潜在的推广前景。Abstract
MELAGE, a pioneering Python-based neuroimaging software, emerges as a versatile tool for the visualization, processing, and analysis of medical images. Initially conceived to address the unique challenges of processing 3D ultrasound and MRI brain images during the neonatal period, MELAGE exhibits remarkable adaptability, extending its utility to the domain of adult human brain imaging. At its core, MELAGE features a semi-automatic brain extraction tool empowered by a deep learning module, ensuring precise and efficient brain structure extraction from MRI and 3D Ultrasound data. Moreover, MELAGE offers a comprehensive suite of features, encompassing dynamic 3D visualization, accurate measurements, and interactive image segmentation. This transformative software holds immense promise for researchers and clinicians, offering streamlined image analysis, seamless integration with deep learning algorithms, and broad applicability in the realm of medical imaging.
摘要
美laps, a pioneering Python-based neuroimaging software, emerges as a versatile tool for the visualization, processing, and analysis of medical images. Initially conceived to address the unique challenges of processing 3D ultrasound and MRI brain images during the neonatal period, MELAGE exhibits remarkable adaptability, extending its utility to the domain of adult human brain imaging. At its core, MELAGE features a semi-automatic brain extraction tool empowered by a deep learning module, ensuring precise and efficient brain structure extraction from MRI and 3D Ultrasound data. Moreover, MELAGE offers a comprehensive suite of features, encompassing dynamic 3D visualization, accurate measurements, and interactive image segmentation. This transformative software holds immense promise for researchers and clinicians, offering streamlined image analysis, seamless integration with deep learning algorithms, and broad applicability in the realm of medical imaging.Here's the text with some additional information about the Simplified Chinese translation:The Simplified Chinese translation of the text uses the traditional Chinese characters for "MELAGE" (美laps), which is the pinyin Romanization of the name. The translation is written in Simplified Chinese, which is the standard form of Chinese used in mainland China and Singapore.In the translation, I have used the phrase "美laps" to refer to the software, as this is the name that is commonly used in the field of neuroimaging. I have also used the phrase " neural imaging" (神经成像) to refer to the broader field of medical imaging, as this is the term that is commonly used in China.Additionally, I have used the phrase " semi-automatic brain extraction tool" (半自动的脑EXTRACTION工具) to refer to the software's ability to automatically extract brain structures from medical images. I have also used the phrase " deep learning module" (深度学习模块) to refer to the software's use of machine learning algorithms to improve its performance.Overall, the translation is written in a formal and technical style, using language that is appropriate for a scientific or medical audience.
results: 使命令行界面更加智能和用户友好,开拓了进一步完善和跨平台应用的可能性。Abstract
Developers and data scientists often struggle to write command-line inputs, even though graphical interfaces or tools like ChatGPT can assist. The solution? "ai-cli," an open-source system inspired by GitHub Copilot that converts natural language prompts into executable commands for various Linux command-line tools. By tapping into OpenAI's API, which allows interaction through JSON HTTP requests, "ai-cli" transforms user queries into actionable command-line instructions. However, integrating AI assistance across multiple command-line tools, especially in open source settings, can be complex. Historically, operating systems could mediate, but individual tool functionality and the lack of a unified approach have made centralized integration challenging. The "ai-cli" tool, by bridging this gap through dynamic loading and linking with each program's Readline library API, makes command-line interfaces smarter and more user-friendly, opening avenues for further enhancement and cross-platform applicability.
摘要
开发者和数据科学家经常遇到 command-line 输入的问题,即使使用图形化界面或 ChatGPT 等工具。解决方案是“ ai-cli”,一个基于 GitHub Copilot 的开源系统,可以将自然语言提示转换成多种 Linux 命令行工具的执行命令。通过使用 OpenAI 的 API,可以通过 JSON HTTP 请求进行交互,将用户提问转换成可执行的命令行指令。然而,在多个命令行工具之间集成 AI 帮助,特别是在开源环境下,可能会具有复杂性。历史上,操作系统可以作为中介,但每个工具的功能和缺乏一致的方法使得中央集成变得困难。“ ai-cli” 工具通过在每个程序的 Readline 库 API 上进行动态加载和链接,使命令行界面变得更加智能和用户友好,打开了进一步改进和跨平台应用的可能性。
results: 论文通过了复杂的数据分析和实验研究,证明了TransDRO的可靠性和精准性,并且比传统转移学习方法更快速地调整模型。Abstract
Many existing transfer learning methods rely on leveraging information from source data that closely resembles the target data. However, this approach often overlooks valuable knowledge that may be present in different yet potentially related auxiliary samples. When dealing with a limited amount of target data and a diverse range of source models, our paper introduces a novel approach, Distributionally Robust Optimization for Transfer Learning (TransDRO), that breaks free from strict similarity constraints. TransDRO is designed to optimize the most adversarial loss within an uncertainty set, defined as a collection of target populations generated as a convex combination of source distributions that guarantee excellent prediction performances for the target data. TransDRO effectively bridges the realms of transfer learning and distributional robustness prediction models. We establish the identifiability of TransDRO and its interpretation as a weighted average of source models closest to the baseline model. We also show that TransDRO achieves a faster convergence rate than the model fitted with the target data. Our comprehensive numerical studies and analysis of multi-institutional electronic health records data using TransDRO further substantiate the robustness and accuracy of TransDRO, highlighting its potential as a powerful tool in transfer learning applications.
摘要
许多现有的转移学习方法倾向于利用源数据中与目标数据相似的信息。然而,这种方法经常忽视了可能存在的不同 yet 相关的辅助样本中的有价值知识。当面临有限的目标数据和多种源模型时,我们的论文引入了一种新的方法:分布robust优化 для转移学习(TransDRO)。TransDRO是设计用于最大化内部的 adversarial 损失,而不是仅仅是依靠源数据的相似性。TransDRO 可以减轻转移学习中的压力,并帮助模型更好地适应不同的目标数据。我们证明了TransDRO的可识别性和其作为基准模型 closest 的weighted 平均的解释。此外,我们还证明了TransDRO在比目标数据更快地收敛的特点。我们的全面的数学研究和多所机构电子医疗记录数据使用TransDRO进行了详细的分析,证明了TransDRO在转移学习应用中的可靠性和准确性。
Exploring the Benefits of Differentially Private Pre-training and Parameter-Efficient Fine-tuning for Table Transformers
results: 实验结果表明,使用 PEFT 方法可以在 ACSIncome 数据集上提高下游任务的准确率和数据隐私保护,同时具有更好的参数效率、隐私和准确率之间的平衡。 codes 可以在 github.com/IBM/DP-TabTransformer 上下载。Abstract
For machine learning with tabular data, Table Transformer (TabTransformer) is a state-of-the-art neural network model, while Differential Privacy (DP) is an essential component to ensure data privacy. In this paper, we explore the benefits of combining these two aspects together in the scenario of transfer learning -- differentially private pre-training and fine-tuning of TabTransformers with a variety of parameter-efficient fine-tuning (PEFT) methods, including Adapter, LoRA, and Prompt Tuning. Our extensive experiments on the ACSIncome dataset show that these PEFT methods outperform traditional approaches in terms of the accuracy of the downstream task and the number of trainable parameters, thus achieving an improved trade-off among parameter efficiency, privacy, and accuracy. Our code is available at github.com/IBM/DP-TabTransformer.
摘要
为机器学习 tabular 数据,表Transformer(TabTransformer)是现代神经网络模型的state-of-the-art,而数据隐私(DP)则是保证数据隐私的必要组成部分。在这篇论文中,我们探讨将这两个方面结合在一起的场景——即具有数据隐私的预训练和精度调整 TabTransformers 的方法,包括 Adapter、LoRA 和 Prompt Tuning。我们在 ACSIncome 数据集进行了广泛的实验,发现这些 PEFT 方法在下游任务的准确率和可训练参数数量方面都高于传统方法,从而实现了参数效率、隐私和准确率之间的改善的平衡。我们的代码可以在github.com/IBM/DP-TabTransformer 找到。
A Q-learning Approach for Adherence-Aware Recommendations
results: 我们证明了我们提出的Q学习算法 converges to the optimal value,并评估了它在多种情况下的性能。Abstract
In many real-world scenarios involving high-stakes and safety implications, a human decision-maker (HDM) may receive recommendations from an artificial intelligence while holding the ultimate responsibility of making decisions. In this letter, we develop an "adherence-aware Q-learning" algorithm to address this problem. The algorithm learns the "adherence level" that captures the frequency with which an HDM follows the recommended actions and derives the best recommendation policy in real time. We prove the convergence of the proposed Q-learning algorithm to the optimal value and evaluate its performance across various scenarios.
摘要
在许多高风险高安全性的实际场景中,人工智能推荐(HDM)可能会接收人工智能推荐的建议,而最终负责做出决策。在这封信中,我们开发了一种“遵循程度意识Q学习”算法,以解决这个问题。该算法学习“遵循程度”, capture推荐行为的频率,并在实时 derivates最佳推荐策略。我们证明算法 converges to 优化值,并评估其性能在多种场景中。
Bayesian longitudinal tensor response regression for modeling neuroplasticity
paper_authors: Suprateek Kundu, Alec Reinhardt, Serena Song, M. Lawson Meadows, Bruce Crosson, Venkatagiri Krishnamurthy
For: 这个论文的主要目标是 investigate longitudinal neuroimaging 数据中 voxel 级别的 neural plasticity,并且使用 Bayesian tensor response regression 方法来做这种研究。* Methods: 这个方法使用 Markov chain Monte Carlo (MCMC) 采样来实现,并使用 low-rank decomposition 来降维并保持 voxel 的空间配置。它还可以通过联合可信区间来进行特征选择,以更准确地进行推断。* Results: 这个方法可以在 group 级别和个体级别进行推断,并且可以检测到不同干扰因素对 brain 活动的影响。在应用于一个 longitudinal aphasia 数据集上,这个方法发现,对于 control 治疗,brain 活动在长期内增加,而对于 intention treatment,brain 活动在短期内增加,两者都集中在特定的本地化区域。相比之下,voxel-wise regression 无法检测到任何 significannot neuroplasticity после多重性调整,这是生物学上不可能的和表明缺乏力。Abstract
A major interest in longitudinal neuroimaging studies involves investigating voxel-level neuroplasticity due to treatment and other factors across visits. However, traditional voxel-wise methods are beset with several pitfalls, which can compromise the accuracy of these approaches. We propose a novel Bayesian tensor response regression approach for longitudinal imaging data, which pools information across spatially-distributed voxels to infer significant changes while adjusting for covariates. The proposed method, which is implemented using Markov chain Monte Carlo (MCMC) sampling, utilizes low-rank decomposition to reduce dimensionality and preserve spatial configurations of voxels when estimating coefficients. It also enables feature selection via joint credible regions which respect the shape of the posterior distributions for more accurate inference. In addition to group level inferences, the method is able to infer individual-level neuroplasticity, allowing for examination of personalized disease or recovery trajectories. The advantages of the proposed approach in terms of prediction and feature selection over voxel-wise regression are highlighted via extensive simulation studies. Subsequently, we apply the approach to a longitudinal Aphasia dataset consisting of task functional MRI images from a group of subjects who were administered either a control intervention or intention treatment at baseline and were followed up over subsequent visits. Our analysis revealed that while the control therapy showed long-term increases in brain activity, the intention treatment produced predominantly short-term changes, both of which were concentrated in distinct localized regions. In contrast, the voxel-wise regression failed to detect any significant neuroplasticity after multiplicity adjustments, which is biologically implausible and implies lack of power.
摘要
一个主要兴趣点在长itudinal神经成像研究是调查征量级的神经重塑,因为不同因素的影响。然而,传统的征量级方法存在多种困难,可能会降低准确性。我们提出了一种新的 bayesian tensor response regression方法,用于长itudinal神经成像数据,该方法将信息归一化到空间分布的壳体上,以确定变化的主要因素,并对 covariates 进行补做。我们使用 markov chain Monte Carlo (MCMC) 采样实现该方法,并使用低级别分解来降低维度并保持壳体中各个壳体的空间配置。此外,该方法还允许功能选择,通过共同可信区域来更准确地进行推理。除了群体水平的推理,该方法还可以进行个体水平的神经重塑推理,以检测个体化的疾病或恢复轨迹。我们通过了EXTENSIVE 仪器实验来比较我们的方法与征量级 regression 的优势,并应用到一个长itudinal 语言障碍数据集,该数据集包含一组主动或接受治疗的 subjects 的任务功能 MRI 图像,从baseline开始,并在后续访问中进行追踪。我们的分析发现,控制疗法在长期内表现出增加的脑活动,而意图治疗则在短期内产生主要的变化,这些变化集中在特定的本地化区域。相比之下,征量级 regression 无法检测任何的神经重塑,这是生物学上不可能的,并 implies 缺乏力量。
A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer for Training Neural Networks At-Scale
paper_authors: Hao-Jun Michael Shi, Tsung-Hsien Lee, Shintaro Iwasaki, Jose Gallego-Posada, Zhijing Li, Kaushik Rangadurai, Dheevatsa Mudigere, Michael Rabbat
For: 这篇论文主要探讨了一种基于AdaGrad家族的在线泛化优化算法Shampoo,用于训练神经网络。* Methods: 该算法使用了块对角矩阵预conditioner,其中每个块包含一个粗略 Kronecker产品方法来对神经网络每个参数进行AdaGrad优化。* Results: 作者提供了一个完整的算法描述以及实现方法,并通过对ImageNet ResNet50进行ablation研究,证明了Shampoo比标准的 diagonally-scaling-based adaptive gradient方法具有更好的性能。Abstract
Shampoo is an online and stochastic optimization algorithm belonging to the AdaGrad family of methods for training neural networks. It constructs a block-diagonal preconditioner where each block consists of a coarse Kronecker product approximation to full-matrix AdaGrad for each parameter of the neural network. In this work, we provide a complete description of the algorithm as well as the performance optimizations that our implementation leverages to train deep networks at-scale in PyTorch. Our implementation enables fast multi-GPU distributed data-parallel training by distributing the memory and computation associated with blocks of each parameter via PyTorch's DTensor data structure and performing an AllGather primitive on the computed search directions at each iteration. This major performance enhancement enables us to achieve at most a 10% performance reduction in per-step wall-clock time compared against standard diagonal-scaling-based adaptive gradient methods. We validate our implementation by performing an ablation study on training ImageNet ResNet50, demonstrating Shampoo's superiority over standard training recipes with minimal hyperparameter tuning.
摘要
<>首先,我们需要介绍一下Shampoo算法。Shampoo是一种在线和随机优化算法,属于AdaGrad家族的方法,用于训练神经网络。它构建了一个块对角预conditioner,每个块包含一个粗糙的Kronecker产品approxiamtion来对神经网络每个参数进行AdaGrad优化。在这篇文章中,我们将提供Shampoo算法的完整描述,以及我们的实现方法来使用PyTorch框架进行高效的多GPU分布式数据并行训练。我们的实现方法包括将每个参数的块分配到不同的GPU上,并使用PyTorch的DTensor数据结构进行分布式计算。在每次迭代中,我们使用AllGather primitives来合并所有GPU上的计算结果。这种主要的性能优化方法使得我们可以在每步wall-clock时间中减少训练时间的最大值为10%,相比标准的对 diagonalfactor-scaling 的 adaptive gradient方法。我们验证了我们的实现方法,通过对ImageNet ResNet50进行减少学习率的研究,并证明Shampoo的superiority。<>
Learning topological operations on meshes with application to block decomposition of polygons
results: 能够有效地减少节点度差与理想值的差异,即内部顶点的异常节点数量减少Abstract
We present a learning based framework for mesh quality improvement on unstructured triangular and quadrilateral meshes. Our model learns to improve mesh quality according to a prescribed objective function purely via self-play reinforcement learning with no prior heuristics. The actions performed on the mesh are standard local and global element operations. The goal is to minimize the deviation of the node degrees from their ideal values, which in the case of interior vertices leads to a minimization of irregular nodes.
摘要
我们提出了一种基于学习的框架,用于改善无结构三角形和四边形网格的质量。我们的模型通过自动化反射学习,不含任何先验知识,来改善网格质量 according to a prescribed objective function。操作的方法包括标准的本地和全局元素操作。目标是将节点度 deviation from their ideal values as minimal as possible,即在内部顶点情况下 minimize irregular nodes。Note: The translation is in Simplified Chinese, which is the standard writing system used in mainland China. If you prefer Traditional Chinese, please let me know and I can provide the translation in that format as well.
Flows for Flows: Morphing one Dataset into another with Maximum Likelihood Estimation
results: 我们研究了这种协议的多种变种,以explore如何使用normalizing flows 模型来 Statistically match two datasets。此外,我们还示了如何使用conditioning feature来创建一个基于特定特征的 morphing function,以便为每个特征值创建一个 morphing function。我们用了 Toy examples 和 collider physics 示例来 ilustrate our results。Abstract
Many components of data analysis in high energy physics and beyond require morphing one dataset into another. This is commonly solved via reweighting, but there are many advantages of preserving weights and shifting the data points instead. Normalizing flows are machine learning models with impressive precision on a variety of particle physics tasks. Naively, normalizing flows cannot be used for morphing because they require knowledge of the probability density of the starting dataset. In most cases in particle physics, we can generate more examples, but we do not know densities explicitly. We propose a protocol called flows for flows for training normalizing flows to morph one dataset into another even if the underlying probability density of neither dataset is known explicitly. This enables a morphing strategy trained with maximum likelihood estimation, a setup that has been shown to be highly effective in related tasks. We study variations on this protocol to explore how far the data points are moved to statistically match the two datasets. Furthermore, we show how to condition the learned flows on particular features in order to create a morphing function for every value of the conditioning feature. For illustration, we demonstrate flows for flows for toy examples as well as a collider physics example involving dijet events
摘要
很多高能物理数据分析中的组件需要将一个数据集转换成另一个数据集。通常通过重新权重来解决这个问题,但是有很多优点在保持权重并将数据点移动。正则化流是一种机器学习模型,它在多种粒子物理任务上具有出色的精度。然而,正则化流无法用于 morphing,因为它们需要知道开始数据集的概率密度。在大多数粒子物理任务中,我们可以生成更多的例子,但我们不知道概率密度的准确值。我们提出了一个协议called flows for flows,用于在不知道开始数据集的概率密度情况下,通过最大likelihood估计来训练正则化流。这种设置可以在相关任务中展示出非常高效。我们还研究了这个协议的变种,以探索如何在数据点之间移动的距离,以达到两个数据集的统计匹配。此外,我们还示出了如何基于特定的特征来创建一个适应于每个特征的 morphing 函数。为了说明,我们对假示例和一个撞击物理示例进行了演示。
On Computationally Efficient Learning of Exponential Family Distributions
results: 提供了finite sample guarantees,表明该估计方法可以在样本数量为$O({\sf poly}(k)/\alpha^2)$下 Achieves an error (in $\ell_2$-norm) of $\alpha$ in the parameter estimation,并且在特定的Markov random fields上实现order-optimal的sample complexity $O({\sf log}(k)/\alpha^2)$。Abstract
We consider the classical problem of learning, with arbitrary accuracy, the natural parameters of a $k$-parameter truncated \textit{minimal} exponential family from i.i.d. samples in a computationally and statistically efficient manner. We focus on the setting where the support as well as the natural parameters are appropriately bounded. While the traditional maximum likelihood estimator for this class of exponential family is consistent, asymptotically normal, and asymptotically efficient, evaluating it is computationally hard. In this work, we propose a novel loss function and a computationally efficient estimator that is consistent as well as asymptotically normal under mild conditions. We show that, at the population level, our method can be viewed as the maximum likelihood estimation of a re-parameterized distribution belonging to the same class of exponential family. Further, we show that our estimator can be interpreted as a solution to minimizing a particular Bregman score as well as an instance of minimizing the \textit{surrogate} likelihood. We also provide finite sample guarantees to achieve an error (in $\ell_2$-norm) of $\alpha$ in the parameter estimation with sample complexity $O({\sf poly}(k)/\alpha^2)$. Our method achives the order-optimal sample complexity of $O({\sf log}(k)/\alpha^2)$ when tailored for node-wise-sparse Markov random fields. Finally, we demonstrate the performance of our estimator via numerical experiments.
摘要
我们考虑了经典的学习问题,即用任意精度学习 naturalk parameter 的 $k$-parameter简化 exponential family 的 i.i.d. 样本。我们关注在支持和自然参数均有限制的设置下。传统的最大 LIKElihood estimator 是一个可靠、 asymptotically normal 和 asymptotically efficient 的 estimator,但是计算困难。在这种工作中,我们提出了一个新的损失函数和一个 computationally efficient 的 estimator,该 estimator 是一个可靠的 estimator ,并且在某些条件下具有 asymptotically normal 性。我们证明了,在人口水平,我们的方法可以视为一个 maximum likelihood estimation 的 re-parameterized distribution 的一个例子,这个 distribution 属于同一个类型的 exponential family。此外,我们还证明了我们的 estimator 可以被视为一个 minimize 的 Bregman score 和一个 surrogate likelihood 的解。我们还提供了 finite sample guarantees,可以在 $\ell_2$-norm 内达到 $\alpha$ 的错误水平, sample complexity 为 $O({\sf poly}(k)/\alpha^2)$。当我们特意适应 node-wise-sparse Markov random fields 时,我们的方法可以 дости到 order-optimal 的 sample complexity $O({\sf log}(k)/\alpha^2)$。最后,我们通过数值实验证明了我们的 estimator 的性能。
Using Reed-Muller Codes for Classification with Rejection and Recovery
results: RMAggNet可以减少错误率,同时保持good correctness,并且可以在不同的攻击budget下进行多种攻击。Abstract
When deploying classifiers in the real world, users expect them to respond to inputs appropriately. However, traditional classifiers are not equipped to handle inputs which lie far from the distribution they were trained on. Malicious actors can exploit this defect by making adversarial perturbations designed to cause the classifier to give an incorrect output. Classification-with-rejection methods attempt to solve this problem by allowing networks to refuse to classify an input in which they have low confidence. This works well for strongly adversarial examples, but also leads to the rejection of weakly perturbed images, which intuitively could be correctly classified. To address these issues, we propose Reed-Muller Aggregation Networks (RMAggNet), a classifier inspired by Reed-Muller error-correction codes which can correct and reject inputs. This paper shows that RMAggNet can minimise incorrectness while maintaining good correctness over multiple adversarial attacks at different perturbation budgets by leveraging the ability to correct errors in the classification process. This provides an alternative classification-with-rejection method which can reduce the amount of additional processing in situations where a small number of incorrect classifications are permissible.
摘要
traditional classifiers 传统的分类器adversarial perturbations 攻击性的偏移classification-with-rejection methods 分类与拒绝方法Reed-Muller error-correction codes 里德-迈尔Error correction codesRMAggNet 重元聚合网络incorrectness 错误性good correctness 好的正确性adversarial attacks 攻击性的攻击perturbation budgets 偏移预算additional processing 额外处理
Generalized Regret Analysis of Thompson Sampling using Fractional Posteriors
paper_authors: Prateek Jaiswal, Debdeep Pati, Anirban Bhattacharya, Bani K. Mallick
For: The paper is written to improve the performance of Thompson Sampling (TS) algorithm in stochastic multi-armed bandit problems by using a fractional posterior distribution.* Methods: The paper proposes an variant of TS called $\alpha$-TS, which uses a fractional or $\alpha$-posterior instead of the standard posterior distribution. The paper also provides regret bounds for $\alpha$-TS using recent theoretical developments in non-asymptotic concentration analysis and Bernstein-von Mises type results.* Results: The paper obtains instance-dependent and instance-independent regret bounds for $\alpha$-TS, which are better than those of standard TS. Specifically, the instance-dependent bound is $\mathcal{O}\left(\sum_{k \neq i^*} \Delta_k\left(\frac{\log(T)}{C(\alpha)\Delta_k^2} + \frac{1}{2} \right)\right)$ and the instance-independent bound is $\mathcal{O}(\sqrt{KT\log K})$. The paper also matches the performance of improved UCB algorithm.Abstract
Thompson sampling (TS) is one of the most popular and earliest algorithms to solve stochastic multi-armed bandit problems. We consider a variant of TS, named $\alpha$-TS, where we use a fractional or $\alpha$-posterior ($\alpha\in(0,1)$) instead of the standard posterior distribution. To compute an $\alpha$-posterior, the likelihood in the definition of the standard posterior is tempered with a factor $\alpha$. For $\alpha$-TS we obtain both instance-dependent $\mathcal{O}\left(\sum_{k \neq i^*} \Delta_k\left(\frac{\log(T)}{C(\alpha)\Delta_k^2} + \frac{1}{2} \right)\right)$ and instance-independent $\mathcal{O}(\sqrt{KT\log K})$ frequentist regret bounds under very mild conditions on the prior and reward distributions, where $\Delta_k$ is the gap between the true mean rewards of the $k^{th}$ and the best arms, and $C(\alpha)$ is a known constant. Both the sub-Gaussian and exponential family models satisfy our general conditions on the reward distribution. Our conditions on the prior distribution just require its density to be positive, continuous, and bounded. We also establish another instance-dependent regret upper bound that matches (up to constants) to that of improved UCB [Auer and Ortner, 2010]. Our regret analysis carefully combines recent theoretical developments in the non-asymptotic concentration analysis and Bernstein-von Mises type results for the $\alpha$-posterior distribution. Moreover, our analysis does not require additional structural properties such as closed-form posteriors or conjugate priors.
摘要
汤姆采样(TS)是最受欢迎并且最早的多重护卫机器人问题的算法之一。我们考虑了一种变体called $\alpha$-TS,其中使用一个分数或$\alpha$-后验($\alpha\in(0,1)$)而不是标准后验分布。为计算一个$\alpha$-后验,在标准后验定义中的概率被温和了一个因子$\alpha$。对于$\alpha$-TS,我们获得了两种不同的频见 regret bounds:一种是 $\mathcal{O}\left(\sum_{k \neq i^*} \Delta_k\left(\frac{\log(T)}{C(\alpha)\Delta_k^2} + \frac{1}{2} \right)\right)$,另一种是 $\mathcal{O}(\sqrt{KT\log K})$,其中 $\Delta_k$ 是真实奖励的 $k$ 个和最佳武器的差距,$C(\alpha)$ 是已知的常量。两者都满足非常轻量级的前提和奖励分布。我们的前提只需要其概率密度是正、连续和有界的。我们还提出了另一种与改进的 UCB(Auer 和 Ortner, 2010)的 regret upper bound,其中的 regret bound 与(除了常量外)相同。我们的 regret 分析结合了最近的非 asymptotic 归一化分析和 Bernstein-von Mises 类型的 $\alpha$- posterior 分布结果。此外,我们的分析不需要额外的结构性质,例如闭合形 posterior 或 conjugate priors。
Band-gap regression with architecture-optimized message-passing neural networks
results: 搜索出的最佳模型组成一个ensemble,与现有文献中的模型相比显著提高了性能。不确定性评估使用Monte Carlo Dropout和集成方法,集成方法更为成功。对适用范围进行分析,包括晶系、包括Hubbard参数在内的密集函数计算、以及材料中原子species。Abstract
Graph-based neural networks and, specifically, message-passing neural networks (MPNNs) have shown great potential in predicting physical properties of solids. In this work, we train an MPNN to first classify materials through density functional theory data from the AFLOW database as being metallic or semiconducting/insulating. We then perform a neural-architecture search to explore the model architecture and hyperparameter space of MPNNs to predict the band gaps of the materials identified as non-metals. The parameters in the search include the number of message-passing steps, latent size, and activation-function, among others. The top-performing models from the search are pooled into an ensemble that significantly outperforms existing models from the literature. Uncertainty quantification is evaluated with Monte-Carlo Dropout and ensembling, with the ensemble method proving superior. The domain of applicability of the ensemble model is analyzed with respect to the crystal systems, the inclusion of a Hubbard parameter in the density functional calculations, and the atomic species building up the materials.
摘要
GRaph-based neural networks和具体地是消息传递神经网络(MPNNs)在预测固体物理性质方面表现出了极高的潜力。在这项工作中,我们使用MPNN进行物料分类,通过density functional theory数据库AFLOW中的数据来判断物料是金属或半导体/隔离体。然后,我们进行神经网络架构和超参数的搜索,以提高MPNN预测非金属材料带隔的能力。搜索的参数包括消息传递步数、隐藏大小和活化函数等。我们从搜索中选拔出最佳性能的模型,并将其 Pooling 成 ensemble,该ensemble在已有文献中的模型表现出色。我们使用Monte Carlo Dropout和集成来评估uncertainty quantification,集成方法表现更优。我们还分析了ensemble模型的适用范围,包括晶系、包括Hubbard参数在density functional计算中,以及物质组成的原子种。
Modeling Supply and Demand in Public Transportation Systems
results: 我们预测了HDPT的服务缺陷,以帮助它提高运营效率和效果。Abstract
The Harrisonburg Department of Public Transportation (HDPT) aims to leverage their data to improve the efficiency and effectiveness of their operations. We construct two supply and demand models that help the department identify gaps in their service. The models take many variables into account, including the way that the HDPT reports to the federal government and the areas with the most vulnerable populations in Harrisonburg City. We employ data analysis and machine learning techniques to make our predictions.
摘要
哈里逊堡公共交通部门(HDPT)想要利用数据提高其运营效率和效果。我们构建了两个供应和需求模型,帮助公共交通部门识别服务中的缺陷。这些模型考虑了许多变量,包括公共交通部门向联邦政府报告的方式和哈里逊城市最为易受影响的地区。我们使用数据分析和机器学习技术进行预测。
paper_authors: Alexander Kleinsorge, Stefan Kupper, Alexander Fauck, Felix Rothe
For: 本研究提出了一种新的、快速(指数减速)、无参数(hyper-parameter-free)的梯度下降优化算法。* Methods: 该方法利用情况意识来适应学习率α,主要寻求与邻域梯度垂直的梯度。该方法具有高成功率和快速收敛率,不需手动调整参数,因此更通用。它可应用于任意维度n和问题规模上,并且只 Linearly 增长(order O(n))。* Results: 对于MNIST数据集,该方法与现有优化器进行比较,并达到了惊人的性能。作者认为,这将开启一个全新的研究方向,即梯度下降优化。Abstract
We present a novel, fast (exponential rate adaption), ab initio (hyper-parameter-free) gradient based optimizer algorithm. The main idea of the method is to adapt the learning rate $\alpha$ by situational awareness, mainly striving for orthogonal neighboring gradients. The method has a high success and fast convergence rate and does not rely on hand-tuned parameters giving it greater universality. It can be applied to problems of any dimensions n and scales only linearly (of order O(n)) with the dimension of the problem. It optimizes convex and non-convex continuous landscapes providing some kind of gradient. In contrast to the Ada-family (AdaGrad, AdaMax, AdaDelta, Adam, etc.) the method is rotation invariant: optimization path and performance are independent of coordinate choices. The impressive performance is demonstrated by extensive experiments on the MNIST benchmark data-set against state-of-the-art optimizers. We name this new class of optimizers after its core idea Exponential Learning Rate Adaption - ELRA. We present it in two variants c2min and p2min with slightly different control. The authors strongly believe that ELRA will open a completely new research direction for gradient descent optimize.
摘要
我们提出了一种新的、快速(指数级别适应)、无参数(hyper-parameter-free)的梯度基本优化算法。主要思想是根据情况意识来适应学习率α,主要寻求垂直邻域梯度的协调。该方法具有高成功率和快速收敛率,不依赖手动调整参数,因此更加通用。它可以应用于任意维度n和问题规模上,且只linearly(order O(n))随维度增长。它可以优化凸和非凸连续地征地形,并且提供梯度的一种形式。与Ada家族(AdaGrad、AdaMax、AdaDelta、Adam等)不同,该方法是坐标选择无关的:优化路径和性能独立于坐标选择。我们在MNIST数据集上进行了广泛的实验,并证明了该新类型的优化器对现状最优化器的性能具有很高的表现。我们称之为权重学习率适应(ELRA)。我们提出了两种变体:c2min和p2min,它们有些微的不同控制。作者们认为,ELRA将开启一个全新的研究方向,它将gradient descent优化至新的高度。
ssVERDICT: Self-Supervised VERDICT-MRI for Enhanced Prostate Tumour Characterisation
results: 比基eline方法(NLLS和Supervised DNN)更高的估计精度和降低偏差,以及更高的信任度对比 benign prostate tissue和癌细胞组织的分化。Abstract
MRI is increasingly being used in the diagnosis of prostate cancer (PCa), with diffusion MRI (dMRI) playing an integral role. When combined with computational models, dMRI can estimate microstructural information such as cell size. Conventionally, such models are fit with a nonlinear least squares (NLLS) curve fitting approach, associated with a high computational cost. Supervised deep neural networks (DNNs) are an efficient alternative, however their performance is significantly affected by the underlying distribution of the synthetic training data. Self-supervised learning is an attractive alternative, where instead of using a separate training dataset, the network learns the features of the input data itself. This approach has only been applied to fitting of trivial dMRI models thus far. Here, we introduce a self-supervised DNN to estimate the parameters of the VERDICT (Vascular, Extracellular and Restricted DIffusion for Cytometry in Tumours) model for prostate. We demonstrate, for the first time, fitting of a complex three-compartment biophysical model with machine learning without the requirement of explicit training labels. We compare the estimation performance to baseline NLLS and supervised DNN methods, observing improvement in estimation accuracy and reduction in bias with respect to ground truth values. Our approach also achieves a higher confidence level for discrimination between cancerous and benign prostate tissue in comparison to the other methods on a dataset of 20 PCa patients, indicating potential for accurate tumour characterisation.
摘要
Toward Discretization-Consistent Closure Schemes for Large Eddy Simulation Using Reinforcement Learning
for: developing discretization-consistent closure schemes for implicitly filtered Large Eddy Simulation (LES)
methods: Markov decision process with Reinforcement Learning (RL) to adapt the coefficients of LES closure models
results: accurate and consistent results, either matching or outperforming classical state-of-the-art models for different discretizations and resolutionsAbstract
We propose a novel method for developing discretization-consistent closure schemes for implicitly filtered Large Eddy Simulation (LES). In implicitly filtered LES, the induced filter kernel, and thus the closure terms, are determined by the properties of the grid and the discretization operator, leading to additional computational subgrid terms that are generally unknown in a priori analysis. Therefore, the task of adapting the coefficients of LES closure models is formulated as a Markov decision process and solved in an a posteriori manner with Reinforcement Learning (RL). This allows to adjust the model to the actual discretization as it also incorporates the interaction between the discretization and the model itself. This optimization framework is applied to both explicit and implicit closure models. An element-local eddy viscosity model is optimized as the explicit model. For the implicit modeling, RL is applied to identify an optimal blending strategy for a hybrid discontinuous Galerkin (DG) and finite volume scheme. All newly derived models achieve accurate and consistent results, either matching or outperforming classical state-of-the-art models for different discretizations and resolutions. Moreover, the explicit model is demonstrated to adapt its distribution of viscosity within the DG elements to the inhomogeneous discretization properties of the operator. In the implicit case, the optimized hybrid scheme renders itself as a viable modeling ansatz that could initiate a new class of high order schemes for compressible turbulence. Overall, the results demonstrate that the proposed RL optimization can provide discretization-consistent closures that could reduce the uncertainty in implicitly filtered LES.
摘要
我们提出了一种新的方法,用于开发基于隐式筛护的大气流体学(LES)中的精度适应闭合方法。在隐式筛护LES中,抽象筛护kernels和关闭项的计算是由网格和精度分解算法的性质决定的,导致额外的计算误差,通常在先前分析中不可知。因此,我们将adapting模型的精度适应问题转化为Markov决策过程,并使用再增强学习(RL)来解决。这种优化框架可以在Explicit和隐式关闭模型之间进行选择。我们使用了一种元素本地风格膜粘性模型作为Explicit模型。而对于隐式模型,我们使用RL来标识最佳混合策略。所有新 derive的模型都实现了准确和一致的结果,与经典的状态对照模型相当或超越。此外,Explicit模型还示出了适应不同精度和分辨率的 Distribution of viscosity 的能力。在隐式情况下,优化的混合方案表明了一种可能的新的高阶方法。总的来说,结果表明了我们提出的RL优化可以提供基于隐式筛护LES的精度适应闭合,从而减少了不确定性。
Speciality vs Generality: An Empirical Study on Catastrophic Forgetting in Fine-tuning Foundation Models
methods: 本研究使用了多种正则化方法,包括 continual learning 和 Wise-FT 方法,以mitigate the loss of generality 在精化过程中。
results: 研究发现,使用 Wise-FT 方法可以最 effectively 平衡特性和通用性,并且在多种任务和分布上表现最佳。Abstract
Foundation models, including Vision Language Models (VLMs) and Large Language Models (LLMs), possess the $generality$ to handle diverse distributions and tasks, which stems from their extensive pre-training datasets. The fine-tuning of foundation models is a common practice to enhance task performance or align the model's behavior with human expectations, allowing them to gain $speciality$. However, the small datasets used for fine-tuning may not adequately cover the diverse distributions and tasks encountered during pre-training. Consequently, the pursuit of speciality during fine-tuning can lead to a loss of {generality} in the model, which is related to catastrophic forgetting (CF) in deep learning. In this study, we demonstrate this phenomenon in both VLMs and LLMs. For instance, fine-tuning VLMs like CLIP on ImageNet results in a loss of generality in handling diverse distributions, and fine-tuning LLMs like Galactica in the medical domain leads to a loss in following instructions and common sense. To address the trade-off between the speciality and generality, we investigate multiple regularization methods from continual learning, the weight averaging method (Wise-FT) from out-of-distributional (OOD) generalization, which interpolates parameters between pre-trained and fine-tuned models, and parameter-efficient fine-tuning methods like Low-Rank Adaptation (LoRA). Our findings show that both continual learning and Wise-ft methods effectively mitigate the loss of generality, with Wise-FT exhibiting the strongest performance in balancing speciality and generality.
摘要
基础模型,如视觉语言模型(VLM)和大型语言模型(LLM),拥有泛化能力,可以处理多样化分布和任务,这是由于它们的广泛预训练数据集的影响。然而,在精度调整过程中,使用小型数据集可能无法完全覆盖预训练中遇到的多样化分布和任务。因此,在精度调整过程中寻求特点可能会导致模型失去泛化能力,这与深度学习中的恶化学习(CF)有关。在这个研究中,我们 demonstarte了这种现象在VLM和LLM中。例如,在CLIP的ImageNet精度调整中,VLM失去了处理多样化分布的能力,而在医疗领域的Galactica精度调整中,LLM失去了遵循指令和常识。为了解决特点和泛化之间的负担,我们 investigate了多种CONTINUAL LEARNING的正则化方法,包括Weight Averaging Method(Wise-FT)和Parameter-Efficient Fine-Tuning Method(LoRA)。我们的发现表明,CONTINUAL LEARNING和Wise-FT方法都能有效避免失去泛化能力,其中Wise-FT方法在均衡特点和泛化能力方面表现最佳。
Rethinking Evaluation Metric for Probability Estimation Models Using Esports Data
paper_authors: Euihyeon Choi, Jooyoung Kim, Wonkyung Lee
for: 这个论文是为了评估电子竞技比赛中的胜率估计模型而写的。
methods: 这个论文使用了布莱尔分数和预期准确错误(ECE)作为评估胜率估计模型的性能评估指标,并提出了一个新的简单 yet effective的度量标准 called Balance score,该标准具有六种好的属性。
results: 经过广泛的 simulation studies 和实际游戏快照数据的评估,提出的 Balance score 显示出了可靠地评估电子竞技比赛中胜率估计模型的性能,并且可以作为一个全面的评估指标来评估总的比赛模型。Abstract
Probability estimation models play an important role in various fields, such as weather forecasting, recommendation systems, and sports analysis. Among several models estimating probabilities, it is difficult to evaluate which model gives reliable probabilities since the ground-truth probabilities are not available. The win probability estimation model for esports, which calculates the win probability under a certain game state, is also one of the fields being actively studied in probability estimation. However, most of the previous works evaluated their models using accuracy, a metric that only can measure the performance of discrimination. In this work, we firstly investigate the Brier score and the Expected Calibration Error (ECE) as a replacement of accuracy used as a performance evaluation metric for win probability estimation models in esports field. Based on the analysis, we propose a novel metric called Balance score which is a simple yet effective metric in terms of six good properties that probability estimation metric should have. Under the general condition, we also found that the Balance score can be an effective approximation of the true expected calibration error which has been imperfectly approximated by ECE using the binning technique. Extensive evaluations using simulation studies and real game snapshot data demonstrate the promising potential to adopt the proposed metric not only for the win probability estimation model for esports but also for evaluating general probability estimation models.
摘要
probabilistic estimation models play an important role in various fields, such as weather forecasting, recommendation systems, and sports analysis. among several models estimating probabilities, it is difficult to evaluate which model gives reliable probabilities since the ground-truth probabilities are not available. the win probability estimation model for esports, which calculates the win probability under a certain game state, is also one of the fields being actively studied in probability estimation. however, most of the previous works evaluated their models using accuracy, a metric that only can measure the performance of discrimination. in this work, we firstly investigate the Brier score and the Expected Calibration Error (ECE) as a replacement of accuracy used as a performance evaluation metric for win probability estimation models in esports field. based on the analysis, we propose a novel metric called Balance score which is a simple yet effective metric in terms of six good properties that probability estimation metric should have. under the general condition, we also found that the Balance score can be an effective approximation of the true expected calibration error which has been imperfectly approximated by ECE using the binning technique. extensive evaluations using simulation studies and real game snapshot data demonstrate the promising potential to adopt the proposed metric not only for the win probability estimation model for esports but also for evaluating general probability estimation models.
Consistency and adaptivity are complementary targets for the validation of variance-based uncertainty quantification metrics in machine learning regression tasks
results: 论文表明了consistency和adaptivity是补充的验证目标,并且一个good consistency不一定意味着good adaptivity。Abstract
Reliable uncertainty quantification (UQ) in machine learning (ML) regression tasks is becoming the focus of many studies in materials and chemical science. It is now well understood that average calibration is insufficient, and most studies implement additional methods testing the conditional calibration with respect to uncertainty, i.e. consistency. Consistency is assessed mostly by so-called reliability diagrams. There exists however another way beyond average calibration, which is conditional calibration with respect to input features, i.e. adaptivity. In practice, adaptivity is the main concern of the final users of a ML-UQ method, seeking for the reliability of predictions and uncertainties for any point in features space. This article aims to show that consistency and adaptivity are complementary validation targets, and that a good consistency does not imply a good adaptivity. Adapted validation methods are proposed and illustrated on a representative example.
摘要
通用不确定评估(UQ)在机器学习(ML)回归任务中的可靠性已成为许多材料和化学科学研究的焦点。现已经明确,平均调整不足,大多数研究采用附加方法测试受到不确定性的Conditional calibration,即一致性。一致性通常通过所谓的可靠性图表示。然而,还有另一种超出平均调整的方法,即基于输入特征的 Conditional calibration,即适应性。在实践中,适应性是最终用户的ML-UQ方法可靠性预测和不确定性的首要关心,寻求任何特征空间中的可靠性和不确定性。本文目标表明了一致性和适应性是补充 validate 目标,并且一个好的一致性不一定意味着一个好的适应性。适应性验证方法被提出并在一个代表性的示例中 ilustrated。
Risk-Aware Reinforcement Learning through Optimal Transport Theory
for: This paper is written for researchers and practitioners interested in developing risk-aware reinforcement learning (RL) algorithms that can operate in dynamic and uncertain environments.
methods: The paper integrates Optimal Transport (OT) theory with RL to create a risk-aware framework. The approach modifies the objective function to ensure that the resulting policy maximizes expected rewards while respecting risk constraints dictated by OT distances between state visitation distributions and desired risk profiles.
results: The paper offers a formulation that elevates risk considerations alongside conventional RL objectives, and provides a series of theorems that map the relationships between risk distributions, optimal value functions, and policy behaviors. The work demonstrates a promising direction for RL, ensuring a balanced fusion of reward pursuit and risk awareness.Abstract
In the dynamic and uncertain environments where reinforcement learning (RL) operates, risk management becomes a crucial factor in ensuring reliable decision-making. Traditional RL approaches, while effective in reward optimization, often overlook the landscape of potential risks. In response, this paper pioneers the integration of Optimal Transport (OT) theory with RL to create a risk-aware framework. Our approach modifies the objective function, ensuring that the resulting policy not only maximizes expected rewards but also respects risk constraints dictated by OT distances between state visitation distributions and the desired risk profiles. By leveraging the mathematical precision of OT, we offer a formulation that elevates risk considerations alongside conventional RL objectives. Our contributions are substantiated with a series of theorems, mapping the relationships between risk distributions, optimal value functions, and policy behaviors. Through the lens of OT, this work illuminates a promising direction for RL, ensuring a balanced fusion of reward pursuit and risk awareness.
摘要
在动态和不确定环境中,控制风险成为RL运算中关键的一个因素,以确保可靠的决策。传统RL方法,虽然能够优化奖励,但经常忽略潜在的风险风险。为应对这些问题,这篇论文提出了将Optimal Transport(OT)理论与RL结合,创建一个风险意识框架。我们的方法修改了目标函数,使得结果策略不仅最大化预期奖励,还遵循由OT距离状态访问分布和愿望风险规则的风险约束。通过OT的数学精度,我们提供了一种形ulation,使得风险考虑与传统RL目标联系起来。我们的贡献得到了一系列定理的证明,映射了风险分布、优化值函数和策略行为之间的关系。通过OT的视角,这项工作突出了RL中风险意识的重要性,并提供了一个平衡RL目标和风险考虑的可行方法。
A Consistent and Scalable Algorithm for Best Subset Selection in Single Index Models
results: simulations 表明,该算法不仅可以快速计算出最佳子集,而且可以准确地回归最佳子集,并且不需要进行模型选择调整。Abstract
Analysis of high-dimensional data has led to increased interest in both single index models (SIMs) and best subset selection. SIMs provide an interpretable and flexible modeling framework for high-dimensional data, while best subset selection aims to find a sparse model from a large set of predictors. However, best subset selection in high-dimensional models is known to be computationally intractable. Existing methods tend to relax the selection, but do not yield the best subset solution. In this paper, we directly tackle the intractability by proposing the first provably scalable algorithm for best subset selection in high-dimensional SIMs. Our algorithmic solution enjoys the subset selection consistency and has the oracle property with a high probability. The algorithm comprises a generalized information criterion to determine the support size of the regression coefficients, eliminating the model selection tuning. Moreover, our method does not assume an error distribution or a specific link function and hence is flexible to apply. Extensive simulation results demonstrate that our method is not only computationally efficient but also able to exactly recover the best subset in various settings (e.g., linear regression, Poisson regression, heteroscedastic models).
摘要
Translated into Simplified Chinese:高维数据分析引发了对单指数模型(SIMs)和最佳子集选择的更多的关注。SIMs提供了可解释和灵活的模型框架,而最佳子集选择则目标是从大量预测变量中找到一个稀疏的模型。然而,高维模型中的最佳子集选择是计算 tractable。现有方法通常会放弃选择,而不是获得最佳子集解决方案。在这篇论文中,我们直接面临到了 tractability 的问题,并提出了高维 SIMs 中首个可扩展性的最佳子集选择算法。我们的算法解决了模型选择决策,并且具有可算法性和高概率 oracle 性。我们的方法不仅不需要错误分布或特定的链函数,而且可以适应应用。我们的实验结果表明,我们的方法不仅是计算高效的,而且能够在不同的设置(如线性回归、波利回归、不同的风险函数)中准确回归最佳子集。
Long-term drought prediction using deep neural networks based on geospatial weather data
results: 比基线模型更高的ROC AUC scores,表明Convolutional LSTM和transformer模型具有更高的预测精度。Abstract
The accurate prediction of drought probability in specific regions is crucial for informed decision-making in agricultural practices. It is important to make predictions one year in advance, particularly for long-term decisions. However, forecasting this probability presents challenges due to the complex interplay of various factors within the region of interest and neighboring areas. In this study, we propose an end-to-end solution to address this issue based on various spatiotemporal neural networks. The models considered focus on predicting the drought intensity based on the Palmer Drought Severity Index (PDSI) for subregions of interest, leveraging intrinsic factors and insights from climate models to enhance drought predictions. Comparative evaluations demonstrate the superior accuracy of Convolutional LSTM (ConvLSTM) and transformer models compared to baseline gradient boosting and logistic regression solutions. The two former models achieved impressive ROC AUC scores from 0.90 to 0.70 for forecast horizons from one to six months, outperforming baseline models. The transformer showed superiority for shorter horizons, while ConvLSTM did so for longer horizons. Thus, we recommend selecting the models accordingly for long-term drought forecasting. To ensure the broad applicability of the considered models, we conduct extensive validation across regions worldwide, considering different environmental conditions. We also run several ablation and sensitivity studies to challenge our findings and provide additional information on how to solve the problem.
摘要
“精准预测特定地区的旱情机会是农业实践中重要的决策依据。特别是在长期决策中,一年前的预测非常重要。然而,预测这些机会受到当地和邻近地区的复杂因素之间的互动所困扰。在这个研究中,我们提出了一个终端解决方案,基于不同的时空神经网络。我们考虑的模型集中心于预测旱情强度,基于Palmer旱情严重指数(PDSI)的子区域,利用自然因素和气候模型的内在知识来增强旱情预测。比较评估显示,Convolutional LSTM(ConvLSTM)和transformer模型在基于梯度提升和逻辑回传模型的比较下表现出更高的ROC AUC分数,分别在1至6个月的预测时间范围内。transformer模型在短期预测中表现出色,而ConvLSTM模型在长期预测中表现出色。因此,我们建议在长期旱情预测中选择这两种模型。为确保考虑的模型在不同环境下具有广泛的应用性,我们进行了广泛的验证,考虑了不同的环境条件。我们还进行了多个ablation和敏感性研究,以提供额外的信息和解决方案。”
Optimization Guarantees of Unfolded ISTA and ADMM Networks With Smooth Soft-Thresholding
paper_authors: Shaik Basheeruddin Shah, Pradyumna Pradhan, Wei Pu, Ramunaidu Randhi, Miguel R. D. Rodrigues, Yonina C. Eldar
for: 本研究探讨了线性逆问题的解决方法,具体来说是使用iterative soft-thresholding algorithm (LISTA)和alternating direction method of multipliers compressive sensing network (ADMM-CSNet)来有效地Addressing these problems.
results: 本研究表明,在over-parameterized(OP) regime下,通过利用一种修改后的Polyak-Lojasiewicz(PL*)条件,可以确保 Training loss减少到 Near-zero 的情况,并且存在global minimum和抽象减少从初始化点使用梯度下降方法。此外,我们还证明了,随着网络宽度的增加, unfolded networks 的阈值会增加,而 FFNN 的阈值则会减少。Abstract
Solving linear inverse problems plays a crucial role in numerous applications. Algorithm unfolding based, model-aware data-driven approaches have gained significant attention for effectively addressing these problems. Learned iterative soft-thresholding algorithm (LISTA) and alternating direction method of multipliers compressive sensing network (ADMM-CSNet) are two widely used such approaches, based on ISTA and ADMM algorithms, respectively. In this work, we study optimization guarantees, i.e., achieving near-zero training loss with the increase in the number of learning epochs, for finite-layer unfolded networks such as LISTA and ADMM-CSNet with smooth soft-thresholding in an over-parameterized (OP) regime. We achieve this by leveraging a modified version of the Polyak-Lojasiewicz, denoted PL$^*$, condition. Satisfying the PL$^*$ condition within a specific region of the loss landscape ensures the existence of a global minimum and exponential convergence from initialization using gradient descent based methods. Hence, we provide conditions, in terms of the network width and the number of training samples, on these unfolded networks for the PL$^*$ condition to hold. We achieve this by deriving the Hessian spectral norm of these networks. Additionally, we show that the threshold on the number of training samples increases with the increase in the network width. Furthermore, we compare the threshold on training samples of unfolded networks with that of a standard fully-connected feed-forward network (FFNN) with smooth soft-thresholding non-linearity. We prove that unfolded networks have a higher threshold value than FFNN. Consequently, one can expect a better expected error for unfolded networks than FFNN.
摘要
解决线性逆问题在许多应用中发挥关键作用。基于数据驱动的算法折叠方法在这些问题上获得了广泛的关注。例如,学习迭代软阈值算法(LISTA)和多向量方法混合压缩感知网络(ADMM-CSNet)是两种广泛使用的方法,基于ISTA和ADMM算法 соответственно。在这种情况下,我们研究了优化保证,即随着学习epoch数量的增加,训练损失逼近零。我们通过利用修改后的波利亚克-洛亚西埃茨(PL$^*$)条件来实现这一点。在特定的损失图像中满足PL$^*$条件可以保证存在全局最小值,并使用梯度下降方法进行快速收敛。因此,我们提供了基于网络宽度和训练样本数量的条件,以确保PL$^*$条件在 unfolded networks 中成立。我们通过计算这些网络的梯度特征值来实现这一点。此外,我们还证明了随着网络宽度的增加,训练样本数量的阈值也会增加。此外,我们比较了 unfolded networks 和标准的全连接径Feedforward Network(FFNN)的训练样本数量的阈值,并证明了 unfolded networks 的阈值高于 FFNN。因此,我们可以预期 unfolded networks 的预期错误更低。
Assessing the Generalization Gap of Learning-Based Speech Enhancement Systems in Noisy and Reverberant Environments
paper_authors: Philippe Gonzalez, Tommy Sonne Alstrøm, Tobias May for: 这种研究旨在解决学习型声音提高系统在不同条件下的一致性问题。methods: 研究使用了一个参照模型,该模型在测试条件下进行训练,以便用于评估测试条件的难度。results: 研究发现,所有模型在语音匹配情况下表现最差,而好的噪声和房间泛化可以通过训练多个数据库来实现。此外,最新的模型在匹配情况下表现最好,但在不匹配情况下表现很差,甚至可能 inferior to FFNN-based system。Abstract
The acoustic variability of noisy and reverberant speech mixtures is influenced by multiple factors, such as the spectro-temporal characteristics of the target speaker and the interfering noise, the signal-to-noise ratio (SNR) and the room characteristics. This large variability poses a major challenge for learning-based speech enhancement systems, since a mismatch between the training and testing conditions can substantially reduce the performance of the system. Generalization to unseen conditions is typically assessed by testing the system with a new speech, noise or binaural room impulse response (BRIR) database different from the one used during training. However, the difficulty of the speech enhancement task can change across databases, which can substantially influence the results. The present study introduces a generalization assessment framework that uses a reference model trained on the test condition, such that it can be used as a proxy for the difficulty of the test condition. This allows to disentangle the effect of the change in task difficulty from the effect of dealing with new data, and thus to define a new measure of generalization performance termed the generalization gap. The procedure is repeated in a cross-validation fashion by cycling through multiple speech, noise, and BRIR databases to accurately estimate the generalization gap. The proposed framework is applied to evaluate the generalization potential of a feedforward neural network (FFNN), Conv-TasNet, DCCRN and MANNER. We find that for all models, the performance degrades the most in speech mismatches, while good noise and room generalization can be achieved by training on multiple databases. Moreover, while recent models show higher performance in matched conditions, their performance substantially decreases in mismatched conditions and can become inferior to that of the FFNN-based system.
摘要
难以分辨的声音混响样本中的声音特征是多种因素的影响,包括目标说话人的spectro-temporal特征、干扰噪声和room特性。这种大量的变化对学习型声音提高系统来说是一个主要挑战,因为在测试和训练条件不同时,系统的性能可能会下降substantially。通常来衡量系统的普适性是通过在测试数据集中测试系统,并对其进行cross-validation验证。然而,任务难度可能会在不同的数据集中发生变化,这会对结果产生很大的影响。本研究提出了一种普适性评估框架,通过使用测试条件下的参考模型,以便用其作为测试条件的困难度的代理。这 позволяет分解把握新数据的效果与把握任务难度的效果分开,并定义一个新的普适性度量,称为普适差(generalization gap)。该框架在多个语音、噪声和BRIR数据集中重复应用,以准确估计普适性差。研究发现,对所有模型来说,性能最大程度下降是在语音匹配中,而好的噪声和room普适性可以通过训练多个数据集来实现。此外,最新的模型在匹配条件下表现出色,但在匹配不符条件下表现很差,可能变成较为老的FFNN基于系统的性能。
Efficient Memory Management for Large Language Model Serving with PagedAttention
results: 我们的评估结果显示,vLLM可以提高受欢迎的LLM的吞吐量,比现状态之系统(如FasterTransformer和Orca)高出2-4倍,同时保持同样的响应时间。这种改进更加明显地出现在更长的序列、更大的模型和更复杂的解码算法中。vLLM的源代码可以在https://github.com/vllm-project/vllm上获取。Abstract
High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. When managed inefficiently, this memory can be significantly wasted by fragmentation and redundant duplication, limiting the batch size. To address this problem, we propose PagedAttention, an attention algorithm inspired by the classical virtual memory and paging techniques in operating systems. On top of it, we build vLLM, an LLM serving system that achieves (1) near-zero waste in KV cache memory and (2) flexible sharing of KV cache within and across requests to further reduce memory usage. Our evaluations show that vLLM improves the throughput of popular LLMs by 2-4$\times$ with the same level of latency compared to the state-of-the-art systems, such as FasterTransformer and Orca. The improvement is more pronounced with longer sequences, larger models, and more complex decoding algorithms. vLLM's source code is publicly available at https://github.com/vllm-project/vllm
摘要
高速服务大语言模型(LLM)需要批处理足够多的请求。然而,现有系统受到缓存(KV cache)的内存占用增加和减少的影响,导致批处理大小受限。当管理不善时,这些内存可能会受到浪费,因为碎片和重复备份,限制批处理大小。为解决这问题,我们提出了 PagedAttention,一种基于经典虚拟内存和分页技术的注意力算法。此外,我们建立了 vLLM,一个实现了(1)缓存内存几乎为零和(2)在请求之间和请求中 flexible 分享缓存的 LLM 服务系统。我们的评估显示,vLLM 可以在相同的延迟水平下提高流行的 LLM 的 Throughput 2-4 倍,比现状态的系统(如 FasterTransformer 和 Orca)更高效。这种改进更加明显,当遇到长序、大模型和复杂的解码算法时。vLLM 的源代码可以在 上获取。
Accelerating Edge AI with Morpher: An Integrated Design, Compilation and Simulation Framework for CGRAs
results: 该论文表明,Morpher框架可以自动将AI应用程序核心编译到用户定义的CGRA架构上,并验证其功能。这些结果表明CGRA在边缘计算中的应用潜力,以及Morpher框架的可靠性和灵活性。Abstract
Coarse-Grained Reconfigurable Arrays (CGRAs) hold great promise as power-efficient edge accelerator, offering versatility beyond AI applications. Morpher, an open-source, architecture-adaptive CGRA design framework, is specifically designed to explore the vast design space of CGRAs. The comprehensive ecosystem of Morpher includes a tailored compiler, simulator, accelerator synthesis, and validation framework. This study provides an overview of Morpher, highlighting its capabilities in automatically compiling AI application kernels onto user-defined CGRA architectures and verifying their functionality. Through the Morpher framework, the versatility of CGRAs is harnessed to facilitate efficient compilation and verification of edge AI applications, covering important kernels representative of a wide range of embedded AI workloads. Morpher is available online at https://github.com/ecolab-nus/morpher-v2.
摘要
便捷grained重构阵列(CGRA)具有很大的潜力,作为智能边缘加速器,它们的灵活性超出了人工智能应用。Morpher是一个开源的、建筑架构适应的CGRA设计框架,专门为探索CGRA的庞大设计空间而设计。Morpher的全面生态系统包括特制的编译器、模拟器、加速器合成和验证框架。本文提供了Morpher的概述,强调它在自动将人工智能应用程序核心编译到用户定义的CGRA架构上并验证其功能的能力。通过Morpher框架,CGRA的灵活性得以充分发挥,以便高效地编译和验证边缘AI应用程序,覆盖了各种嵌入式AI工作负荷中的重要核心。Morpher可在https://github.com/ecolab-nus/morpher-v2上下载。
A robust synthetic data generation framework for machine learning in High-Resolution Transmission Electron Microscopy (HRTEM)
results: 研究人员通过使用 Construction Zone 和 simulated databases,成功地实现了高精度的自动分析工具,并在多个实验室 benchmark 上达到了州际精度。Abstract
Machine learning techniques are attractive options for developing highly-accurate automated analysis tools for nanomaterials characterization, including high-resolution transmission electron microscopy (HRTEM). However, successfully implementing such machine learning tools can be difficult due to the challenges in procuring sufficiently large, high-quality training datasets from experiments. In this work, we introduce Construction Zone, a Python package for rapidly generating complex nanoscale atomic structures, and develop an end-to-end workflow for creating large simulated databases for training neural networks. Construction Zone enables fast, systematic sampling of realistic nanomaterial structures, and can be used as a random structure generator for simulated databases, which is important for generating large, diverse synthetic datasets. Using HRTEM imaging as an example, we train a series of neural networks on various subsets of our simulated databases to segment nanoparticles and holistically study the data curation process to understand how various aspects of the curated simulated data -- including simulation fidelity, the distribution of atomic structures, and the distribution of imaging conditions -- affect model performance across several experimental benchmarks. Using our results, we are able to achieve state-of-the-art segmentation performance on experimental HRTEM images of nanoparticles from several experimental benchmarks and, further, we discuss robust strategies for consistently achieving high performance with machine learning in experimental settings using purely synthetic data.
摘要
机器学习技术是为发展高精度自动分析工具的可能性非常高,特别是在高解度电子镜icroscopy (HRTEM) 领域。然而,实现这些机器学习工具可能会困难,因为实验获得足够大、高质量训练数据的挑战。在这项工作中,我们介绍了 Construction Zone,一个 Python 包,用于快速生成复杂的 nanoscale 原子结构,并开发了一个端到端的工作流程,用于创建大规模的 simulated 数据库。Construction Zone 允许快速、系统地采样真实的 nanomaterial 结构,可以用作随机结构生成器,这是生成大、多样化的 sintetic 数据库的重要工具。使用 HRTEM 成像为例,我们在不同的 sub-dataset 上训练了一系列神经网络,以分类 nanoparticles,并对数据准备过程进行了全面的研究,以了解不同的 simulated 数据特性如 simulation fidelity、原子结构分布和成像条件分布对模型性能的影响。使用我们的结果,我们可以在多个实验室benchmark上达到 estado-of-the-art 的分类性能,并讨论了在实验设置中使用纯 synthetic 数据获得高性能的机器学习策略。
Overview of Human Activity Recognition Using Sensor Data
paper_authors: Rebeen Ali Hamad, Wai Lok Woo, Bo Wei, Longzhi Yang
for: 这篇论文主要是为了探讨人活动识别(HAR)领域的最新进展和挑战。
methods: 本论文使用了多种感知器和深度学习技术来探讨人活动识别。
results: 论文提出了一些关键应用场景,如家居和办公室自动化、安全监测和医疗保健等,同时也探讨了深度学习技术在HAR领域的应用。Abstract
Human activity recognition (HAR) is an essential research field that has been used in different applications including home and workplace automation, security and surveillance as well as healthcare. Starting from conventional machine learning methods to the recently developing deep learning techniques and the Internet of things, significant contributions have been shown in the HAR area in the last decade. Even though several review and survey studies have been published, there is a lack of sensor-based HAR overview studies focusing on summarising the usage of wearable sensors and smart home sensors data as well as applications of HAR and deep learning techniques. Hence, we overview sensor-based HAR, discuss several important applications that rely on HAR, and highlight the most common machine learning methods that have been used for HAR. Finally, several challenges of HAR are explored that should be addressed to further improve the robustness of HAR.
摘要
人类活动识别(HAR)是一个重要的研究领域,在不同的应用中都有广泛的应用,包括家庭和工作场所自动化、安全监测以及医疗保健等。过去十年,在传统的机器学习方法基础上,深度学习技术的发展以及互联网物联网技术的应用,在HAR领域内部维护了重要的贡献。然而,当前的文献综述和评视研究却缺乏关于基于感知器的HAR的概要,包括穿戴式感知器和智能家居感知器数据的使用情况,以及基于HAR和深度学习技术的应用。因此,本文将对感知器基于HAR进行概要,讨论一些重要的依赖于HAR的应用,并高亮一些最常用的机器学习方法,最后探讨HAR面临的一些挑战,以便进一步改善HAR的稳定性。
A General Verification Framework for Dynamical and Control Models via Certificate Synthesis
results: 该论文通过开发一个软件工具,测试了其核心框架的可靠性和灵活性。 Results show that the proposed approach can effectively synthesize controllers and certificates for a wide range of system specifications, and provide a formal guarantee of correctness.Abstract
An emerging branch of control theory specialises in certificate learning, concerning the specification of a desired (possibly complex) system behaviour for an autonomous or control model, which is then analytically verified by means of a function-based proof. However, the synthesis of controllers abiding by these complex requirements is in general a non-trivial task and may elude the most expert control engineers. This results in a need for automatic techniques that are able to design controllers and to analyse a wide range of elaborate specifications. In this paper, we provide a general framework to encode system specifications and define corresponding certificates, and we present an automated approach to formally synthesise controllers and certificates. Our approach contributes to the broad field of safe learning for control, exploiting the flexibility of neural networks to provide candidate control and certificate functions, whilst using SMT-solvers to offer a formal guarantee of correctness. We test our framework by developing a prototype software tool, and assess its efficacy at verification via control and certificate synthesis over a large and varied suite of benchmarks.
摘要
一种新般的控制理论分支是证书学习,关注自动或控制模型的行为规范,并通过函数基本证明其分析。然而,实现这些复杂要求的控制器设计通常是一个非rivial任务,可能会让控制工程师感到惑乱。这导致了一种自动化的技术需求,能够设计控制器并分析各种复杂规范。在这篇论文中,我们提供一个通用框架来编码系统规范和相应的证书,并提出一种自动化的控制器和证书synthesis方法。我们的方法在安全学习控制领域中发挥作用,利用神经网络提供候选控制和证书函数,而使用SMT-解决器提供正式的正确性保证。我们测试了我们的框架,开发了一个原型软件工具,并通过控制和证书验证 benchmarks 进行验证。
Information Flow in Graph Neural Networks: A Clinical Triage Use Case
for: This paper aims to investigate the effect of embedding information flow within Graph Neural Networks (GNNs) on the prediction of links in Knowledge Graphs (KGs), with a specific use case in clinical triage.
methods: The paper proposes a mathematical model that decouples the GNN connectivity from the connectivity of the graph data, and evaluates the performance of GNNs with different connectivity strategies.
results: The results show that incorporating domain knowledge into the GNN connectivity leads to better performance than using the same connectivity as the KG or allowing unconstrained embedding propagation. Additionally, the paper finds that negative edges play a crucial role in achieving good predictions, and that using too many GNN layers can degrade performance.Here’s the simplified Chinese text for the three information points:
results: 结果表明,基于域知识的GNN连接策略可以在预测链接方面获得更好的性能,而使用同KG连接策略或允许无约 embedding 传播的策略则不如其他策略。此外,论文还发现,负边在预测链接方面扮演着关键的角色,并且使用过多GNN层可能会降低性能。Abstract
Graph Neural Networks (GNNs) have gained popularity in healthcare and other domains due to their ability to process multi-modal and multi-relational graphs. However, efficient training of GNNs remains challenging, with several open research questions. In this paper, we investigate how the flow of embedding information within GNNs affects the prediction of links in Knowledge Graphs (KGs). Specifically, we propose a mathematical model that decouples the GNN connectivity from the connectivity of the graph data and evaluate the performance of GNNs in a clinical triage use case. Our results demonstrate that incorporating domain knowledge into the GNN connectivity leads to better performance than using the same connectivity as the KG or allowing unconstrained embedding propagation. Moreover, we show that negative edges play a crucial role in achieving good predictions, and that using too many GNN layers can degrade performance.
摘要
graph neural networks (GNNs) 在医疗和其他领域中得到广泛应用,这是因为它们可以处理多modal和多关系图。然而,efficiently training GNNs 仍然是一个开放的研究问题,有几个未解决的问题。在这篇论文中,我们研究了在知识图(KGs)中预测链接的情况下,GNNs 中嵌入信息的流动如何影响预测性能。我们提出了一个数学模型,该模型将GNN 连接分离于图数据的连接,并评估了在临床排序用例中GNNs 的性能。我们的结果表明,在GNN 连接中 incorporate 域知识可以提高预测性能,并且使用相同的KG连接或不受限制的嵌入传播也可以提高性能。此外,我们发现,使用负边可以获得好的预测结果,并且使用太多GNN层可以降低性能。
Verifiable Fairness: Privacy-preserving Computation of Fairness for Machine Learning Systems
paper_authors: Ehsan Toreini, Maryam Mehrnezhad, Aad van Moorsel
for: 这篇论文目的是提出一种安全、可靠、隐私保护的 Fairness as a Service (FaaS) 协议,用于计算和验证机器学习 (ML) 模型的公平性。
methods: 该协议使用密文和零知识证明来保证数据和结果的隐私和有效性。它是模型无关的,可以用来审核任何 ML 模型的公平性。
results: 我们实现了 FaaS,并对一个公共数据集进行了实验,证明了它的性能和可行性。Abstract
Fair machine learning is a thriving and vibrant research topic. In this paper, we propose Fairness as a Service (FaaS), a secure, verifiable and privacy-preserving protocol to computes and verify the fairness of any machine learning (ML) model. In the deisgn of FaaS, the data and outcomes are represented through cryptograms to ensure privacy. Also, zero knowledge proofs guarantee the well-formedness of the cryptograms and underlying data. FaaS is model--agnostic and can support various fairness metrics; hence, it can be used as a service to audit the fairness of any ML model. Our solution requires no trusted third party or private channels for the computation of the fairness metric. The security guarantees and commitments are implemented in a way that every step is securely transparent and verifiable from the start to the end of the process. The cryptograms of all input data are publicly available for everyone, e.g., auditors, social activists and experts, to verify the correctness of the process. We implemented FaaS to investigate performance and demonstrate the successful use of FaaS for a publicly available data set with thousands of entries.
摘要
《公平机器学习》是一个繁殖的研究领域,在这篇论文中,我们提出了《公平服务(FaaS)》,一种安全、可靠、隐私保护的协议,用于计算和验证任何机器学习(ML)模型的公平性。在FaaS的设计中,数据和结果都是通过密文来保护隐私。此外,零知识证明 garantice了密文和下面数据的正确性。FaaS是模型无关的,可以支持多种公平度量,因此可以作为对任何ML模型的公平性进行审核的服务。我们的解决方案不需要任何不信任第三方或私人通道来计算公平度量。安全保证和承诺都是在安全可靠的方式实现的,从计算开始到结束,每一步都是安全透明的。所有输入数据的密文都是公开可见的,例如,审计人、社会活动人士和专家可以随时查看和验证过程的正确性。我们实现了FaaS,以 investigate性能和 demonstarte其在公共数据集上的成功应用。
Frequency Convergence of Complexon Shift Operators (Extended Version)
For: 这个论文研究了高阶结构( simplicial complex)模型的转移性,并使用了一种扩展的高阶 graphon( complexon)来模型这些结构。* Methods: 作者使用了一种基于 integral operator 的 complexon shift operator(CSO)来研究复杂子复杂体系的特性。他们还研究了 CSO 的 eigenvalues 和 eigenvectors,并与一种新的质量权重Matrix 之间的关系。* Results: 作者证明了,当 simplicial complex sequence converges to a complexon,then the eigenvalues of the corresponding CSOs converge to that of the limit complexon。这个结论通过一个数值实验得到了证明。这些结果提出了在大 simplicial complex 或 simplicial complex sequence 上进行学习转移性的可能性,这种框架将 graphon signal processing 扩展到更高阶结构上。Abstract
Topological Signal Processing (TSP) utilizes simplicial complexes to model structures with higher order than vertices and edges. In this paper, we study the transferability of TSP via a generalized higher-order version of graphon, known as complexon. We recall the notion of a complexon as the limit of a simplicial complex sequence. Inspired by the integral operator form of graphon shift operators, we construct a marginal complexon and complexon shift operator (CSO) according to components of all possible dimensions from the complexon. We investigate the CSO's eigenvalues and eigenvectors, and relate them to a new family of weighted adjacency matrices. We prove that when a simplicial complex sequence converges to a complexon, the eigenvalues of the corresponding CSOs converge to that of the limit complexon. This conclusion is further verified by a numerical experiment. These results hint at learning transferability on large simplicial complexes or simplicial complex sequences, which generalize the graphon signal processing framework.
摘要
《拓扑信号处理(TSP)利用 simplicial 复lexes 来模型高阶结构。在本文中,我们研究了 TSP 的传送性,使用一种普遍化高阶 graphon 的概念—— complexon。我们提及 complexon 的定义为 simplicial 复lex 序列的极限。受 graphon Shift 算子的积分Operator 启发,我们构建了 marginal complexon 和 complexon shift operator(CSO),根据所有可能的维度组成部分。我们研究了 CSO 的 eigenvalues 和 eigenvectors,并与它们相关的一个新的加权邻接矩阵。我们证明,当 simplicial 复lex 序列 converge 到 complexon 时,相应的 CSO 的 eigenvalues 会 converge 到 complexon 的限制。这一结论通过数值实验得到了证明。这些结果表明在大 simplicial 复lex 或 simplicial 复lex 序列上进行学习传送性是可能的,这种框架将 graphon 信号处理扩展到更高的维度。》Note: Simplified Chinese is a written language that uses simpler characters and grammar than Traditional Chinese. It is commonly used in mainland China and is the official language of the People's Republic of China.
A Perceptron-based Fine Approximation Technique for Linear Separation
results: 实验结果表明,该方法可以更高效地than Perceptron算法,特别是当数据集大于数据维度时。Abstract
This paper presents a novel online learning method that aims at finding a separator hyperplane between data points labelled as either positive or negative. Since weights and biases of artificial neurons can directly be related to hyperplanes in high-dimensional spaces, the technique is applicable to train perceptron-based binary classifiers in machine learning. In case of large or imbalanced data sets, use of analytical or gradient-based solutions can become prohibitive and impractical, where heuristics and approximation techniques are still applicable. The proposed method is based on the Perceptron algorithm, however, it tunes neuron weights in just the necessary extent during searching the separator hyperplane. Due to an appropriate transformation of the initial data set we need not to consider data labels, neither the bias term. respectively, reducing separability to a one-class classification problem. The presented method has proven converge; empirical results show that it can be more efficient than the Perceptron algorithm, especially, when the size of the data set exceeds data dimensionality.
摘要
The proposed method is based on the Perceptron algorithm, but it only tunes the neuron weights to the necessary extent during the search for the separator hyperplane. This is achieved through an appropriate transformation of the initial data set, which eliminates the need to consider data labels or the bias term. As a result, the method reduces the separability problem to a one-class classification problem.Empirical results show that the proposed method is more efficient than the Perceptron algorithm, especially when the size of the data set exceeds the dimensionality of the data. The method has been proven to converge, and the results demonstrate its effectiveness in finding the optimal separator hyperplane.
Normality Learning-based Graph Anomaly Detection via Multi-Scale Contrastive Learning
results: 实验结果显示,提出的方法可以提高GAD的检测性能(最高提升5.89%的AUC),相比之前的方法。code可以在https://github.com/FelixDJC/NLGAD中下载。Abstract
Graph anomaly detection (GAD) has attracted increasing attention in machine learning and data mining. Recent works have mainly focused on how to capture richer information to improve the quality of node embeddings for GAD. Despite their significant advances in detection performance, there is still a relative dearth of research on the properties of the task. GAD aims to discern the anomalies that deviate from most nodes. However, the model is prone to learn the pattern of normal samples which make up the majority of samples. Meanwhile, anomalies can be easily detected when their behaviors differ from normality. Therefore, the performance can be further improved by enhancing the ability to learn the normal pattern. To this end, we propose a normality learning-based GAD framework via multi-scale contrastive learning networks (NLGAD for abbreviation). Specifically, we first initialize the model with the contrastive networks on different scales. To provide sufficient and reliable normal nodes for normality learning, we design an effective hybrid strategy for normality selection. Finally, the model is refined with the only input of reliable normal nodes and learns a more accurate estimate of normality so that anomalous nodes can be more easily distinguished. Eventually, extensive experiments on six benchmark graph datasets demonstrate the effectiveness of our normality learning-based scheme on GAD. Notably, the proposed algorithm improves the detection performance (up to 5.89% AUC gain) compared with the state-of-the-art methods. The source code is released at https://github.com/FelixDJC/NLGAD.
摘要
《图像异常检测(GAD)在机器学习和数据挖掘领域受到了越来越多的关注。 latest works mainly focus on how to capture richer information to improve the quality of node embeddings for GAD. Despite their significant advances in detection performance, there is still a relative dearth of research on the properties of the task. GAD aims to discern the anomalies that deviate from most nodes. However, the model is prone to learn the pattern of normal samples which make up the majority of samples. Meanwhile, anomalies can be easily detected when their behaviors differ from normality. Therefore, the performance can be further improved by enhancing the ability to learn the normal pattern. To this end, we propose a normality learning-based GAD framework via multi-scale contrastive learning networks (NLGAD for abbreviation). Specifically, we first initialize the model with the contrastive networks on different scales. To provide sufficient and reliable normal nodes for normality learning, we design an effective hybrid strategy for normality selection. Finally, the model is refined with the only input of reliable normal nodes and learns a more accurate estimate of normality so that anomalous nodes can be more easily distinguished. Eventually, extensive experiments on six benchmark graph datasets demonstrate the effectiveness of our normality learning-based scheme on GAD. Notably, the proposed algorithm improves the detection performance (up to 5.89% AUC gain) compared with the state-of-the-art methods. The source code is released at https://github.com/FelixDJC/NLGAD.》Note: "GAD" in the text refers to "Graph Anomaly Detection".
Energy-Aware Federated Learning with Distributed User Sampling and Multichannel ALOHA
results: numerical results表明这种方法在某些重要的设置下具有更好的优化性和电池水平,并且比一种 нор based 解决方案更快地训练。Abstract
Distributed learning on edge devices has attracted increased attention with the advent of federated learning (FL). Notably, edge devices often have limited battery and heterogeneous energy availability, while multiple rounds are required in FL for convergence, intensifying the need for energy efficiency. Energy depletion may hinder the training process and the efficient utilization of the trained model. To solve these problems, this letter considers the integration of energy harvesting (EH) devices into a FL network with multi-channel ALOHA, while proposing a method to ensure both low energy outage probability and successful execution of future tasks. Numerical results demonstrate the effectiveness of this method, particularly in critical setups where the average energy income fails to cover the iteration cost. The method outperforms a norm based solution in terms of convergence time and battery level.
摘要
随着联合学习(FL)的出现,分布式学习在边缘设备上已经吸引了更多的注意力。然而,边缘设备通常具有有限的电池和多样化的能源供应,而多轮FL的需求会加剧能效率问题。如果不得要遇到能源枯竭,它会阻碍训练过程和模型的有效使用。为解决这些问题,本文考虑了在FL网络中 интеGRATE了能量收集(EH)设备,并提出了一种方法来保证低能源停机概率和未来任务的成功执行。numerical results表明该方法的效果,特别在 average energy income 不足以覆盖迭代成本的情况下。该方法在 convergence time 和电池水平方面也超越了 norm 基于解决方案。
Emergent Communication in Multi-Agent Reinforcement Learning for Future Wireless Networks
results: 本文介绍了EC-MARL在Future 6G无线网络中的应用潜在性和研究机遇,包括自动驾驶、机器人导航、飞行基站网络规划和智能城市应用。Abstract
In different wireless network scenarios, multiple network entities need to cooperate in order to achieve a common task with minimum delay and energy consumption. Future wireless networks mandate exchanging high dimensional data in dynamic and uncertain environments, therefore implementing communication control tasks becomes challenging and highly complex. Multi-agent reinforcement learning with emergent communication (EC-MARL) is a promising solution to address high dimensional continuous control problems with partially observable states in a cooperative fashion where agents build an emergent communication protocol to solve complex tasks. This paper articulates the importance of EC-MARL within the context of future 6G wireless networks, which imbues autonomous decision-making capabilities into network entities to solve complex tasks such as autonomous driving, robot navigation, flying base stations network planning, and smart city applications. An overview of EC-MARL algorithms and their design criteria are provided while presenting use cases and research opportunities on this emerging topic.
摘要
不同无线网络enario中,多个网络元件需要合作以实现最小的延迟和能源消耗,实现高维度资料交换。未来的无线网络将实施高维度连续控制问题,因此通信控制任务将变得更加困难和复杂。多智能推劝学习(EC-MARL)是一种可能的解决方案,它可以在不可见的状态下,通过协调多个智能推劝学习代理人,解决复杂的控制问题。本文说明了EC-MARL在未来6G无线网络中的重要性,具体来说,它将具有自主决策能力,实现无人驾驶、机器人导航、飞行基站网络规划和智慧城市应用等复杂任务。文中还提供了EC-MARL算法和设计需求,以及实际应用和研究机会。
Interpolation, Approximation and Controllability of Deep Neural Networks
for: investigate the expressive power of deep residual neural networks idealized as continuous dynamical systems through control theory.
methods: consider two properties from supervised learning: universal interpolation and universal approximation, and give a characterization of universal interpolation.
results: show that universal interpolation holds for essentially any architecture with non-linearity, and elucidate the relationship between universal interpolation and universal approximation in the context of general control systems.Abstract
We investigate the expressive power of deep residual neural networks idealized as continuous dynamical systems through control theory. Specifically, we consider two properties that arise from supervised learning, namely universal interpolation - the ability to match arbitrary input and target training samples - and the closely related notion of universal approximation - the ability to approximate input-target functional relationships via flow maps. Under the assumption of affine invariance of the control family, we give a characterisation of universal interpolation, showing that it holds for essentially any architecture with non-linearity. Furthermore, we elucidate the relationship between universal interpolation and universal approximation in the context of general control systems, showing that the two properties cannot be deduced from each other. At the same time, we identify conditions on the control family and the target function that ensures the equivalence of the two notions.
摘要
我们研究深度径 residual神经网络作为连续动力系统的表达力,通过控制理论进行调查。具体来说,我们考虑了两个从supervised learning中得到的性质, namely universal interpolation和 closely related universal approximation。我们假设控制家族具有平移变换的可变性,然后给出universal interpolation的特征化,证明它对于任意架构都成立。此外,我们还解释了universal interpolation和 universal approximation在通用控制系统中的关系,并证明这两个性质之间无法从一个到另一个推导。同时,我们还提出了控制家族和目标函数的条件,以确保这两个概念之间的等价性。
Learning Unbiased News Article Representations: A Knowledge-Infused Approach
paper_authors: Sadia Kamal, Jimmy Hartford, Jeremy Willis, Arunkumar Bagavathi
for: This paper aims to quantify the political leaning of online news articles and mitigate the algorithmic political bias in machine learning models used for this task.
methods: The proposed knowledge-infused deep learning model uses relatively reliable external data resources to learn unbiased representations of news articles based on their global and local contexts.
results: The proposed model outperforms baseline methods to predict the political leaning of news articles with up to 73% accuracy, mitigating algorithmic political bias.Here’s the Chinese translation of the three pieces of information:
for: 这篇论文目的是量化在线新闻文章的政治倾向,并使用机器学习模型来mitigate这种algorithmic political bias。
results: 提议的模型在测试集中的准确率达到73%,比基eline方法高。Abstract
Quantification of the political leaning of online news articles can aid in understanding the dynamics of political ideology in social groups and measures to mitigating them. However, predicting the accurate political leaning of a news article with machine learning models is a challenging task. This is due to (i) the political ideology of a news article is defined by several factors, and (ii) the innate nature of existing learning models to be biased with the political bias of the news publisher during the model training. There is only a limited number of methods to study the political leaning of news articles which also do not consider the algorithmic political bias which lowers the generalization of machine learning models to predict the political leaning of news articles published by any new news publishers. In this work, we propose a knowledge-infused deep learning model that utilizes relatively reliable external data resources to learn unbiased representations of news articles using their global and local contexts. We evaluate the proposed model by setting the data in such a way that news domains or news publishers in the test set are completely unseen during the training phase. With this setup we show that the proposed model mitigates algorithmic political bias and outperforms baseline methods to predict the political leaning of news articles with up to 73% accuracy.
摘要
政治倾向量化在在线新闻文章中可以帮助我们理解社会团体中政治 идеологи的动态和 mitigate其中的问题。然而,使用机器学习模型预测新闻文章的政治倾向是一项具有挑战性的任务。这是因为(i)政治 идеологи的定义是由多个因素组成,(ii)现有的学习模型具有新闻发布商的政治偏见,从而降低了机器学习模型预测新闻文章的政治倾向的泛化性。目前只有有限的方法可以研究新闻文章的政治倾向,而且这些方法不考虑算法政治偏见。在这项工作中,我们提出一种知识感知深度学习模型,该模型使用可靠的外部数据资源来学习不受偏见的新闻文章表示。我们通过将训练集中的新闻域或新闻发布商完全不见于测试集来评估该模型。 results show that our proposed model can mitigate algorithmic political bias and outperform baseline methods in predicting the political leaning of news articles with up to 73% accuracy.
CleanUNet 2: A Hybrid Speech Denoising Model on Waveform and Spectrogram
results: 对于多种 objective 和 subjective 评估指标,CleanUNet 2 的性能都高于先前的方法,并且可以在不同的 speech 质量和频率范围下提供高质量的 denoising 结果。Abstract
In this work, we present CleanUNet 2, a speech denoising model that combines the advantages of waveform denoiser and spectrogram denoiser and achieves the best of both worlds. CleanUNet 2 uses a two-stage framework inspired by popular speech synthesis methods that consist of a waveform model and a spectrogram model. Specifically, CleanUNet 2 builds upon CleanUNet, the state-of-the-art waveform denoiser, and further boosts its performance by taking predicted spectrograms from a spectrogram denoiser as the input. We demonstrate that CleanUNet 2 outperforms previous methods in terms of various objective and subjective evaluations.
摘要
在这项工作中,我们介绍CleanUNet 2,一种混合波形去噪器和spectrogram去噪器的语音去噪模型,实现了两者的优点。CleanUNet 2采用了两个阶段框架,受到流行的语音合成方法的启发,包括波形模型和spectrogram模型。具体来说,CleanUNet 2基于CleanUNet,当前的波形去噪器顶峰性能,再加以使用预测的spectrogram去噪器输入,进一步提高其性能。我们展示了CleanUNet 2在多个对象和主观评价中的优越性。
for: 这项研究证明了神经网络(NN)编码定理的对偶,即每个稳定地聚合的NN中的权重矩阵实际上编码了一个连续函数,该函数在训练集中的bounded域内准确地approximates the training dataset.
methods: 该研究使用了Eckart-Young定理和减少特征值分解来描述NN层的weight矩阵,从而理解 latent space manifold 的结构和NN层中的数学运算的几何特性。
results: 研究发现,NN可以通过储存容量来破坏维度约束,并且这两者是补偿的。此外,层矩阵分解(LMD)还发现了神经网络层的归一化矩阵和Hopfield网络和Transformer NN模型的最新发展之间的密切关系。Abstract
We prove the converse of the universal approximation theorem, i.e. a neural network (NN) encoding theorem which shows that for every stably converged NN of continuous activation functions, its weight matrix actually encodes a continuous function that approximates its training dataset to within a finite margin of error over a bounded domain. We further show that using the Eckart-Young theorem for truncated singular value decomposition of the weight matrix for every NN layer, we can illuminate the nature of the latent space manifold of the training dataset encoded and represented by every NN layer, and the geometric nature of the mathematical operations performed by each NN layer. Our results have implications for understanding how NNs break the curse of dimensionality by harnessing memory capacity for expressivity, and that the two are complementary. This Layer Matrix Decomposition (LMD) further suggests a close relationship between eigen-decomposition of NN layers and the latest advances in conceptualizations of Hopfield networks and Transformer NN models.
摘要
我们证明了射频学 theorem 的对偶,即一个神经网络(NN)编码 theorem,表明每个稳定地收敛的 NN 的权重矩阵实际上编码了一个连续函数,该函数在培训数据集中的 bounded edomain 内对数据集进行approximation,并且有finite 的误差。我们还证明了,通过 truncated singular value decomposition(Eckart-Young 定理)的 weight matrix 对每个 NN 层,可以照明 latent space manifold 被NN层所编码和表示的训练数据集的结构和 mathe matical 操作的几何特性。我们的结论有关于如何NN突破维度约束,通过吸收记忆来提高表达力,以及这两者之间的关系。此外,我们的层矩阵分解(LMD)还 suggets一种close relationship между NN层的eigen-decomposition和最新的 Hopfield 网络和Transformer NN 模型的概念化。
GLAD: Content-aware Dynamic Graphs For Log Anomaly Detection
results: 在三个数据集上测试 GLAD,结果表明 GLAD 能够有效地检测异常,异常的关系模式也能够被识别出来。Abstract
Logs play a crucial role in system monitoring and debugging by recording valuable system information, including events and states. Although various methods have been proposed to detect anomalies in log sequences, they often overlook the significance of considering relations among system components, such as services and users, which can be identified from log contents. Understanding these relations is vital for detecting anomalies and their underlying causes. To address this issue, we introduce GLAD, a Graph-based Log Anomaly Detection framework designed to detect relational anomalies in system logs. GLAD incorporates log semantics, relational patterns, and sequential patterns into a unified framework for anomaly detection. Specifically, GLAD first introduces a field extraction module that utilizes prompt-based few-shot learning to identify essential fields from log contents. Then GLAD constructs dynamic log graphs for sliding windows by interconnecting extracted fields and log events parsed from the log parser. These graphs represent events and fields as nodes and their relations as edges. Subsequently, GLAD utilizes a temporal-attentive graph edge anomaly detection model for identifying anomalous relations in these dynamic log graphs. This model employs a Graph Neural Network (GNN)-based encoder enhanced with transformers to capture content, structural and temporal features. We evaluate our proposed method on three datasets, and the results demonstrate the effectiveness of GLAD in detecting anomalies indicated by varying relational patterns.
摘要
系统监控和调试中,日志扮演着关键角色,记录了系统中的重要信息,包括事件和状态。虽然有很多方法用于检测日志序列中的异常,但它们通常忽略了考虑系统组件之间的关系,例如服务和用户,这些关系可以从日志内容中得到。了解这些关系非常重要,可以帮助检测异常和其下面的原因。为解决这个问题,我们提出了 GLAD,一个基于图的日志异常检测框架,可以检测系统日志中的关系异常。GLAD 将日志Semantics、关系模式和时序模式集成到一个统一的异常检测框架中。具体来说,GLAD 首先引入一个字段提取模块,使用提示based few-shot learning来确定日志内容中的重要字段。然后,GLAD 构建了动态日志图,将提取的字段和日志事件与日志解析器解析的日志事件相连接。这些图表示事件和字段作为节点,以及它们之间的关系作为边。接着,GLAD 利用一个 temporal-attentive 图边异常检测模型来检测动态日志图中的异常关系。这个模型使用图神经网络(GNN)基本encoder和转换器来捕捉内容、结构和时序特征。我们对 GLAD 进行了三个数据集的测试,结果表明 GLAD 能够根据不同的关系模式检测异常。
Quantized Non-Volatile Nanomagnetic Synapse based Autoencoder for Efficient Unsupervised Network Anomaly Detection
results: 该方法在 NSL-KDD 数据集上进行异常检测,并证明了对异常检测的改进(90.98%),并且在训练过程中减少了至少三个数量级的weight更新,从而实现了显著的能源节省。Abstract
In the autoencoder based anomaly detection paradigm, implementing the autoencoder in edge devices capable of learning in real-time is exceedingly challenging due to limited hardware, energy, and computational resources. We show that these limitations can be addressed by designing an autoencoder with low-resolution non-volatile memory-based synapses and employing an effective quantized neural network learning algorithm. We propose a ferromagnetic racetrack with engineered notches hosting a magnetic domain wall (DW) as the autoencoder synapses, where limited state (5-state) synaptic weights are manipulated by spin orbit torque (SOT) current pulses. The performance of anomaly detection of the proposed autoencoder model is evaluated on the NSL-KDD dataset. Limited resolution and DW device stochasticity aware training of the autoencoder is performed, which yields comparable anomaly detection performance to the autoencoder having floating-point precision weights. While the limited number of quantized states and the inherent stochastic nature of DW synaptic weights in nanoscale devices are known to negatively impact the performance, our hardware-aware training algorithm is shown to leverage these imperfect device characteristics to generate an improvement in anomaly detection accuracy (90.98%) compared to accuracy obtained with floating-point trained weights. Furthermore, our DW-based approach demonstrates a remarkable reduction of at least three orders of magnitude in weight updates during training compared to the floating-point approach, implying substantial energy savings for our method. This work could stimulate the development of extremely energy efficient non-volatile multi-state synapse-based processors that can perform real-time training and inference on the edge with unsupervised data.
摘要
在基于自适应器的异常检测 paradigm中,在边缘设备中实现自适应器是极其挑战性的,主要是因为边缘设备的硬件、能源和计算资源有限。我们表明,这些限制可以通过设计一个具有低分辨率、不可变存储器 synapse的 autoencoder,并使用有效的量化神经网络学习算法来解决。我们提议一种磁気轨道上的磁Domain墙(DW)作为 autoencoder synapse,其中有限状态(5状) synaptic веса通过磁力辐射(SOT)电流脉冲来 manipulate。我们对提议的 autoencoder 模型在 NSL-KDD 数据集上进行异常检测性能的评估。我们采用了限制分辨率和 DW 设备不确定性的意识training autoencoder,而不是使用浮点数精度 weights。虽然有限数量的量化状态和 nanoscale 设备内的固有不确定性会对性能产生负面影响,但我们的硬件意识训练算法可以利用这些不完美设备特性来提高异常检测精度(90.98%)相比浮点数训练 weights。此外,我们的 DW 方法显示在训练期间对 weight updates 的减少是至少三个数量级,这意味着我们的方法可以获得显著的能源抑制。这种工作可能会促进非常能效的非易失multi-state synapse基于处理器的开发,以便在边缘上进行实时训练和推理,并使用不supervised数据。
paper_authors: Fanwen Wang, Pedro F. Ferreira, Yinzhe Wu, Andrew D. Scott, Camila Munoz, Ke Wen, Yaqing Luo, Jiahao Huang, Sonia Nielles-Vallespin, Dudley J. Pennell, Guang Yang
For: 提供非侵入式的心肺功能测量方法* Methods: 使用深度学习基于B-spline网络对DT-CMR图像进行抗干扰注射注射的抽象注射注射* Results: 提高了图像使用效率、手动剪辑和计算速度Abstract
Diffusion tensor based cardiac magnetic resonance (DT-CMR) is a method capable of providing non-invasive measurements of myocardial microstructure. Image registration is essential to correct image shifts due to intra and inter breath-hold motion. Registration is challenging in DT-CMR due to the low signal-to-noise and various contrasts induced by the diffusion encoding in the myocardial and surrounding organs. Traditional deformable registration destroys the texture information while rigid registration inefficiently discards frames with local deformation. In this study, we explored the possibility of deep learning-based deformable registration on DT- CMR. Based on the noise suppression using low-rank features and diffusion encoding suppression using variational auto encoder-decoder, a B-spline based registration network extracted the displacement fields and maintained the texture features of DT-CMR. In this way, our method improved the efficiency of frame utilization, manual cropping, and computational speed.
摘要
Diffusion tensor based cardiac magnetic resonance (DT-CMR) 是一种可以提供非侵入式的肌肉微结构测量方法。图像匹配是必要的,以正确补做图像偏移 Due to intra-和inter- breath-hold 运动。但是,在 DT-CMR 中,匹配是具有挑战性,因为Diffusion encoding在肌肉和周围的器官中induced的低信号至噪声和多种对比。传统的可变形注册会destroys the texture information,而rigid注册则不 efficiently discards frames with local deformation。在这种研究中,我们探索了深度学习基于的可变形注册方法在 DT-CMR 中。基于噪声抑制使用低级特征和Diffusion encoding抑制使用自适应变换器-解码器,一个基于 B-spline 的注册网络提取出了displacement fields 并保留了 DT-CMR 中的 texture features。这种方法可以提高帧使用效率、手动剪辑和计算速度。
results: 比州前方法提高1.2dB至1.8dB的PSNR值,可见地提高图像的分辨率Abstract
This paper introduces a novel method for RGB-Guided Resolution Enhancement of infrared (IR) images called Guided IR Resolution Enhancement (GIRRE). In the area of single image super resolution (SISR) there exists a wide variety of algorithms like interpolation methods or neural networks to improve the spatial resolution of images. In contrast to SISR, even more information can be gathered on the recorded scene when using multiple cameras. In our setup, we are dealing with multi image super resolution, especially with stereo super resolution. We consider a color camera and an IR camera. Current IR sensors have a very low resolution compared to color sensors so that recent color sensors take up 100 times more pixels than IR sensors. To this end, GIRRE increases the spatial resolution of the low-resolution IR image. After that, the upscaled image is filtered with the aid of the high-resolution color image. We show that our method achieves an average PSNR gain of 1.2 dB and at best up to 1.8 dB compared to state-of-the-art methods, which is visually noticeable.
摘要
Here is the Simplified Chinese translation of the text:这篇论文提出了一种基于RGB指导的红外图像分辨率提高方法,称为指导红外分辨率提高(GIRRE)。与传统单张图像超解(SISR)方法不同的是,GIRRE利用多个摄像头来提高低分辨率红外图像的空间分辨率。具体来说,我们使用了一个颜色摄像头和一个红外摄像头,高分辨率颜色图像用于指导低分辨率红外图像的扩展。我们的方法实现了平均PSNR提升1.2dB,最高可达1.8dB compared to state-of-the-art方法,可见程度有所提高。
results: 研究在模拟 urbana 环境中进行了实验,并结果表明,通过使用计算机视觉技术可以增强探测敏感度和场景认知。Abstract
The addition of contextual sensors to mobile radiation sensors provides valuable information about radiological source encounters that can assist in adjudication of alarms. This study explores how computer-vision based object detection and tracking analyses can be used to augment radiological data from a mobile detector system. We study how contextual information (streaming video and LiDAR) can be used to associate dynamic pedestrians or vehicles with radiological alarms to enhance both situational awareness and detection sensitivity. Possible source encounters were staged in a mock urban environment where participants included pedestrians and vehicles moving in the vicinity of an intersection. Data was collected with a vehicle equipped with 6 NaI(Tl) 2 inch times 4 inch times 16 inch detectors in a hexagonal arrangement and multiple cameras, LiDARs, and an IMU. Physics-based models that describe the expected count rates from tracked objects are used to correlate vehicle and/or pedestrian trajectories to measured count-rate data through the use of Poisson maximum likelihood estimation and to discern between source-carrying and non-source-carrying objects. In this work, we demonstrate the capabilities of our source-object attribution approach as applied to a mobile detection system in the presence of moving sources to improve both detection sensitivity and situational awareness in a mock urban environment.
摘要
通过添加上下文感知器到移动辐射检测器,可以获得辐射源遇到的有价值信息,以帮助解决警报。本研究探讨了如何使用计算机视觉基于对象检测和跟踪分析来增强移动检测系统中的辐射数据。我们研究了如何使用上下文信息(流动视频和LiDAR)将动态行人或车辆与辐射警报相关联,以提高 situational awareness 和检测敏感度。在模拟城市环境中,参与者包括行人和车辆在交叉口附近移动。数据收集使用装备了6个 NaI(Tl) 2英寸×4英寸×16英寸检测器的车辆,以及多个摄像头、LiDAR和IMU。我们使用物理学基于的模型来 correlate 跟踪物体的轨迹和测量的计数率数据,并使用波利奥最大 likelihood 估计来准确归类源包含和不包含源的物体。在这种情况下,我们展示了我们的源-物体归类方法在移动检测系统中的应用,以提高检测敏感度和 situational awareness 在模拟城市环境中。
Bayesian topology inference on partially known networks from input-output pairs
results: 通过对实际和Synthetic网络的数据进行数值实验, authors 示出了 integrating prior knowledge 可以提高估计性能。Abstract
We propose a sampling algorithm to perform system identification from a set of input-output graph signal pairs. The dynamics of the systems we study are given by a partially known adjacency matrix and a generic parametric graph filter of unknown parameters. The methodology we employ is built upon the principles of annealed Langevin diffusion. This enables us to draw samples from the posterior distribution instead of following the classical approach of point estimation using maximum likelihood. We investigate how to harness the prior information inherent in a dataset of graphs of different sizes through the utilization of graph neural networks. We demonstrate, via numerical experiments involving both real-world and synthetic networks, that integrating prior knowledge into the estimation process enhances estimation performance.
摘要
我们提出一种采样算法来进行系统识别从输入-输出图信号对的集合。我们研究的系统动力是由一个部分知道的邻接矩阵和一个通用参数图 filter 的未知参数给出的。我们使用渐进的兰格易托 diffusion 的原则来实现采样,这使得我们可以从 posterior 分布中采样而不是使用传统的点估计方法。我们研究如何在不同大小的图 dataset 中利用图神经网络来加持先验知识。我们通过实验表明,通过将先验知识 integrate 到估计过程中,可以提高估计性能。Note that the translation is in Simplified Chinese, which is the standard writing system used in mainland China. If you prefer Traditional Chinese, I can provide that as well.
Design and Implementation of DC-to-5~MHz Wide-Bandwidth High-Power High-Fidelity Converter
results: 我们开发了一个轻量级嵌入式控制解决方案,包括改进的静止表示数字synthesizer和一种新的适应偏好束拟合 nearest-level 模ulation。这种解决方案可以有效解决高功率和高输出频率之间的冲突,并且可以在两个维度上扩展。我们的原型在DC至5MHz的频谱范围内,实现了 <18%的总电压误差,同时实现了 >5kW 的功率水平。我们进行了输出频率的扫描和两个混合试验,包括一个实际的脑神经科学应用的刺激脉冲和一个娱乐性的试验,重现了著名的Arecibo信号。Abstract
Advances in power electronics have made it possible to achieve high power levels, e.g., reaching GW in grids, or alternatively high output bandwidths, e.g., beyond MHz in communication. Achieving both simultaneously, however, remains challenging. Various applications, ranging from efficient multichannel wireless power transfer to cutting-edge medical and neuroscience applications, are demanding both high power and wide bandwidth. Conventional inverters can achieve high power and high quality at grid or specific frequency ranges but lose their fidelity when reaching higher output frequencies. Resonant circuits can promise a high output frequency but only a narrow bandwidth. We overcome the hardware challenges by combining gallium-nitride (GaN) transistors with modular cascaded double-H bridge circuits and control that can manage typical timing and balancing issues. We developed a lightweight embedded control solution that includes an improved look-up-table digital synthesizer and a novel adaptive-bias-elimination nearest-level modulation. This solution effectively solves the conflict between a high power level and high output bandwidth and can--in contrast to previous approaches--in principle be scaled in both dimensions. Our prototype exhibits a frequency range from DC to 5 MHz with <18% total voltage distortion across the entire frequency spectrum, while achieving a power level of >5 kW. We conducted tests by sweeping the output frequency and two channel-mixing trials, which included a practical magnetogenetics-oriented stimulation pulse and an entertaining trial to reproduce the famous Arecibo message with the current spectrum.
摘要
Conventional inverters can achieve high power and high quality at grid or specific frequency ranges, but lose their fidelity when reaching higher output frequencies. Resonant circuits can promise high output frequencies but only have a narrow bandwidth. To overcome these hardware challenges, we combined gallium-nitride (GaN) transistors with modular cascaded double-H bridge circuits and control that can manage typical timing and balancing issues.We developed a lightweight embedded control solution that includes an improved look-up-table digital synthesizer and a novel adaptive-bias-elimination nearest-level modulation. This solution effectively solves the conflict between high power level and high output bandwidth and can, in contrast to previous approaches, be scaled in both dimensions. Our prototype exhibits a frequency range from DC to 5 MHz with <18% total voltage distortion across the entire frequency spectrum, while achieving a power level of >5 kW.We conducted tests by sweeping the output frequency and two channel-mixing trials, including a practical magnetogenetics-oriented stimulation pulse and an entertaining trial to reproduce the famous Arecibo message with the current spectrum.
Opportunistic Reflection in Reconfigurable Intelligent Surface-Assisted Wireless Networks
results: OMUR方法可以提高网络吞吐量和功率耗用率,并且可以在多用户多RIS情况下实现高速下载和上传。此外,我们还提出了一种简化版OMUR方法,通过随机阶段偏移来避免RIS通道估计的复杂性。Abstract
This paper focuses on multiple-access protocol design in a wireless network assisted by multiple reconfigurable intelligent surfaces (RISs). By extending the existing approaches in single-user or single-RIS cases, we present two benchmark schemes for this multi-user multi-RIS scenario. Inspecting their shortcomings, a simple but efficient method coined opportunistic multi-user reflection (OMUR) is proposed. The key idea is to opportunistically select the best user as the anchor for optimizing the RISs, and non-orthogonally transmitting all users' signals simultaneously. A simplified version of OMUR exploiting random phase shifts is also proposed to avoid the complexity of RIS channel estimation.
摘要
In Simplified Chinese:这篇论文关注无线网络中多个用户和多个智能表面(RIS)之间的协议设计。我们通过扩展单用户或单RIS情况下的现有方法,提出了多用户多RIS场景中的两个标准方案。此外,我们还提出了一种名为启发式多用户反射(OMUR)的简单 yet efficient方法,其中选择最佳用户作为RIS优化的anchor,并同时非正交发送所有用户的信号。此外,我们还提出了一种使用随机频率偏移的简化版OMUR,以避免RIS频率估计的复杂性。
A Simple Multiple-Access Design for Reconfigurable Intelligent Surface-Aided Systems
results: 研究人员提出了一种简单而高效的方法,可以在实际系统中实现高性能。通过随机相位偏移来避免RIS通道估计高开销。Abstract
This paper focuses on the design of transmission methods and reflection optimization for a wireless system assisted by a single or multiple reconfigurable intelligent surfaces (RISs). The existing techniques are either too complex to implement in practical systems or too inefficient to achieve high performance. To overcome the shortcomings of the existing schemes, we propose a simple but efficient approach based on \textit{opportunistic reflection} and \textit{non-orthogonal transmission}. The key idea is opportunistically selecting the best user that can reap the maximal gain from the optimally reflected signals via RIS. That is to say, only the channel state information of the best user is used for RIS reflection optimization, which can in turn lower complexity substantially. In addition, the second user is selected to superpose its signal on that of the primary user, where the benefits of non-orthogonal transmission, i.e., high system capacity and improved user fairness, are obtained. Additionally, a simplified variant exploiting random phase shifts is proposed to avoid the high overhead of RIS channel estimation.
摘要
Distributed Adaptive Signal Fusion for Fractional Programs
paper_authors: Cem Ates Musluoglu, Alexander Bertrand
for: 解决卫星感知网络中的空间滤波优化问题
methods: 使用分布式适应Signal Fusion(DASF)框架和一种基于迭代的解决方法
results: 提高了计算效率和精度,并且能够在带宽限制下进行分布式计算Abstract
The distributed adaptive signal fusion (DASF) framework allows to solve spatial filtering optimization problems in a distributed and adaptive fashion over a bandwidth-constrained wireless sensor network. The DASF algorithm requires each node to sequentially build a compressed version of the original network-wide problem and solve it locally. However, these local problems can still result in a high computational load at the nodes, especially when the required solver is iterative. In this paper, we study the particular case of fractional programs, i.e., problems for which the objective function is a fraction of two continuous functions, which indeed require such iterative solvers. By exploiting the structure of a commonly used method for solving fractional programs and interleaving it with the iterations of the standard DASF algorithm, we obtain a distributed algorithm with a significantly reduced computational cost compared to the straightforward application of DASF as a meta-algorithm. We prove convergence and optimality of this "fractional DASF" (FDASF) algorithm and demonstrate its performance via numerical simulations.
摘要
distributed adaptive signal fusion (DASF) 框架可以在分布式和适应性的方式下解决宽频率约束无线传感网络上的空间滤波优化问题。 DASF 算法要求每个节点先后建立一个压缩版本的原始网络范围内问题,并解决它们本地。然而,这些本地问题仍可能导致节点上的计算负担很大,特别是当需要的解决器是迭代的。在这篇论文中,我们研究了分数程序,即两个连续函数的比例的问题。这些问题确实需要迭代的解决器。我们利用一种广泛使用的方法解决分数程序的方法,并与标准 DASF 算法的迭代结合起来,从而获得了一个分布式算法,与直接在 DASF 作为元算法应用时的计算成本相比,有显著的减少。我们证明了 FDASF 算法的收敛和优化性,并通过数字实验来证明其性能。
Base Station Beamforming Design for Near-field XL-IRS Beam Training
results: numerical results indicate that the proposed AO based BS beamforming design outperforms the SVD/angle based BS beamforming in terms of training accuracy and achievable received SNR.Abstract
Existing research on extremely large-scale intelligent reflecting surface (XL-IRS) beam training has assumed the far-field channel model for base station (BS)-IRS link. However, this approach may cause degraded beam training performance in practice due to the near-field channel model of the BS-IRS link. To address this issue, we propose two efficient schemes to optimize BS beamforming for improving the XL-IRS beam training performance. Specifically, the first scheme aims to maximize total received signal power on the XL-IRS, which generalizes the existing angle based BS beamforming design and can be resolved using the singular value decomposition (SVD) method. The second scheme aims to maximize the $\ell_1$-norm of incident signals on the XL-IRS, which is shown to achieve the maximum received power at the user. To solve the non-convex $\ell_1$-norm maximization problem, we propose an eficient algorithm by using the alternating optimization (AO) technique. Numerical results show that the proposed AO based BS beamforming design outperforms the SVD/angle based BS beamforming in terms of training accuracy and achievable received signal-to-noise ratio (SNR).
摘要
原研究中的巨大智能反射Surface(XL-IRS)的杆形训练假设了BS-IRS链路的远场通道模型。然而,这种方法可能会导致实际中的杆形训练性能下降,因为BS-IRS链路的近场通道模型。为解决这个问题,我们提出了两种高效的方案来优化BS杆形干扰。 Specifically, the first scheme aims to maximize the total received signal power on the XL-IRS, which generalizes the existing angle-based BS beamforming design and can be resolved using the singular value decomposition (SVD) method. The second scheme aims to maximize the $\ell_1$-norm of incident signals on the XL-IRS, which is shown to achieve the maximum received power at the user. To solve the non-convex $\ell_1$-norm maximization problem, we propose an efficient algorithm by using the alternating optimization (AO) technique. Numerical results show that the proposed AO-based BS beamforming design outperforms the SVD/angle-based BS beamforming in terms of training accuracy and achievable received signal-to-noise ratio (SNR).Here's the translation of the text in Traditional Chinese:先前的研究中的巨大智能反射Surface(XL-IRS)的杆形训练假设了BS-IRS链路的远场通道模型。然而,这种方法可能会导致实际中的杆形训练性能下降,因为BS-IRS链路的近场通道模型。为解决这个问题,我们提出了两种高效的方案来优化BS杆形干扰。 Specifically, the first scheme aims to maximize the total received signal power on the XL-IRS, which generalizes the existing angle-based BS beamforming design and can be resolved using the singular value decomposition (SVD) method. The second scheme aims to maximize the $\ell_1$-norm of incident signals on the XL-IRS, which is shown to achieve the maximum received power at the user. To solve the non-convex $\ell_1$-norm maximization problem, we propose an efficient algorithm by using the alternating optimization (AO) technique. Numerical results show that the proposed AO-based BS beamforming design outperforms the SVD/angle-based BS beamforming in terms of training accuracy and achievable received signal-to-noise ratio (SNR).
Comparing Iterative and Least-Squares Based Phase Noise Tracking in Receivers with 1-bit Quantization and Oversampling
results: Iterative phase noise tracking 在高信噪比下具有较低的估计误差方差,但是它对 spectral efficiency 的提高几乎只有微妙的提高,即高于 Nyquist 幂 ZXM 模ulation。Abstract
High data rates require vast bandwidths, that can be found in the sub-THz band, and high sampling frequencies, which are predicted to lead to a problematically high analog-to-digital converter (ADC) power consumption. It was proposed to use 1-bit ADCs to mitigate this problem. Moreover, oscillator phase noise is predicted to be especially high at sub-THz carrier frequencies. For synchronization the phase must be tracked based on 1-bit quantized observations. We study iterative data-aided phase estimation, i.e., the expectation-maximization and the Fisher-scoring algorithm, compared to least-squares (LS) phase estimation. For phase interpolation at the data symbols, we consider the Kalman filter and the Rauch-Tung-Striebel algorithm. Compared to LS estimation, iterative phase noise tracking leads to a significantly lower estimation error variance at high signal-to-noise ratios. However, its benefit for the spectral efficiency using zero-crossing modulation (ZXM) is limited to marginal gains for high faster-than-Nyquist signaling factors, i.e., higher order ZXM modulation.
摘要
高数据率需要庞大的带宽,可以在Sub-THz频段找到,同时需要高的采样频率,这将导致问题atically高的Analog-to-Digital Converter(ADC)电力消耗。提出使用1比特ADC来缓解这个问题。此外,oscillator阶段噪声预测在Sub-THz振荡频率下特别高。为了同步化,需要根据1比特量化的观察结果追踪阶段。我们研究了iterative数据援引phase估计算法,包括期望最大化算法和fisherscoring算法,与最小二乘(LS)phase估计相比。在数据符号上进行phase interpolating时,我们考虑了卡尔曼筛和Rauch-Tung-Striebel算法。与LS估计相比,iterative phase噪声追踪导致高信号噪声比在高信号至噪声比下显著降低。然而,对于频率使用零交叉模ulation(ZXM)的spectral efficiency来说,其利益受到高速于Nyquist频率因子的限制,即高阶ZXM模ulation。
Tuning of Ray-Based Channel Model for 5G Indoor Industrial Scenarios
paper_authors: Gurjot Singh Bhatia, Yoann Corre, Marco Di Renzo
for: 本文提出了一种用于生成5G工业互联网的决定性通道模型的创新方法。
methods: 本文使用了折射跟踪(RT)通道模拟器,可以很好地捕捉到各种工业环境下的具体特性。
results: 本文对3.7GHz和28GHz两频段的5G工业网络进行了比较,并与文献中的场测数据进行了比较,以生成准确的折射基本通道模型。Abstract
This paper presents an innovative method that can be used to produce deterministic channel models for 5G industrial internet-of-things (IIoT) scenarios. Ray-tracing (RT) channel emulation can capture many of the specific properties of a propagation scenario, which is incredibly beneficial when facing various industrial environments and deployment setups. But the environment's complexity, composed of many metallic objects of different sizes and shapes, pushes the RT tool to its limits. In particular, the scattering or diffusion phenomena can bring significant components. Thus, in this article, the Volcano RT channel simulation is tuned and benchmarked against field measurements found in the literature at two frequencies relevant to 5G industrial networks: 3.7 GHz (mid-band) and 28 GHz (millimeter-wave (mmWave) band), to produce calibrated ray-based channel model. Both specular and diffuse scattering contributions are calculated. Finally, the tuned RT data is compared to measured large-scale parameters, such as the power delay profile (PDP), the cumulative distribution function (CDF) of delay spreads (DSs), both in line-of-sight (LoS) and non-LoS (NLoS) situations and relevant IIoT channel properties are further explored.
摘要
To address these challenges, the Volcano RT channel simulation is tuned and benchmarked against field measurements found in the literature at two frequencies relevant to 5G industrial networks: 3.7 GHz (mid-band) and 28 GHz (millimeter-wave band). The calibrated ray-based channel model takes into account both specular and diffuse scattering contributions.The tuned RT data is compared to measured large-scale parameters, such as the power delay profile (PDP), the cumulative distribution function (CDF) of delay spreads (DSs), both in line-of-sight (LoS) and non-LoS (NLoS) situations. Additionally, relevant IIoT channel properties are further explored.
Hybrid NOMA assisted Integrated Sensing and Communication via RIS
results: 根据实验结果,提出的低复杂度交叉优化算法可以提高最小探测响应强度(MBPG),并且对探测和通信之间的负荷调整进行分析。Abstract
This paper investigates the optimization of reconfigurable intelligent surface (RIS) in an integrated sensing and communication (ISAC) system. \red{To meet the demand of growing number of devices, power domain non-orthogonal multiple access (NOMA) is considered. However, traditional NOMA with a large number of devices is challenging due to large decoding delay and propagation error introduced by successive interference cancellation (SIC). Thus, OMA is integrated into NOMA to support more devices}. We formulate a max-min problem to optimize the sensing beampattern \red{with constraints on communication rate}, through joint power allocation, active beamforming and RIS phase shift design. To solve the non-convex problem with a non-smooth objective function, we propose a low complexity alternating optimization (AO) algorithm, where a closed form expression for the intra-cluster power allocation (intra-CPA) is derived, and penalty and successive convex approximation (SCA) methods are used to optimize the beamforming and phase shift design. Simulation results show the effectiveness of the proposed algorithm in terms of improving minimum beampattern gain (MBPG) compared with other baselines. Furthermore, the trade-off between sensing and communication is analyzed and demonstrated in the simulation results.
摘要
The paper formulates a max-min problem to optimize the sensing beampattern with constraints on communication rate, using joint power allocation, active beamforming, and RIS phase shift design. To solve the non-convex problem with a non-smooth objective function, a low-complexity alternating optimization (AO) algorithm is proposed. This algorithm includes a closed-form expression for intra-cluster power allocation (intra-CPA), penalty and successive convex approximation (SCA) methods for optimizing beamforming and phase shift design.Simulation results show the effectiveness of the proposed algorithm in terms of improving minimum beampattern gain (MBPG) compared to other baselines. Additionally, the trade-off between sensing and communication is analyzed and demonstrated in the simulation results.
Dynamic Simulation of Three-Phase Induction Machines Under Eccentricity Conditions
paper_authors: Jianan Liu, Guanhua Ding, Yuxuan Xia, Jinping Sun, Tao Huang, Lihua Xie, Bing Zhu for:这篇论文旨在探讨在现实世界ADAS和自动驾驶场景中的在线3D多对象跟踪(MOT)问题,具体来说是对LiDAR和4D影像雷达点云的点对象跟踪(POT)和扩展对象跟踪(EOT)两种不同的方法进行系统性的调研。methods:这篇论文使用了三种不同的方法进行比较:传统的TBD-POT方法、最近研究的JDT-EOT方法以及我们提出的TBD-EOT方法。这些方法在两个开源的4D影像雷达数据集上进行了广泛的评估。results:实验结果表明,传统的TBD-POT方法在在线3D MOT中具有高跟踪性和低计算复杂度,而我们提出的TBD-EOT方法在某些情况下可能超越其性能。然而,JDT-EOT方法在评估场景中表现不佳,并且经过分析多种评价指标和视觉化分析后,我们提出了改进其性能的可能性。这些研究为未来4D影像雷达基于在线3D MOT的发展提供了首个和重要的指南。Abstract
Online 3D multi-object tracking (MOT) has recently received significant research interests due to the expanding demand of 3D perception in advanced driver assistance systems (ADAS) and autonomous driving (AD). Among the existing 3D MOT frameworks for ADAS and AD, conventional point object tracking (POT) framework using the tracking-by-detection (TBD) strategy has been well studied and accepted for LiDAR and 4D imaging radar point clouds. In contrast, extended object tracking (EOT), another important framework which accepts the joint-detection-and-tracking (JDT) strategy, has rarely been explored for online 3D MOT applications. This paper provides the first systematical investigation of the EOT framework for online 3D MOT in real-world ADAS and AD scenarios. Specifically, the widely accepted TBD-POT framework, the recently investigated JDT-EOT framework, and our proposed TBD-EOT framework are compared via extensive evaluations on two open source 4D imaging radar datasets: View-of-Delft and TJ4DRadSet. Experiment results demonstrate that the conventional TBD-POT framework remains preferable for online 3D MOT with high tracking performance and low computational complexity, while the proposed TBD-EOT framework has the potential to outperform it in certain situations. However, the results also show that the JDT-EOT framework encounters multiple problems and performs inadequately in evaluation scenarios. After analyzing the causes of these phenomena based on various evaluation metrics and visualizations, we provide possible guidelines to improve the performance of these MOT frameworks on real-world data. These provide the first benchmark and important insights for the future development of 4D imaging radar-based online 3D MOT.
摘要
在线3D多对象跟踪(MOT)最近受到了广泛的研究兴趣,这主要归功于自动驾驶系统(ADAS)和自主驱动(AD)的扩展需求。在现有的3D MOT框架中,使用跟踪检测(TBD)策略的点对象跟踪(POT)框架已经广泛研究和应用于激光雷达和4D图像雷达的点云数据。然而,接受联合检测与跟踪(JDT)策略的延长对象跟踪(EOT)框架在在线3D MOT应用中 rarely 被研究。本文提供了在实际ADAS和AD场景中的首次系统性的EOT框架比较,包括广泛接受的TBD-POT框架、最近研究的JDT-EOT框架以及我们的提议的TBD-EOT框架。经过广泛的评估于两个开源4D图像雷达数据集:View-of-Delft和TJ4DRadSet,实验结果显示, convent ional TBD-POT框架在在线3D MOT中具有高跟踪性和低计算复杂度,而我们提议的TBD-EOT框架在某些情况下具有超越TBD-POT框架的潜在优势。然而,结果也表明,JDT-EOT框架在评估场景中存在多个问题,并且表现不佳。经过根据多种评价指标和视觉化分析,我们提供了可能的改进方向,这些提供了首次的4D图像雷达基于在线3D MOT的 Referenced bencmark和重要的指导。
Non-parametric Ensemble Empirical Mode Decomposition for extracting weak features to identify bearing defects
results: 比较NPCEEMD和现有方法,NPCEEMD的模式混合度较低Here’s a more detailed explanation of each point:
for: The paper is written for identifying bearing defects using weak features, specifically using the proposed NPCEEMD method.
methods: The paper proposes a non-parametric complementary ensemble empirical mode decomposition (NPCEEMD) method for identifying bearing defects. This method is non-parametric, meaning it does not require defining the ideal signal-to-noise ratio (SNR) or the number of ensembles every time while processing the signals.
results: The paper presents simulation results showing that the proposed NPCEEMD method has less mode mixing than existing decomposition methods. Additionally, the method is applied to experimental data, and the resulting signal is computed using the envelope spectrum to confirm the presence of defects.Abstract
A non-parametric complementary ensemble empirical mode decomposition (NPCEEMD) is proposed for identifying bearing defects using weak features. NPCEEMD is non-parametric because, unlike existing decomposition methods such as ensemble empirical mode decomposition, it does not require defining the ideal SNR of noise and the number of ensembles, every time while processing the signals. The simulation results show that mode mixing in NPCEEMD is less than the existing decomposition methods. After conducting in-depth simulation analysis, the proposed method is applied to experimental data. The proposed NPCEEMD method works in following steps. First raw signal is obtained. Second, the obtained signal is decomposed. Then, the mutual information (MI) of the raw signal with NPCEEMD-generated IMFs is computed. Further IMFs with MI above 0.1 are selected and combined to form a resulting signal. Finally, envelope spectrum of resulting signal is computed to confirm the presence of defect.
摘要
“一种非Parametric complementary ensemble empirical mode decomposition(NPCEEMD)是用于发现推挽缺陷的方法,使用弱特征。NPCEEMD非 Parametric,因为它不需要每次处理信号时定义理想噪声水平和数量的集合。模拟结果显示,NPCEEMD中的模式混合度比现有的分解方法更低。经过深入的模拟分析,提议的方法应用到实验数据中。NPCEEMD方法的步骤如下:首先获取原始信号;第二,将获取的信号进行分解;然后计算原始信号与NPCEEMD生成的IMF的相互信息(MI);进一步选择MI大于0.1的IMF并将其组合成一个结果信号;最后,计算结果信号的响应特征来确认缺陷存在。”Note: "IMF" stands for "intrinsic mode function".
Massive Access of Static and Mobile Users via Reconfigurable Intelligent Surfaces: Protocol Design and Performance Analysis
results: simulations 表明,提议的 MAC 协议在系统吞吐量和访问公平性两个方面具有优于标准协议,但是存在访问公平性和系统吞吐量之间的贸易关系。Abstract
The envisioned wireless networks of the future entail the provisioning of massive numbers of connections, heterogeneous data traffic, ultra-high spectral efficiency, and low latency services. This vision is spurring research activities focused on defining a next generation multiple access (NGMA) protocol that can accommodate massive numbers of users in different resource blocks, thereby, achieving higher spectral efficiency and increased connectivity compared to conventional multiple access schemes. In this article, we present a multiple access scheme for NGMA in wireless communication systems assisted by multiple reconfigurable intelligent surfaces (RISs). In this regard, considering the practical scenario of static users operating together with mobile ones, we first study the interplay of the design of NGMA schemes and RIS phase configuration in terms of efficiency and complexity. Based on this, we then propose a multiple access framework for RIS-assisted communication systems, and we also design a medium access control (MAC) protocol incorporating RISs. In addition, we give a detailed performance analysis of the designed RIS-assisted MAC protocol. Our extensive simulation results demonstrate that the proposed MAC design outperforms the benchmarks in terms of system throughput and access fairness, and also reveal a trade-off relationship between the system throughput and fairness.
摘要
将来的无线网络视图包括提供庞大量连接、不同数据流量、ultra-高频率效率和低延迟服务。这一视图激发了研究人员关于定义下一代多接入(NGMA)协议的研究活动,以便在不同资源块中承载大量用户,从而实现更高的频率效率和连接性。在这篇文章中,我们介绍了一种基于RIS的NGMA协议在无线通信系统中的应用。在这种情况下,我们首先研究了NGMA协议的设计和RIS相位配置之间的关系,并对系统效率和复杂性进行了分析。然后,我们提出了一种基于RIS的通信系统多Access框架,并设计了一种嵌入RIS的媒体访问控制协议(MAC)协议。此外,我们对提出的MAC协议进行了详细的性能分析。我们的广泛的 simulations结果表明,提出的MAC设计在系统吞吐量和访问公平性方面都有优于标准准确,并且还发现了系统吞吐量和公平性之间的负相关性。
Performance Bounds for Near-Field Localization with Widely-Spaced Multi-Subarray mmWave/THz MIMO
results: 研究发现,在某些情况下,WSMS的CRB小于均勋阵列的CRB,并且提供了各种系统特性的可见性和实验验证。Abstract
This paper investigates the potential of near-field localization using widely-spaced multi-subarrays (WSMSs) and analyzing the corresponding angle and range Cram\'er-Rao bounds (CRBs). By employing the Riemann sum, closed-form CRB expressions are derived for the spherical wavefront-based WSMS (SW-WSMS). We find that the CRBs can be characterized by the angular span formed by the line connecting the array's two ends to the target, and the different WSMSs with same angular spans but different number of subarrays have identical normalized CRBs. We provide a theoretical proof that, in certain scenarios, the CRB of WSMSs is smaller than that of uniform arrays. We further yield the closed-form CRBs for the hybrid spherical and planar wavefront-based WSMS (HSPW-WSMS), and its components can be seen as decompositions of the parameters from the CRBs for the SW-WSMS. Simulations are conducted to validate the accuracy of the derived closed-form CRBs and provide further insights into various system characteristics. Basically, this paper underscores the high resolution of utilizing WSMS for localization, reinforces the validity of adopting the HSPW assumption, and, considering its applications in communications, indicates a promising outlook for integrated sensing and communications based on HSPW-WSMSs.
摘要
results: 该模型在26个下游任务中表现出色,达到了多个任务的顶峰性能,为实现通用的音频表示铺平了道路。Abstract
Audio-Language models jointly learn multimodal text and audio representations that enable Zero-Shot inference. Models rely on the encoders to create powerful representations of the input and generalize to multiple tasks ranging from sounds, music, and speech. Although models have achieved remarkable performance, there is still a performance gap with task-specific models. In this paper, we propose a Contrastive Language-Audio Pretraining model that is pretrained with a diverse collection of 4.6M audio-text pairs employing two innovative encoders for Zero-Shot inference. To learn audio representations, we trained an audio encoder on 22 audio tasks, instead of the standard training of sound event classification. To learn language representations, we trained an autoregressive decoder-only model instead of the standard encoder-only models. Then, the audio and language representations are brought into a joint multimodal space using Contrastive Learning. We used our encoders to improve the downstream performance by a margin. We extensively evaluated the generalization of our representations on 26 downstream tasks, the largest in the literature. Our model achieves state of the art results in several tasks leading the way towards general-purpose audio representations.
摘要
<> translate_language = zh-CN;>Audio-语言模型同时学习多Modal文本和音频表示,以实现零shot推理。模型依靠encoder创建强大的输入表示,并能泛化到多个任务,从声音、音乐到语音。虽然模型已达到了非常出色的性能,但还存在任务特定模型的性能差距。在这篇论文中,我们提出了一种对比语言-音频预训练模型,该模型通过使用多样化的4.6M audio-文本对employs two innovative encoders来实现零shot推理。为了学习音频表示,我们在22种音频任务上训练了音频encoder,而不是标准的声音分类训练。为了学习语言表示,我们训练了一个自然语言模型,而不是标准的encoder-only模型。然后,音频和语言表示被带入一个共同多模态空间,使用对比学习。我们使用我们的encoder来提高下游性能的边缘。我们广泛评估了我们的表示的泛化性能,并取得了Literature中最大的26个下游任务。我们的模型在一些任务中取得了状态的战果,领先于普适音频表示的发展。
Kernel Interpolation of Incident Sound Field in Region Including Scattering Objects
results: 实验结果表明,该方法比无分离的幂函数回归更高精度地估计入射声场。Abstract
A method for estimating the incident sound field inside a region containing scattering objects is proposed. The sound field estimation method has various applications, such as spatial audio capturing and spatial active noise control; however, most existing methods do not take into account the presence of scatterers within the target estimation region. Although several techniques exist that employ knowledge or measurements of the properties of the scattering objects, it is usually difficult to obtain them precisely in advance, and their properties may change during the estimation process. Our proposed method is based on the kernel ridge regression of the incident field, with a separation from the scattering field represented by a spherical wave function expansion, thus eliminating the need for prior modeling or measurements of the scatterers. Moreover, we introduce a weighting matrix to induce smoothness of the scattering field in the angular direction, which alleviates the effect of the truncation order of the expansion coefficients on the estimation accuracy. Experimental results indicate that the proposed method achieves a higher level of estimation accuracy than the kernel ridge regression without separation.
摘要
一种估计受到障碍物影响的受测 зву场的方法被提议。这种受测音场估算方法在各种应用中有重要意义,如空间音采和空间活动噪声控制,但大多数现有方法忽略了目标估算区域内的障碍物。虽然有一些技术利用了障碍物的性能知识或测量结果,但通常很难在进行估算之前 precisely 获取它们,而且它们在估算过程中可能会发生变化。我们提议的方法基于incident field的 kernel ridge regression,通过将散射场表示为球形傅里叶函数展开,因此无需在进行估算之前 precisely 知道障碍物的性能。此外,我们引入了一个权重矩阵来促进angular方向上的平滑性,这有助于减少 truncation order 对估算精度的影响。实验结果表明,我们提议的方法比kernel ridge regression无 separation 更高级别的估算精度。
Undecidability Results and Their Relevance in Modern Music Making
results: 研究提供了这些主张的理论证明,并证明了这些概念在实践中的实用性。 本研究的最终目标是促进对 Undecidability 在音乐中的新理解,强调其更广泛的应用和可能性,以及对计算机助理(以及传统)音乐创作的影响。Abstract
This paper delves into the intersection of computational theory and music, examining the concept of undecidability and its significant, yet overlooked, implications within the realm of modern music composition and production. It posits that undecidability, a principle traditionally associated with theoretical computer science, extends its relevance to the music industry. The study adopts a multidimensional approach, focusing on five key areas: (1) the Turing completeness of Ableton, a widely used digital audio workstation, (2) the undecidability of satisfiability in sound creation utilizing an array of effects, (3) the undecidability of constraints on polymeters in musical compositions, (4) the undecidability of satisfiability in just intonation harmony constraints, and (5) the undecidability of "new ordering systems". In addition to providing theoretical proof for these assertions, the paper elucidates the practical relevance of these concepts for practitioners outside the field of theoretical computer science. The ultimate aim is to foster a new understanding of undecidability in music, highlighting its broader applicability and potential to influence contemporary computer-assisted (and traditional) music making.
摘要
The Turing completeness of Ableton, a widely used digital audio workstation.2. The undecidability of satisfiability in sound creation using an array of effects.3. The undecidability of constraints on polymeters in musical compositions.4. The undecidability of satisfiability in just intonation harmony constraints.5. The undecidability of “new ordering systems”.In addition to providing theoretical proof for these assertions, the paper also illustrates the practical relevance of these concepts for practitioners outside the field of theoretical computer science. The ultimate aim is to foster a new understanding of undecidability in music, highlighting its broader applicability and potential to influence contemporary computer-assisted (and traditional) music making.
SlideSpeech: A Large-Scale Slide-Enriched Audio-Visual Corpus
results: 通过对听写材料进行分析,发现可以通过利用视频上的文本信息提高语音识别性能。Abstract
Multi-Modal automatic speech recognition (ASR) techniques aim to leverage additional modalities to improve the performance of speech recognition systems. While existing approaches primarily focus on video or contextual information, the utilization of extra supplementary textual information has been overlooked. Recognizing the abundance of online conference videos with slides, which provide rich domain-specific information in the form of text and images, we release SlideSpeech, a large-scale audio-visual corpus enriched with slides. The corpus contains 1,705 videos, 1,000+ hours, with 473 hours of high-quality transcribed speech. Moreover, the corpus contains a significant amount of real-time synchronized slides. In this work, we present the pipeline for constructing the corpus and propose baseline methods for utilizing text information in the visual slide context. Through the application of keyword extraction and contextual ASR methods in the benchmark system, we demonstrate the potential of improving speech recognition performance by incorporating textual information from supplementary video slides.
摘要
多Modal自动语音识别(ASR)技术目的在于利用其他modalities提高语音识别系统的性能。现有的方法主要关注视频或上下文信息,而使用补充的文本信息则被忽略。 recognizing the abundance of online conference videos with slides, which provide rich domain-specific information in the form of text and images, we release SlideSpeech, a large-scale audio-visual corpus enriched with slides. The corpus contains 1,705 videos, 1,000+ hours, with 473 hours of high-quality transcribed speech. Moreover, the corpus contains a significant amount of real-time synchronized slides. In this work, we present the pipeline for constructing the corpus and propose baseline methods for utilizing text information in the visual slide context. Through the application of keyword extraction and contextual ASR methods in the benchmark system, we demonstrate the potential of improving speech recognition performance by incorporating textual information from supplementary video slides.
Towards generalisable and calibrated synthetic speech detection with self-supervised representations
methods: 该论文使用预训练的自动学习表示 followed by a simple logistic regression classifier,以实现强大的普适性。
results: 该方法在新引入的 In-the-Wild 数据集上减少了平均错误率从 30% 降低到 8%,并且生成了更好地归一化的模型,可以用于下游任务,如不确定性估计。Abstract
Generalisation -- the ability of a model to perform well on unseen data -- is crucial for building reliable deep fake detectors. However, recent studies have shown that the current audio deep fake models fall short of this desideratum. In this paper we show that pretrained self-supervised representations followed by a simple logistic regression classifier achieve strong generalisation capabilities, reducing the equal error rate from 30% to 8% on the newly introduced In-the-Wild dataset. Importantly, this approach also produces considerably better calibrated models when compared to previous approaches. This means that we can trust our model's predictions more and use these for downstream tasks, such as uncertainty estimation. In particular, we show that the entropy of the estimated probabilities provides a reliable way of rejecting uncertain samples and further improving the accuracy.
摘要
“一般化”——模型在未见到的数据上表现良好的能力——是深圳识别器的重要需求。然而,最近的研究表明,现有的音频深圳模型尚未达到这个需求。在这篇论文中,我们展示了预训自动 represencing,然后跟着一个简单的逻辑函数分类器可以实现强大的一般化能力,从30%降至8%的平均错误率在新引入的 In-the-Wild 数据集上。此外,这种方法还生成了较好的条件分布,使得我们可以更加信任模型的预测,并将其用于下游任务,如uncertainty估计。具体来说,我们显示出估计概率的熵可以可靠地拒绝不确定的数据,并进一步提高准确率。
Enhancing Speaker Diarization with Large Language Models: A Contextual Beam Search Approach
results: 我们的实验结果表明,通过在 acoustics-only diarization 系统中添加 LLM 的 lexical knowledge,可以提高总的 speaker-attributed word error rate (SA-WER)。实验结果还表明,LLMs 可以为 speaker diarization 和其他语音处理任务提供更多的上下文信息,并且可以在不可见的上下文信息方面提供补做。Abstract
Large language models (LLMs) have shown great promise for capturing contextual information in natural language processing tasks. We propose a novel approach to speaker diarization that incorporates the prowess of LLMs to exploit contextual cues in human dialogues. Our method builds upon an acoustic-based speaker diarization system by adding lexical information from an LLM in the inference stage. We model the multi-modal decoding process probabilistically and perform joint acoustic and lexical beam search to incorporate cues from both modalities: audio and text. Our experiments demonstrate that infusing lexical knowledge from the LLM into an acoustics-only diarization system improves overall speaker-attributed word error rate (SA-WER). The experimental results show that LLMs can provide complementary information to acoustic models for the speaker diarization task via proposed beam search decoding approach showing up to 39.8% relative delta-SA-WER improvement from the baseline system. Thus, we substantiate that the proposed technique is able to exploit contextual information that is inaccessible to acoustics-only systems which is represented by speaker embeddings. In addition, these findings point to the potential of using LLMs to improve speaker diarization and other speech processing tasks by capturing semantic and contextual cues.
摘要
paper_authors: Lanhong Yao, Zheyuan Zhang, Ugur Demir, Elif Keles, Camila Vendrami, Emil Agarunov, Candice Bolan, Ivo Schoots, Marc Bruno, Rajesh Keswani, Frank Miller, Tamas Gonda, Cemal Yazici, Temel Tirkes, Michael Wallace, Concetto Spampinato, Ulas Bagci for:这篇论文的目的是为了提出一个新的电脑支持诊断架构,以帮助诊断潜在的胰脏癌症。methods:这篇论文使用了一种独特的自适应分 segmentation 策略来定义胰脏的边界,然后使用了一个新的深度学习架构来进行分类。results:这篇论文在使用多个检测方法时得到了超过 80% 的准确率,较之前的国际标准和出版研究更高。Abstract
Intraductal Papillary Mucinous Neoplasm (IPMN) cysts are pre-malignant pancreas lesions, and they can progress into pancreatic cancer. Therefore, detecting and stratifying their risk level is of ultimate importance for effective treatment planning and disease control. However, this is a highly challenging task because of the diverse and irregular shape, texture, and size of the IPMN cysts as well as the pancreas. In this study, we propose a novel computer-aided diagnosis pipeline for IPMN risk classification from multi-contrast MRI scans. Our proposed analysis framework includes an efficient volumetric self-adapting segmentation strategy for pancreas delineation, followed by a newly designed deep learning-based classification scheme with a radiomics-based predictive approach. We test our proposed decision-fusion model in multi-center data sets of 246 multi-contrast MRI scans and obtain superior performance to the state of the art (SOTA) in this field. Our ablation studies demonstrate the significance of both radiomics and deep learning modules for achieving the new SOTA performance compared to international guidelines and published studies (81.9\% vs 61.3\% in accuracy). Our findings have important implications for clinical decision-making. In a series of rigorous experiments on multi-center data sets (246 MRI scans from five centers), we achieved unprecedented performance (81.9\% accuracy).
摘要
卵巢瘤细胞肿(IPMN)是肝脏前期癌变,可能会进展到肝脏癌。因此,检测和分级IPMN的风险水平是肝脏疾病控制的关键。然而,这是一项非常具有挑战性的任务,因为IPMN瘤肿的形态、文本和大小均非常多样化和不规则。在这项研究中,我们提出了一种新的计算机助手诊断管线,用于IPMN风险分类从多方位MRI扫描中。我们的提议分析框架包括高效的自适应分割策略,以便识别肝脏,然后是一种新设计的深度学习基于的分类方案,以及一种基于 радиологи学的预测方法。我们在多个中心的数据集上测试了我们的提议决策融合模型,并获得了在这个领域的新的最高性能(81.9%)。我们的剖析研究表明, Both radiomics和深度学习模块对于实现新的最高性能具有重要的意义,相比于国际指南和已发表的研究(81.9% vs 61.3%)。我们的发现对临床决策有重要的意义。在多个中心的数据集上(246个MRI扫描)进行了严格的实验,我们实现了准确率81.9%。
Self-Correlation and Cross-Correlation Learning for Few-Shot Remote Sensing Image Semantic Segmentation
results: 我们在两个遥感图像集上进行了广泛的实验,并证明了我们的模型在几何学相关学习图像semantic segmentation中的有效性和优势。Abstract
Remote sensing image semantic segmentation is an important problem for remote sensing image interpretation. Although remarkable progress has been achieved, existing deep neural network methods suffer from the reliance on massive training data. Few-shot remote sensing semantic segmentation aims at learning to segment target objects from a query image using only a few annotated support images of the target class. Most existing few-shot learning methods stem primarily from their sole focus on extracting information from support images, thereby failing to effectively address the large variance in appearance and scales of geographic objects. To tackle these challenges, we propose a Self-Correlation and Cross-Correlation Learning Network for the few-shot remote sensing image semantic segmentation. Our model enhances the generalization by considering both self-correlation and cross-correlation between support and query images to make segmentation predictions. To further explore the self-correlation with the query image, we propose to adopt a classical spectral method to produce a class-agnostic segmentation mask based on the basic visual information of the image. Extensive experiments on two remote sensing image datasets demonstrate the effectiveness and superiority of our model in few-shot remote sensing image semantic segmentation. Code and models will be accessed at https://github.com/linhanwang/SCCNet.
摘要
<>将文本翻译成简化中文。<>遥感图像semantic segmentation是遥感图像解释中的重要问题。尽管已经取得了很大的进步,现有的深度神经网络方法却受到大量训练数据的依赖。几个shot遥感semantic segmentation目标是通过只使用少量标注图像来学习针对目标类图像进行分割。现有的几个shot学习方法主要围绕支持图像中的信息提取而设计,导致效果不够地处理大量的地理物体外观和比例差异。为解决这些挑战,我们提议一种基于自身相关和交叉相关学习网络(SCCNet)。我们的模型通过考虑支持和查询图像之间的自身相关和交叉相关来增强总体化。为进一步探索支持图像与查询图像之间的自身相关,我们提议采用一种经典的spectral方法生成基于图像的基本视觉信息的类型独立分割mask。我们的实验结果表明,我们的模型在几个shot遥感图像semantic segmentation中表现出色,并且与其他方法相比,具有更高的一致性和稳定性。代码和模型将在https://github.com/linhanwang/SCCNet上公开。
SCD-Net: Spatiotemporal Clues Disentanglement Network for Self-supervised Skeleton-based Action Recognition
results: 对于NTU-RGB+D(60&120)和PKU-MMD(I&II) datasets,我们的方法显著超越了现有的SOTA方法,并在多个下游任务上达到了优秀的性能,包括动作识别、动作检索、过渡学习和半监督学习。Abstract
Contrastive learning has achieved great success in skeleton-based action recognition. However, most existing approaches encode the skeleton sequences as entangled spatiotemporal representations and confine the contrasts to the same level of representation. Instead, this paper introduces a novel contrastive learning framework, namely Spatiotemporal Clues Disentanglement Network (SCD-Net). Specifically, we integrate the decoupling module with a feature extractor to derive explicit clues from spatial and temporal domains respectively. As for the training of SCD-Net, with a constructed global anchor, we encourage the interaction between the anchor and extracted clues. Further, we propose a new masking strategy with structural constraints to strengthen the contextual associations, leveraging the latest development from masked image modelling into the proposed SCD-Net. We conduct extensive evaluations on the NTU-RGB+D (60&120) and PKU-MMD (I&II) datasets, covering various downstream tasks such as action recognition, action retrieval, transfer learning, and semi-supervised learning. The experimental results demonstrate the effectiveness of our method, which outperforms the existing state-of-the-art (SOTA) approaches significantly.
摘要
<>对比学习在skeleton基因action认识中取得了很大成功。然而,现有的方法通常将骨架序列编码为杂合的空间时间表示,并将对比限制在同一个表示层次。而本文提出了一种新的对比学习框架,即空间时间准确网络(SCD-Net)。specifically,我们将分解模块与特征提取器结合,以 deriv implicit clue from空间和时间 DOMAIN separately。在 SCD-Net 的训练中,我们使用构建的全球锚点,并且鼓励锚点和提取的 clue 之间的互动。此外,我们提出了一种新的masking strategy,以强制Contextual associations,利用图像模型的最新发展。我们在 NTU-RGB+D (60&120) 和 PKU-MMD (I&II) datasets 进行了广泛的评估,覆盖了多种下游任务,如 action recognition、action retrieval、传输学习和半监督学习。实验结果表明,我们的方法可以很有效地与现有的 SOTA 方法进行比较。
Instance-Agnostic Geometry and Contact Dynamics Learning
results: 实验表明,本文的框架可以学习静止和凹形物体的形状和动力学特性,并超越现有的跟踪框架。Abstract
This work presents an instance-agnostic learning framework that fuses vision with dynamics to simultaneously learn shape, pose trajectories and physical properties via the use of geometry as a shared representation. Unlike many contact learning approaches that assume motion capture input and a known shape prior for the collision model, our proposed framework learns an object's geometric and dynamic properties from RGBD video, without requiring either category-level or instance-level shape priors. We integrate a vision system, BundleSDF, with a dynamics system, ContactNets and propose a cyclic training pipeline to use the output from the dynamics module to refine the poses and the geometry from the vision module, using perspective reprojection. Experiments demonstrate our framework's ability to learn the geometry and dynamics of rigid and convex objects and improve upon the current tracking framework.
摘要
Mobile Vision Transformer-based Visual Object Tracking
results: 在大规模数据集 GOT10k 和 TrackingNet 上,我们的 MobileViT-based Tracker(MVT)的性能超过了当前的轻量级跟踪器,并且在 GPU 上运行的速度比 DiMP-50 快得多。 Code 和模型可以从 https://github.com/goutamyg/MVT 获取。Abstract
The introduction of robust backbones, such as Vision Transformers, has improved the performance of object tracking algorithms in recent years. However, these state-of-the-art trackers are computationally expensive since they have a large number of model parameters and rely on specialized hardware (e.g., GPU) for faster inference. On the other hand, recent lightweight trackers are fast but are less accurate, especially on large-scale datasets. We propose a lightweight, accurate, and fast tracking algorithm using Mobile Vision Transformers (MobileViT) as the backbone for the first time. We also present a novel approach of fusing the template and search region representations in the MobileViT backbone, thereby generating superior feature encoding for target localization. The experimental results show that our MobileViT-based Tracker, MVT, surpasses the performance of recent lightweight trackers on the large-scale datasets GOT10k and TrackingNet, and with a high inference speed. In addition, our method outperforms the popular DiMP-50 tracker despite having 4.7 times fewer model parameters and running at 2.8 times its speed on a GPU. The tracker code and models are available at https://github.com/goutamyg/MVT
摘要
Introduction of robust backbones, such as Vision Transformers, has improved the performance of object tracking algorithms in recent years. However, these state-of-the-art trackers are computationally expensive due to their large number of model parameters and reliance on specialized hardware (e.g., GPU) for faster inference. On the other hand, recent lightweight trackers are fast but less accurate, especially on large-scale datasets. We propose a lightweight, accurate, and fast tracking algorithm using Mobile Vision Transformers (MobileViT) as the backbone for the first time. We also present a novel approach of fusing the template and search region representations in the MobileViT backbone, thereby generating superior feature encoding for target localization. Experimental results show that our MobileViT-based Tracker, MVT, outperforms the performance of recent lightweight trackers on large-scale datasets GOT10k and TrackingNet, with high inference speed. Additionally, our method outperforms the popular DiMP-50 tracker despite having 4.7 times fewer model parameters and running at 2.8 times its speed on a GPU. Tracker code and models are available at https://github.com/goutamyg/MVT.
results: 实验结果显示,KD-FixMatch在四个公开的数据集上都比 FixMatch 高效。 KD-FixMatch 可以从具有标签的数据集和无标签的数据集中获得更好的训练开头点,从而提高模型的性能。Abstract
Semi-supervised learning (SSL) has become a crucial approach in deep learning as a way to address the challenge of limited labeled data. The success of deep neural networks heavily relies on the availability of large-scale high-quality labeled data. However, the process of data labeling is time-consuming and unscalable, leading to shortages in labeled data. SSL aims to tackle this problem by leveraging additional unlabeled data in the training process. One of the popular SSL algorithms, FixMatch, trains identical weight-sharing teacher and student networks simultaneously using a siamese neural network (SNN). However, it is prone to performance degradation when the pseudo labels are heavily noisy in the early training stage. We present KD-FixMatch, a novel SSL algorithm that addresses the limitations of FixMatch by incorporating knowledge distillation. The algorithm utilizes a combination of sequential and simultaneous training of SNNs to enhance performance and reduce performance degradation. Firstly, an outer SNN is trained using labeled and unlabeled data. After that, the network of the well-trained outer SNN generates pseudo labels for the unlabeled data, from which a subset of unlabeled data with trusted pseudo labels is then carefully created through high-confidence sampling and deep embedding clustering. Finally, an inner SNN is trained with the labeled data, the unlabeled data, and the subset of unlabeled data with trusted pseudo labels. Experiments on four public data sets demonstrate that KD-FixMatch outperforms FixMatch in all cases. Our results indicate that KD-FixMatch has a better training starting point that leads to improved model performance compared to FixMatch.
摘要
深度学习中的半监督学习(SSL)已成为一种重要的方法,以解决深度神经网络的受限于标注数据的问题。然而,数据标注是一个时间consuming和不可扩展的过程,导致标注数据的短缺。SSL利用额外的无标注数据来提高深度神经网络的性能。 FixMatch 是一种流行的 SSL 算法,它使用同一个 weight-sharing 教师和学生网络同时训练,使用 Siamese 神经网络(SNN)。然而, FixMatch 在早期训练阶段 pseudo labels 具有很强的噪音,可能导致性能下降。我们提出了 KD-FixMatch,一种新的 SSL 算法,通过搅合顺序和同时训练 SNNs 来提高性能并降低性能下降。首先,外部 SNN 通过标注和无标注数据进行训练。然后,外部 SNN 的网络生成 pseudo labels для无标注数据,并从中选择一 subset of 无标注数据,通过高confidence 采样和深度嵌入划分来生成可信 pseudo labels。最后,内部 SNN 通过标注数据、无标注数据和可信 pseudo labels 进行训练。我们在四个公共数据集上进行了实验,结果显示 KD-FixMatch 在所有情况下都高于 FixMatch。我们的结果表明,KD-FixMatch 具有更好的训练开始点,导致深度神经网络的性能得到提高。
Rice Plant Disease Detection and Diagnosis using Deep Convolutional Neural Networks and Multispectral Imaging
results: 研究发现,使用多spectral和RGB通道作为输入,可以获得更高的F1准确率,比使用RGB输入only更高。Abstract
Rice is considered a strategic crop in Egypt as it is regularly consumed in the Egyptian people's diet. Even though Egypt is the highest rice producer in Africa with a share of 6 million tons per year, it still imports rice to satisfy its local needs due to production loss, especially due to rice disease. Rice blast disease is responsible for 30% loss in rice production worldwide. Therefore, it is crucial to target limiting yield damage by detecting rice crops diseases in its early stages. This paper introduces a public multispectral and RGB images dataset and a deep learning pipeline for rice plant disease detection using multi-modal data. The collected multispectral images consist of Red, Green and Near-Infrared channels and we show that using multispectral along with RGB channels as input archives a higher F1 accuracy compared to using RGB input only.
摘要
rice 被视为埃及的战略作物,因为它是埃及人的日常饮食中的重要组成部分。尽管埃及是非洲最大的rice生产国,每年生产600万吨rice,但它仍然需要从外部进口rice来满足本地需求,尤其是因为生产损失,如rice疾病。rice疾病是全球rice生产损失的30%原因。因此,它是非常重要的target limiting yield damage by detecting rice crops diseases in its early stages。这篇论文介绍了一个公共多spectral和RGB图像集和一个深度学习管道,用于rice plant疾病检测以及多Modal数据。收集的多spectral图像包括红、绿和近红外通道,我们表明使用多spectral和RGB通道作为输入,可以达到高于RGB输入Only的F1准确率。
SHIFT3D: Synthesizing Hard Inputs For Tricking 3D Detectors
paper_authors: Hongge Chen, Zhao Chen, Gregory P. Meyer, Dennis Park, Carl Vondrick, Ashish Shrivastava, Yuning Chai
for: 用于生成 Structurally plausible yet challenging 3D shapes,以检测3D object detectors的漏洞。
methods: 使用 signed distanced function (SDF) 表示 объек,并通过权重错误信号来缓慢塑形或对象的pose进行变化,以混淆下游3D检测器。
results: 通过SHIFT3D方法生成的对象 physically differ from baseline object, yet retain semantically recognizable shape,可以提供3D检测器的可读性失败模式,帮助预先发现3D感知系统中的安全隐患。Abstract
We present SHIFT3D, a differentiable pipeline for generating 3D shapes that are structurally plausible yet challenging to 3D object detectors. In safety-critical applications like autonomous driving, discovering such novel challenging objects can offer insight into unknown vulnerabilities of 3D detectors. By representing objects with a signed distanced function (SDF), we show that gradient error signals allow us to smoothly deform the shape or pose of a 3D object in order to confuse a downstream 3D detector. Importantly, the objects generated by SHIFT3D physically differ from the baseline object yet retain a semantically recognizable shape. Our approach provides interpretable failure modes for modern 3D object detectors, and can aid in preemptive discovery of potential safety risks within 3D perception systems before these risks become critical failures.
摘要
我团队现在发布了Shift3D,一个可微分的管道,用于生成3D形状,这些形状具有可能挑战3D对象检测器的结构性可能性,但又保持semanticognizable的形状。在自动驾驶等安全关键应用中,发现这些新挑战的对象可以提供对3D检测器的不明显漏洞的知识。通过使用签名距离函数(SDF)表示对象,我们表明了下游3D检测器的导数误差信号,允许我们缓和形状或pose的变化,以搅乱下游3D检测器。这些由Shift3D生成的对象与基准对象有所不同,但它们仍保持semanticognizable的形状。我们的方法提供了可读取的失败模式,可以帮助在3D感知系统中发现潜在的安全风险之前,以避免这些风险变成 kritical failures。
Divergences in Color Perception between Deep Neural Networks and Humans
results: 研究发现,当前的DNN架构(包括卷积神经网络和视Transformer)在处理图像中的色彩相似性评估中表现不佳,其Result与人类对图像的色彩相似性评估存在很大差异。而基于wavelet分解的一种可解释的和有理性的色彩模型则能够更好地预测人类对图像的色彩相似性评估。Abstract
Deep neural networks (DNNs) are increasingly proposed as models of human vision, bolstered by their impressive performance on image classification and object recognition tasks. Yet, the extent to which DNNs capture fundamental aspects of human vision such as color perception remains unclear. Here, we develop novel experiments for evaluating the perceptual coherence of color embeddings in DNNs, and we assess how well these algorithms predict human color similarity judgments collected via an online survey. We find that state-of-the-art DNN architectures $-$ including convolutional neural networks and vision transformers $-$ provide color similarity judgments that strikingly diverge from human color judgments of (i) images with controlled color properties, (ii) images generated from online searches, and (iii) real-world images from the canonical CIFAR-10 dataset. We compare DNN performance against an interpretable and cognitively plausible model of color perception based on wavelet decomposition, inspired by foundational theories in computational neuroscience. While one deep learning model $-$ a convolutional DNN trained on a style transfer task $-$ captures some aspects of human color perception, our wavelet algorithm provides more coherent color embeddings that better predict human color judgments compared to all DNNs we examine. These results hold when altering the high-level visual task used to train similar DNN architectures (e.g., image classification versus image segmentation), as well as when examining the color embeddings of different layers in a given DNN architecture. These findings break new ground in the effort to analyze the perceptual representations of machine learning algorithms and to improve their ability to serve as cognitively plausible models of human vision. Implications for machine learning, human perception, and embodied cognition are discussed.
摘要
深度神经网络(DNN)在人类视觉模型中得到了广泛的应用,它们的表现在图像分类和物体识别任务上是极其出色的。然而,DNN是否能够捕捉人类视觉的基本特征,如色彩感知,还未得到了清楚的回答。在这篇文章中,我们开发了一系列新的实验来评估DNN中色彩嵌入的听觉性,并对DNN的预测与人类的色彩相似性判断进行比较。我们发现,包括卷积神经网络和视Transformers在内的当今最佳DNN架构,其对人类色彩判断的预测与人类实际上的色彩判断存在很大差异。我们将DNN的表现与基于计算神经科学的理论 inspirited by wavelet decomposition的一种可读性和认知可能的色彩模型进行比较。结果显示,一个 convolutional DNN 在样式传递任务上训练后 capture some aspects of human color perception,但我们的波let算法提供了更听觉性的色彩嵌入,可以更好地预测人类色彩判断。这些结果在不同的高级视觉任务(如图像分类和图像分割)和不同层次的DNN架构中都保持相同。这些发现开拓了对机器学习算法的听觉表征分析和改进其为人类视觉可能的模型的领域。文章结尾,我们讨论了机器学习、人类感知和embodied cognition等领域的影响。
paper_authors: Ivan Grishchenko, Geng Yan, Eduard Gabriel Bazavan, Andrei Zanfir, Nikolai Chinaev, Karthik Raveendran, Matthias Grundmann, Cristian Sminchisescu
results: 该论文在现代手机上实现了30+ FPS的实时预测,并且可以从单个彩色RGB图像中获取52个混合形态系数。Abstract
We present Blendshapes GHUM, an on-device ML pipeline that predicts 52 facial blendshape coefficients at 30+ FPS on modern mobile phones, from a single monocular RGB image and enables facial motion capture applications like virtual avatars. Our main contributions are: i) an annotation-free offline method for obtaining blendshape coefficients from real-world human scans, ii) a lightweight real-time model that predicts blendshape coefficients based on facial landmarks.
摘要
我们现在提供Blendshapes GHUM,一个在设备上的机器学习管道,可以在现代手机上预测52个 facial blendshape 系数,每秒30多帧,从单个灰度RGB图像中获取,并实现了人脸动作捕捉应用,如虚拟人物。我们的主要贡献包括:1. 无需注释的离线方法,可以从真实世界人体扫描中获取 blendshape 系数。2. 轻量级的实时模型,可以基于人脸特征点预测 blendshape 系数。
LUNet: Deep Learning for the Segmentation of Arterioles and Venules in High Resolution Fundus Images
paper_authors: Jonathan Fhima, Jan Van Eijgen, Hana Kulenovic, Valérie Debeuf, Marie Vangilbergen, Marie-Isaline Billen, Heloïse Brackenier, Moti Freiman, Ingeborg Stalmans, Joachim A. Behar
For: The paper aims to develop a deep learning architecture for automated segmentation of retinal arterioles and venules in digital fundus images.* Methods: The proposed method, called LUNet, uses a double dilated convolutional block and a long tail to enhance the receptive field and resolution of the segmentation, respectively. The custom loss function emphasizes the continuity of the blood vessels.* Results: LUNet significantly outperforms two state-of-the-art segmentation algorithms on both the local test set and four external test sets simulating distribution shifts across ethnicity, comorbidities, and annotators.Here are the three key points in Simplified Chinese:* For: 这篇论文的目的是开发一种深度学习架构,用于自动分类retinal arterioles和venules在数字背景图像中。* Methods: 提出的方法(LUNet)使用了双倍扩展 convolutional block和长尾,以提高模型的接收场和分辨率。 Custom loss function 强调血管之间的连续性。* Results: LUNet 在本地测试集和四个外部测试集上,与两种现有的分割算法相比,显著地表现出优异。Abstract
The retina is the only part of the human body in which blood vessels can be accessed non-invasively using imaging techniques such as digital fundus images (DFI). The spatial distribution of the retinal microvasculature may change with cardiovascular diseases and thus the eyes may be regarded as a window to our hearts. Computerized segmentation of the retinal arterioles and venules (A/V) is essential for automated microvasculature analysis. Using active learning, we created a new DFI dataset containing 240 crowd-sourced manual A/V segmentations performed by fifteen medical students and reviewed by an ophthalmologist, and developed LUNet, a novel deep learning architecture for high resolution A/V segmentation. LUNet architecture includes a double dilated convolutional block that aims to enhance the receptive field of the model and reduce its parameter count. Furthermore, LUNet has a long tail that operates at high resolution to refine the segmentation. The custom loss function emphasizes the continuity of the blood vessels. LUNet is shown to significantly outperform two state-of-the-art segmentation algorithms on the local test set as well as on four external test sets simulating distribution shifts across ethnicity, comorbidities, and annotators. We make the newly created dataset open access (upon publication).
摘要
retina 是人体中唯一可以非侵入式通过图像技术访问血管的部分。通过计算机化分割,我们可以自动分析微血管网络。使用活动学习,我们创建了一个新的数字肉眼图像(DFI)集合,包括240名医学生 manually 手动 segmentation 和一位眼科医生审阅,并开发了 LUNet,一种新的深度学习架构,用于高分辨率血管分割。LUNet 架构包括一个双扩展 convolutional block,用于提高模型的感知范围和减少参数数量。此外,LUNet 还有一个长尾,用于在高分辨率下细化分割。我们定义了一个自定义损失函数,用于强调血管之间的连续性。LUNet 在本地测试集上以及四个外部测试集上,对比两种现有的分割算法,显著超越它们。我们将新创建的数据集开放给社区(在发表之前)。
TransferDoc: A Self-Supervised Transferable Document Representation Learning Model Unifying Vision and Language
results: 对于 downstream tasks, TransferDoc 模型表现出色,在 industrial evaluation scenario 中出perform other state-of-the-art approaches。Abstract
The field of visual document understanding has witnessed a rapid growth in emerging challenges and powerful multi-modal strategies. However, they rely on an extensive amount of document data to learn their pretext objectives in a ``pre-train-then-fine-tune'' paradigm and thus, suffer a significant performance drop in real-world online industrial settings. One major reason is the over-reliance on OCR engines to extract local positional information within a document page. Therefore, this hinders the model's generalizability, flexibility and robustness due to the lack of capturing global information within a document image. We introduce TransferDoc, a cross-modal transformer-based architecture pre-trained in a self-supervised fashion using three novel pretext objectives. TransferDoc learns richer semantic concepts by unifying language and visual representations, which enables the production of more transferable models. Besides, two novel downstream tasks have been introduced for a ``closer-to-real'' industrial evaluation scenario where TransferDoc outperforms other state-of-the-art approaches.
摘要
领域中的视觉文档理解受到了快速增长的挑战和强大的多modal策略的推动。然而,它们依赖于大量的文档数据来学习其预先定义的目标任务,因此在实际上的线上工业环境中表现不佳。主要的原因是依赖于 OCR 引擎来提取文档页面上的局部位置信息,从而忽略了文档图像中的全局信息。因此,这会导致模型的普适性、灵活性和可靠性受到限制。我们介绍了 TransferDoc,一种基于 transferred 的 cross-modal transformer 架构,通过三种新的预先定义目标来预处理。TransferDoc 学习了更加具有 semantic 的概念,使得生成更加可转移的模型。此外,我们还引入了两个新的下游任务,以提供更加真实的工业评估场景,在这些场景下,TransferDoc 超过了其他状态计算的方法。
Evaluating the Reliability of CNN Models on Classifying Traffic and Road Signs using LIME
results: 研究发现,使用LIME框架可以增加模型预测的可解释性和可靠性,并且可以提高模型在图像分类任务中的效果。结论:LIME是一种重要的工具,可以帮助改进机器学习模型的可解释性和可靠性,无论这些模型在图像分类任务中的性能如何。Abstract
The objective of this investigation is to evaluate and contrast the effectiveness of four state-of-the-art pre-trained models, ResNet-34, VGG-19, DenseNet-121, and Inception V3, in classifying traffic and road signs with the utilization of the GTSRB public dataset. The study focuses on evaluating the accuracy of these models' predictions as well as their ability to employ appropriate features for image categorization. To gain insights into the strengths and limitations of the model's predictions, the study employs the local interpretable model-agnostic explanations (LIME) framework. The findings of this experiment indicate that LIME is a crucial tool for improving the interpretability and dependability of machine learning models for image identification, regardless of the models achieving an f1 score of 0.99 on classifying traffic and road signs. The conclusion of this study has important ramifications for how these models are used in practice, as it is crucial to ensure that model predictions are founded on the pertinent image features.
摘要
本研究的目的是评估和比较四种现代预训练模型(ResNet-34、VGG-19、DenseNet-121和Inception V3)在使用GTSRB公共数据集进行交通和道路标志分类中的效果。研究旨在评估这些模型预测结果的准确性以及它们在图像分类中采用合适的特征。为了获得模型预测结果的含义和依赖性,研究使用了地方可解释性模型-无关的框架(LIME)。研究发现,LIME是一种重要的工具,可以提高图像识别模型的可解释性和可靠性,无论这些模型在交通和道路标志分类中的f1分数是0.99。这项研究的结论对于这些模型在实践中的使用具有重要的意义,因为需要确保模型的预测基于相关的图像特征。
Diffusion-Guided Reconstruction of Everyday Hand-Object Interaction Clips
results: 对6种物体类别的 egocentric video进行了实验,与单视和多视方法相比,显示了显著的改善,并且可以重建来自YouTube的任意clip,包括1人和3人互动。Abstract
We tackle the task of reconstructing hand-object interactions from short video clips. Given an input video, our approach casts 3D inference as a per-video optimization and recovers a neural 3D representation of the object shape, as well as the time-varying motion and hand articulation. While the input video naturally provides some multi-view cues to guide 3D inference, these are insufficient on their own due to occlusions and limited viewpoint variations. To obtain accurate 3D, we augment the multi-view signals with generic data-driven priors to guide reconstruction. Specifically, we learn a diffusion network to model the conditional distribution of (geometric) renderings of objects conditioned on hand configuration and category label, and leverage it as a prior to guide the novel-view renderings of the reconstructed scene. We empirically evaluate our approach on egocentric videos across 6 object categories, and observe significant improvements over prior single-view and multi-view methods. Finally, we demonstrate our system's ability to reconstruct arbitrary clips from YouTube, showing both 1st and 3rd person interactions.
摘要
我们面临着从短视频clip中重建手对象交互的任务。给定输入视频,我们的方法将3D推断视为每个视频优化,并recover一个神经网络3D表示物体形状,以及时变动和手部骨骼运动。虽然输入视频自然提供一些多视图提示来导引3D推断,但这些提示不够准确,因为 occlusion 和有限的视角变化。为了获得高精度3D,我们将多视图信号与通用数据驱动的假设相结合,以便重建。特别是,我们学习了一种扩散网络,用于模型(geometry)视图的条件分布,根据手姿和类别标签来Conditional Rendering。我们观察了在6种物品类别上的 Egocentric 视频中,我们的方法与之前的单视和多视方法进行比较,并观察到了显著的改进。最后,我们示出了我们系统可以从 YouTube 上 reconstruction任意clip,包括1人和3人交互。
ViHOPE: Visuotactile In-Hand Object 6D Pose Estimation with Shape Completion
results: 相比于直接将视听触感转换为6D姿态,该方法可以提高6D姿态估计的准确性。在视听形状完成任务中,我们超过了状态时的表现,并在Chamfer距离和Intersection of Union metric上减少了88%和265%的值。在视听姿态估计任务中,我们获得了35%和64%的位置和角度错误减少。Abstract
In this letter, we introduce ViHOPE, a novel framework for estimating the 6D pose of an in-hand object using visuotactile perception. Our key insight is that the accuracy of the 6D object pose estimate can be improved by explicitly completing the shape of the object. To this end, we introduce a novel visuotactile shape completion module that uses a conditional Generative Adversarial Network to complete the shape of an in-hand object based on volumetric representation. This approach improves over prior works that directly regress visuotactile observations to a 6D pose. By explicitly completing the shape of the in-hand object and jointly optimizing the shape completion and pose estimation tasks, we improve the accuracy of the 6D object pose estimate. We train and test our model on a synthetic dataset and compare it with the state-of-the-art. In the visuotactile shape completion task, we outperform the state-of-the-art by 265% using the Intersection of Union metric and achieve 88% lower Chamfer Distance. In the visuotactile pose estimation task, we present results that suggest our framework reduces position and angular errors by 35% and 64%, respectively. Furthermore, we ablate our framework to confirm the gain on the 6D object pose estimate from explicitly completing the shape. Ultimately, we show that our framework produces models that are robust to sim-to-real transfer on a real-world robot platform.
摘要
在这封信中,我们介绍了一个新的框架,即ViHOPE,用于估计手中物体的6D姿态。我们的关键发现是,通过显式完成手中物体的形状,可以提高6D物体姿态估计的准确性。为此,我们引入了一种新的视听形状完成模块,该模块使用conditional生成对抗网络来完成手中物体的形状基于体积表示。这种方法在直接从视听观测中预测6D姿态的先前作品上进行了改进。通过显式完成手中物体的形状并同时优化形状完成和姿态估计任务,我们提高了6D物体姿态估计的准确性。我们在一个 sintetic 数据集上训练和测试了我们的模型,并与当前最佳的状态 comparison。在视听形状完成任务中,我们使用Intersection of Union metric的比较,我们的模型在比较中高于当前最佳的状态,提高了88%的Chamfer Distance。在视听姿态估计任务中,我们提供了结果,表明我们的框架可以降低位置和angular error by 35%和64%,分别。此外,我们对我们的框架进行了剖析,并证明了在实际世界Robot平台上的robustness。最终,我们表明了我们的框架生成的模型在实际世界中是可靠的。
An Effective Two-stage Training Paradigm Detector for Small Dataset
results: 这种方法在 DelftBikes 测试集上得到了 30.4% 的平均精度,在领域中排名第四。Abstract
Learning from the limited amount of labeled data to the pre-train model has always been viewed as a challenging task. In this report, an effective and robust solution, the two-stage training paradigm YOLOv8 detector (TP-YOLOv8), is designed for the object detection track in VIPriors Challenge 2023. First, the backbone of YOLOv8 is pre-trained as the encoder using the masked image modeling technique. Then the detector is fine-tuned with elaborate augmentations. During the test stage, test-time augmentation (TTA) is used to enhance each model, and weighted box fusion (WBF) is implemented to further boost the performance. With the well-designed structure, our approach has achieved 30.4% average precision from 0.50 to 0.95 on the DelftBikes test set, ranking 4th on the leaderboard.
摘要
学习从有限的标注数据到预训练模型总是被视为一个挑战。在这份报告中,我们提出了一种有效和可靠的解决方案,即两stage训练 парадигмы(TP-YOLOv8),用于Object Detection track在VIPriors Challenge 2023中。首先,YOLOv8的背bone被用作Encoder,通过masked image modeling技术进行预训练。然后,检测器被细化地归一化。在测试阶段,使用测试时数学增强(TTA)以提高每个模型的性能,并实施Weighted Box Fusion(WBF)以进一步提高表现。基于我们的结构设计,我们的方法在DelftBikes测试集上达到了30.4%的平均精度,在领导者板块上排名第四。
CitDet: A Benchmark Dataset for Citrus Fruit Detection
results: 该论文通过提供高分辨率图像和高质量 bounding box 约束,实现了 citrus 果实检测的高精度。此外,论文还显示了 citrus 果实的位置可以准确地反映 citrus 树上HLB病毒的影响,并与产量估算有直接的相关性。Abstract
In this letter, we present a new dataset to advance the state of the art in detecting citrus fruit and accurately estimate yield on trees affected by the Huanglongbing (HLB) disease in orchard environments via imaging. Despite the fact that significant progress has been made in solving the fruit detection problem, the lack of publicly available datasets has complicated direct comparison of results. For instance, citrus detection has long been of interest in the agricultural research community, yet there is an absence of work, particularly involving public datasets of citrus affected by HLB. To address this issue, we enhance state-of-the-art object detection methods for use in typical orchard settings. Concretely, we provide high-resolution images of citrus trees located in an area known to be highly affected by HLB, along with high-quality bounding box annotations of citrus fruit. Fruit on both the trees and the ground are labeled to allow for identification of fruit location, which contributes to advancements in yield estimation and potential measure of HLB impact via fruit drop. The dataset consists of over 32,000 bounding box annotations for fruit instances contained in 579 high-resolution images. In summary, our contributions are the following: (i) we introduce a novel dataset along with baseline performance benchmarks on multiple contemporary object detection algorithms, (ii) we show the ability to accurately capture fruit location on tree or on ground, and finally (ii) we present a correlation of our results with yield estimations.
摘要
在这封信中,我们介绍了一个新的数据集,以提高检测柑橘果的状态艺术在受 huanglongbing(HLB)病虫影响的orchard环境中。尽管在检测果实问题上已经取得了重要进步,但由于缺乏公共可用的数据集,对结果的直接比较受到了限制。例如,柑橘果检测已经在农业研究领域产生了长期的兴趣,但是没有具有公共数据集的相关研究,特别是涉及HLB病虫的柑橘果检测。为解决这个问题,我们改进了现有的object detection方法,以适应典型的orchard环境。具体来说,我们提供了高分辨率的柑橘树图像,以及高质量的 bounding box 约束标注。 fruit的位置可以在树上或地上被确定,这有助于提高产量估算和HLB病虫的影响度量。数据集包含了32,000个 bounding box 约束标注,分别包含579个高分辨率的图像。简而言之,我们的贡献包括以下几点:1. 我们引入了一个新的数据集,并提供了多种现代 object detection 算法的基准性能评价。2. 我们能够准确地捕捉柑橘果的位置,包括树上和地上的位置。3. 我们对我们的结果与产量估算之间的相关性进行了报告。
Learning the Geodesic Embedding with Graph Neural Networks
results: 对ShapeNet进行了测试,并证明了对比 existed方法,具有orders of magnitude快的速度和相似或更好的准确性。同时,方法还能够在噪音和缺失点云上进行稳定的计算,并且具有强大的泛化能力。Abstract
We present GeGnn, a learning-based method for computing the approximate geodesic distance between two arbitrary points on discrete polyhedra surfaces with constant time complexity after fast precomputation. Previous relevant methods either focus on computing the geodesic distance between a single source and all destinations, which has linear complexity at least or require a long precomputation time. Our key idea is to train a graph neural network to embed an input mesh into a high-dimensional embedding space and compute the geodesic distance between a pair of points using the corresponding embedding vectors and a lightweight decoding function. To facilitate the learning of the embedding, we propose novel graph convolution and graph pooling modules that incorporate local geodesic information and are verified to be much more effective than previous designs. After training, our method requires only one forward pass of the network per mesh as precomputation. Then, we can compute the geodesic distance between a pair of points using our decoding function, which requires only several matrix multiplications and can be massively parallelized on GPUs. We verify the efficiency and effectiveness of our method on ShapeNet and demonstrate that our method is faster than existing methods by orders of magnitude while achieving comparable or better accuracy. Additionally, our method exhibits robustness on noisy and incomplete meshes and strong generalization ability on out-of-distribution meshes. The code and pretrained model can be found on https://github.com/IntelligentGeometry/GeGnn.
摘要
我们介绍GeGnn,一种基于学习的方法,用于计算粗略地理odesic距离 между两个随机点 на discrete polyhedra 表面上,具有常量时间复杂度以后快速预处理。先前的相关方法都是计算一个源点和所有目标点之间的 geodesic 距离,其复杂度至少是线性的,或者需要长时间的预处理。我们的关键想法是通过训练一个图神经网络,将输入网格 embedding 到高维 embedding 空间中,然后使用对应的 embedding вектор和一个轻量级的解码函数计算 geodesic 距离。为了促进 embedding 的学习,我们提出了新的图神经和图聚合模块,它们在地理odesic 信息的本地特征上吸收了更多的信息,并被证明是远胜先前的设计。之后,我们只需要在网格上进行一次网络前向传播,然后可以使用我们的解码函数计算 geodesic 距离,这需要只需要几个矩阵乘法和可以大规模并行化在 GPU 上。我们证明了我们的方法在ShapeNet上的效率和有效性,并表明我们的方法比现有方法速度多orders of magnitude,同时具有相似或更好的准确性。此外,我们的方法在噪音和缺失网格上展现了鲁棒性,并且在不同网格上的泛化能力强。我们的代码和预训练模型可以在https://github.com/IntelligentGeometry/GeGnn 找到。
UniSeg: A Unified Multi-Modal LiDAR Segmentation Network and the OpenPCSeg Codebase
results: 这个论文在三个公共评估benchmark上达到了优秀的结果,包括SemanticKITTI、nuScenes和Waymo Open Dataset(WOD)的LiDARsemantic分割和панOPTIC分割挑战赛。此外,这个论文还构建了OpenPCSeg代码库,它是最大和最全面的户外LiDAR分割代码库,包含大多数户外LiDAR分割算法和可重现实现。Abstract
Point-, voxel-, and range-views are three representative forms of point clouds. All of them have accurate 3D measurements but lack color and texture information. RGB images are a natural complement to these point cloud views and fully utilizing the comprehensive information of them benefits more robust perceptions. In this paper, we present a unified multi-modal LiDAR segmentation network, termed UniSeg, which leverages the information of RGB images and three views of the point cloud, and accomplishes semantic segmentation and panoptic segmentation simultaneously. Specifically, we first design the Learnable cross-Modal Association (LMA) module to automatically fuse voxel-view and range-view features with image features, which fully utilize the rich semantic information of images and are robust to calibration errors. Then, the enhanced voxel-view and range-view features are transformed to the point space,where three views of point cloud features are further fused adaptively by the Learnable cross-View Association module (LVA). Notably, UniSeg achieves promising results in three public benchmarks, i.e., SemanticKITTI, nuScenes, and Waymo Open Dataset (WOD); it ranks 1st on two challenges of two benchmarks, including the LiDAR semantic segmentation challenge of nuScenes and panoptic segmentation challenges of SemanticKITTI. Besides, we construct the OpenPCSeg codebase, which is the largest and most comprehensive outdoor LiDAR segmentation codebase. It contains most of the popular outdoor LiDAR segmentation algorithms and provides reproducible implementations. The OpenPCSeg codebase will be made publicly available at https://github.com/PJLab-ADG/PCSeg.
摘要
<> translate the following text into Simplified Chinese:Point-, voxel-, and range-views are three representative forms of point clouds. All of them have accurate 3D measurements but lack color and texture information. RGB images are a natural complement to these point cloud views and fully utilizing the comprehensive information of them benefits more robust perceptions. In this paper, we present a unified multi-modal LiDAR segmentation network, termed UniSeg, which leverages the information of RGB images and three views of the point cloud, and accomplishes semantic segmentation and panoptic segmentation simultaneously. Specifically, we first design the Learnable cross-Modal Association (LMA) module to automatically fuse voxel-view and range-view features with image features, which fully utilize the rich semantic information of images and are robust to calibration errors. Then, the enhanced voxel-view and range-view features are transformed to the point space,where three views of point cloud features are further fused adaptively by the Learnable cross-View Association module (LVA). Notably, UniSeg achieves promising results in three public benchmarks, i.e., SemanticKITTI, nuScenes, and Waymo Open Dataset (WOD); it ranks 1st on two challenges of two benchmarks, including the LiDAR semantic segmentation challenge of nuScenes and panoptic segmentation challenges of SemanticKITTI. Besides, we construct the OpenPCSeg codebase, which is the largest and most comprehensive outdoor LiDAR segmentation codebase. It contains most of the popular outdoor LiDAR segmentation algorithms and provides reproducible implementations. The OpenPCSeg codebase will be made publicly available at https://github.com/PJLab-ADG/PCSeg.Translated text in Simplified Chinese:Point-, voxel-, 和 range-views 是 LiDAR 点云的三种表示形式,它们都具有高精度的 3D 测量,但缺乏颜色和текстура信息。 RGB 图像是 LiDAR 点云视图的自然补充,完全利用图像的广泛 semantic 信息和 calibration 错误的Robustness。在这篇论文中,我们提出了一种多模态 LiDAR 分割网络,称为 UniSeg,它利用 RGB 图像和三种 LiDAR 点云视图,并在同时完成 semantic 分割和 panoptic 分割。具体来说,我们首先设计了 Learnable cross-Modal Association(LMA)模块,自动将 voxel-view 和 range-view 特征与图像特征进行关联,以全面利用图像的 semantic 信息和 calibration 错误的Robustness。然后,通过将增强的 voxel-view 和 range-view 特征转换到点空间,并在点云特征上进行 adaptive 关联,使得三种点云视图的特征得到了有效的融合。值得一提的是,UniSeg 在 SemanticKITTI、nuScenes 和 Waymo Open Dataset(WOD)三个公共 bencmarks 上表现出色,其中在 nuScenes 中的 LiDAR semantic 分割挑战和 SemanticKITTI 中的 panoptic 分割挑战上排名第一。此外,我们还建立了 OpenPCSeg 代码库,它是最大和最全面的 outdoor LiDAR 分割代码库,它包含了大多数流行的 outdoor LiDAR 分割算法,并提供了可重现的实现。OpenPCSeg 代码库将在 https://github.com/PJLab-ADG/PCSeg 上公开。
OpenFashionCLIP: Vision-and-Language Contrastive Learning with Open-Source Fashion Data
results: 实验结果表明,该方法具有显著的对外域泛化能力和稳定性,并在多个任务和benchmark上实现了STATE-OF-THE-ART的性能和准确率。Abstract
The inexorable growth of online shopping and e-commerce demands scalable and robust machine learning-based solutions to accommodate customer requirements. In the context of automatic tagging classification and multimodal retrieval, prior works either defined a low generalizable supervised learning approach or more reusable CLIP-based techniques while, however, training on closed source data. In this work, we propose OpenFashionCLIP, a vision-and-language contrastive learning method that only adopts open-source fashion data stemming from diverse domains, and characterized by varying degrees of specificity. Our approach is extensively validated across several tasks and benchmarks, and experimental results highlight a significant out-of-domain generalization capability and consistent improvements over state-of-the-art methods both in terms of accuracy and recall. Source code and trained models are publicly available at: https://github.com/aimagelab/open-fashion-clip.
摘要
“在线购物和电商的不断增长下,需要可扩展和可靠的机器学习解决方案来满足客户需求。在自动标签分类和多 modal 搜寻的上下文中,先前的工作 Either 定义了一个低通用的超vised 学习方法或更可重用的 CLIP 基本技术,而且将训练数据来源汇入关闭。在这个工作中,我们提出了 OpenFashionCLIP,一种视觉和语言对照学习方法,仅使用开源时装数据,来自多个领域,具有不同程度的特定性。我们的方法在多个任务和标准库中进行了广泛验证,实验结果显示了它在不同领域的外部测试能力和稳定性有很大提升,并且在精度和回传上也有显著的改善。源代码和训练模型可以在:https://github.com/aimagelab/open-fashion-clip 中下载。”
paper_authors: Misgina Tsighe Hagos, Niamh Belton, Kathleen M. Curran, Brian Mac Namee
for: This paper is written for training deep learning models with interactive learning approaches that provide transparent explanations of the model’s decisions.
methods: The paper proposes a method called distance-aware explanation loss, which adds a distance-based penalty to the categorical losses to train the model to focus on important regions of the training dataset.
results: The paper demonstrates the performance of the proposed method on three image classification tasks, and proposes an interpretability metric for evaluating visual feature-attribution based model explanations.Abstract
eXplanation Based Learning (XBL) is an interactive learning approach that provides a transparent method of training deep learning models by interacting with their explanations. XBL augments loss functions to penalize a model based on deviation of its explanations from user annotation of image features. The literature on XBL mostly depends on the intersection of visual model explanations and image feature annotations. We present a method to add a distance-aware explanation loss to categorical losses that trains a learner to focus on important regions of a training dataset. Distance is an appropriate approach for calculating explanation loss since visual model explanations such as Gradient-weighted Class Activation Mapping (Grad-CAMs) are not strictly bounded as annotations and their intersections may not provide complete information on the deviation of a model's focus from relevant image regions. In addition to assessing our model using existing metrics, we propose an interpretability metric for evaluating visual feature-attribution based model explanations that is more informative of the model's performance than existing metrics. We demonstrate performance of our proposed method on three image classification tasks.
摘要
<>批处学习(XBL)是一种互动式学习方法,它通过与模型的解释进行交互式训练深度学习模型。 XBL 将损失函数增强为对模型解释的偏差进行惩罚。 文献中的 XBL 主要基于图像特征标注和视觉模型解释的交集。 我们提出一种将距离意识到的解释损失添加到 categorical 损失函数中,以训练学习者专注于训练集中重要的区域。 距离是一种适当的方法 для计算解释损失,因为视觉模型的解释,如梯度权重分布图(Grad-CAMs),不是很准确的标注,而且它们的交集可能不会提供完整的信息,关于模型的ocus deviate from relevant image regions。 除了使用现有的metric来评估我们的模型,我们还提出了一种可以更加准确地评估视觉特征归属基于模型解释的解释度量。 我们在三个图像分类任务上展示了我们的提议的性能。Note: The translation is in Simplified Chinese, which is the standard writing system used in mainland China. If you need the translation in Traditional Chinese, please let me know.
On the detection of Out-Of-Distribution samples in Multiple Instance Learning
results: 实验结果显示,DICE emerges as the best-performing method overall,但它在某些数据集上表现不佳,这说明OOD检测在MIL框架下是一个复杂和挑战性的话题。Abstract
The deployment of machine learning solutions in real-world scenarios often involves addressing the challenge of out-of-distribution (OOD) detection. While significant efforts have been devoted to OOD detection in classical supervised settings, the context of weakly supervised learning, particularly the Multiple Instance Learning (MIL) framework, remains under-explored. In this study, we tackle this challenge by adapting post-hoc OOD detection methods to the MIL setting while introducing a novel benchmark specifically designed to assess OOD detection performance in weakly supervised scenarios. Extensive experiments based on diverse public datasets do not reveal a single method with a clear advantage over the others. Although DICE emerges as the best-performing method overall, it exhibits significant shortcomings on some datasets, emphasizing the complexity of this under-explored and challenging topic. Our findings shed light on the complex nature of OOD detection under the MIL framework, emphasizing the importance of developing novel, robust, and reliable methods that can generalize effectively in a weakly supervised context. The code for the paper is available here: https://github.com/loic-lb/OOD_MIL.
摘要
deployment of machine learning solutions in real-world scenarios 常会面临对不同数据分布(Out-of-distribution,OOD)的挑战。虽然对约束学习(classical supervised learning)的OOD检测得到了很多努力,但是多例学习(Multiple Instance Learning,MIL)框架仍然尚未得到了充分的研究。在这篇研究中,我们对MIL框架中的OOD检测进行了适应,并创建了特别设计来评估OOD检测性能的实验室环境。经过了各种公开数据集的广泛实验,我们发现没有一个方法能够在所有数据集中表现出色,DICE获得了整体最好的成绩,但是在某些数据集上它具有明显的缺陷,这说明了MIL框架下OOD检测的复杂性和挑战性。我们的发现强调了在弱监督下进行OOD检测的重要性,需要开发出新的、可靠、有效的方法,以应对这种挑战。研究代码可以在以下链接中找到:https://github.com/loic-lb/OOD_MIL。
ReSimAD: Zero-Shot 3D Domain Transfer for Autonomous Driving with Source Reconstruction and Target Simulation
results: 通过考虑不同的跨领域情况,如 Waymo-to-KITTI、Waymo-to-nuScenes 等,实验表明 ReSimAD 方法能够提高领域总结能力,甚至在 3D 预训练中表现出色。Abstract
Domain shifts such as sensor type changes and geographical situation variations are prevalent in Autonomous Driving (AD), which poses a challenge since AD model relying on the previous-domain knowledge can be hardly directly deployed to a new domain without additional costs. In this paper, we provide a new perspective and approach of alleviating the domain shifts, by proposing a Reconstruction-Simulation-Perception (ReSimAD) scheme. Specifically, the implicit reconstruction process is based on the knowledge from the previous old domain, aiming to convert the domain-related knowledge into domain-invariant representations, \textit{e.g.}, 3D scene-level meshes. Besides, the point clouds simulation process of multiple new domains is conditioned on the above reconstructed 3D meshes, where the target-domain-like simulation samples can be obtained, thus reducing the cost of collecting and annotating new-domain data for the subsequent perception process. For experiments, we consider different cross-domain situations such as Waymo-to-KITTI, Waymo-to-nuScenes, Waymo-to-ONCE, \textit{etc}, to verify the \textbf{zero-shot} target-domain perception using ReSimAD. Results demonstrate that our method is beneficial to boost the domain generalization ability, even promising for 3D pre-training.
摘要
域别变化(Domain shifts)是自动驾驶(Autonomous Driving)中的普遍问题,这对于自动驾驶模型而言是一大挑战,因为这些模型从前一个域别获得的知识很难直接应用到新的域别中。在这篇文章中,我们提出了一个新的见解和方法来解决域别变化问题,即提案了一个复原-模拟-观察(ReSimAD)方案。具体来说,这个方案的隐藏重建过程基于以前的旧域别知识,目的是将域别相关的知识转换为域别不对称的表示,例如3D场景级别的几何体。此外,模拟过程中的多个新域别的点 clouds是基于上述复原的3D几何体进行 conditioning,从而降低了获取和标注新域别数据的成本,以便在后续的观察过程中使用。实验中,我们考虑了不同的跨域别情况,如 Waymo-to-KITTI、Waymo-to-nuScenes、Waymo-to-ONCE等,以验证我们的方法在零数据目标域观察中的优化效果。结果显示,我们的方法可以增强域别普遍化能力,甚至对3D预训有推动作用。
Stream-based Active Learning by Exploiting Temporal Properties in Perception with Temporal Predicted Loss
paper_authors: Sebastian Schmidt, Stephan Günnemann
for: 这 paper 是关于活动学习(AL),它可以减少机器学习模型训练所需的标注数据量。
methods: 这 paper 使用了一种新的 temporal predicted loss(TPL)方法,它利用了图像流的时间性质量进行过滤。
results: 实验表明,TPL 方法可以大幅提高选择的多样性,同时比 pool-based 方法更快。TPL 还与州界前面的 pool-based 和流程基于的方法进行比较,显示它在不同的模型上表现更出色。Abstract
Active learning (AL) reduces the amount of labeled data needed to train a machine learning model by intelligently choosing which instances to label. Classic pool-based AL requires all data to be present in a datacenter, which can be challenging with the increasing amounts of data needed in deep learning. However, AL on mobile devices and robots, like autonomous cars, can filter the data from perception sensor streams before reaching the datacenter. We exploited the temporal properties for such image streams in our work and proposed the novel temporal predicted loss (TPL) method. To evaluate the stream-based setting properly, we introduced the GTA V streets and the A2D2 streets dataset and made both publicly available. Our experiments showed that our approach significantly improves the diversity of the selection while being an uncertainty-based method. As pool-based approaches are more common in perception applications, we derived a concept for comparing pool-based and stream-based AL, where TPL out-performed state-of-the-art pool- or stream-based approaches for different models. TPL demonstrated a gain of 2.5 precept points (pp) less required data while being significantly faster than pool-based methods.
摘要
aktive lerning (AL) 可以减少用于训练机器学习模型的标注数据量,通过智能地选择需要标注的实例。 классиic pool-based AL 需要所有数据都存在数据中心,这可能是深度学习中所需的数据量增加的挑战。然而, AL 在移动设备和机器人(如自动驾驶车)上可以从感知传感器流中筛选数据,以避免数据中心的压力。我们利用了图像流的时间性质量,并提出了新的时间预测损失(TPL)方法。为了正确评估流式设置,我们介绍了 GTA V 街道和 A2D2 街道数据集,并将其公开发布。我们的实验表明,我们的方法可以明显提高选择的多样性,而且是一种uncertainty-based方法。在感知应用中,pool-based方法更常见,因此我们提出了对 pool-based 和流式 AL 进行比较的概念,并证明 TPL 在不同的模型上都能够超过状态对照方法。TPL 在训练过程中得到了2.5个预言点(pp) menos的数据量,同时比 pool-based 方法更快。
paper_authors: Haoke Xiao, Lv Tang, Bo Li, Zhiming Luo, Shaozi Li
for: 本研究旨在模仿人类视觉系统中识别图像集中的共同和突出对象的能力。
methods: 我们提出了一种采用基础计算机视觉模型的零标注CoSOD框架,不需要任何训练过程。我们还 introduce two novel component:集体提示生成模块(GPG)和共聚焦度图生成模块(CMP)。
results: 我们对广泛使用的 dataset 进行评估,并观察了非常出色的结果。我们的方法超过了现有的无监督方法,甚至超过了2020年之前开发的完全监督方法,而且与2022年之前开发的一些完全监督方法保持竞争力。Abstract
Co-salient Object Detection (CoSOD) endeavors to replicate the human visual system's capacity to recognize common and salient objects within a collection of images. Despite recent advancements in deep learning models, these models still rely on training with well-annotated CoSOD datasets. The exploration of training-free zero-shot CoSOD frameworks has been limited. In this paper, taking inspiration from the zero-shot transfer capabilities of foundational computer vision models, we introduce the first zero-shot CoSOD framework that harnesses these models without any training process. To achieve this, we introduce two novel components in our proposed framework: the group prompt generation (GPG) module and the co-saliency map generation (CMP) module. We evaluate the framework's performance on widely-used datasets and observe impressive results. Our approach surpasses existing unsupervised methods and even outperforms fully supervised methods developed before 2020, while remaining competitive with some fully supervised methods developed before 2022.
摘要
Note: "CoSOD" is a abbreviation of "Co-salient Object Detection".Here's the translation in Simplified Chinese:CoSOD (共同焦点物体检测) 目标是模仿人类视觉系统在一组图像中识别共同焦点和突出的物体。尽管最近的深度学习模型已经取得了一定的进步,但这些模型仍然需要基于良好注解的 CoSOD 数据集进行训练。训练自由零shot CoSOD 框架的探索受限。在这篇论文中,我们 inspirited 由基础计算机视觉模型的零shot 传输能力,我们提出了第一个无需训练的 CoSOD 框架。为实现这一点,我们提出了两个新的组件:集体提示生成(GPG)模块和共同焦点图生成(CMP)模块。我们在广泛使用的数据集上评估了该框架的性能,并观察到了非常出色的结果。我们的方法超过了现有的无监督方法,甚至超过了2020年之前开发的完全监督方法,同时与2022年之前开发的一些完全监督方法维持竞争力。
COMPASS: High-Efficiency Deep Image Compression with Arbitrary-scale Spatial Scalability
paper_authors: Jongmin Park, Jooyoung Lee, Munchurl Kim
for: This paper proposes a novel neural network (NN)-based spatially scalable image compression method called COMPASS, which supports arbitrary-scale spatial scalability and has a flexible structure that allows for arbitrary determination of the number of layers and their respective scale factors during inference.
methods: The proposed COMPASS method uses an inter-layer arbitrary scale prediction method called LIFF based on implicit neural representation to reduce spatial redundancy between adjacent layers for arbitrary scale factors. The method also uses a combined RD loss function to effectively train multiple layers.
results: The experimental results show that COMPASS achieves BD-rate gain of -58.33% and -47.17% at maximum compared to SHVC and the state-of-the-art NN-based spatially scalable image compression method, respectively, for various combinations of scale factors. Additionally, COMPASS shows comparable or even better coding efficiency than the single-layer coding for various scale factors.Abstract
Recently, neural network (NN)-based image compression studies have actively been made and has shown impressive performance in comparison to traditional methods. However, most of the works have focused on non-scalable image compression (single-layer coding) while spatially scalable image compression has drawn less attention although it has many applications. In this paper, we propose a novel NN-based spatially scalable image compression method, called COMPASS, which supports arbitrary-scale spatial scalability. Our proposed COMPASS has a very flexible structure where the number of layers and their respective scale factors can be arbitrarily determined during inference. To reduce the spatial redundancy between adjacent layers for arbitrary scale factors, our COMPASS adopts an inter-layer arbitrary scale prediction method, called LIFF, based on implicit neural representation. We propose a combined RD loss function to effectively train multiple layers. Experimental results show that our COMPASS achieves BD-rate gain of -58.33% and -47.17% at maximum compared to SHVC and the state-of-the-art NN-based spatially scalable image compression method, respectively, for various combinations of scale factors. Our COMPASS also shows comparable or even better coding efficiency than the single-layer coding for various scale factors.
摘要
Dual-view Curricular Optimal Transport for Cross-lingual Cross-modal Retrieval
results: 我们在两个多语言图像文本数据集和一个视频文本数据集上进行了广泛的实验,结果表明我们的提议方法具有效果和稳定性,同时也能够扩展到跨语言图像文本基线和 OUT-OF-DOMAIN 数据。Abstract
Current research on cross-modal retrieval is mostly English-oriented, as the availability of a large number of English-oriented human-labeled vision-language corpora. In order to break the limit of non-English labeled data, cross-lingual cross-modal retrieval (CCR) has attracted increasing attention. Most CCR methods construct pseudo-parallel vision-language corpora via Machine Translation (MT) to achieve cross-lingual transfer. However, the translated sentences from MT are generally imperfect in describing the corresponding visual contents. Improperly assuming the pseudo-parallel data are correctly correlated will make the networks overfit to the noisy correspondence. Therefore, we propose Dual-view Curricular Optimal Transport (DCOT) to learn with noisy correspondence in CCR. In particular, we quantify the confidence of the sample pair correlation with optimal transport theory from both the cross-lingual and cross-modal views, and design dual-view curriculum learning to dynamically model the transportation costs according to the learning stage of the two views. Extensive experiments are conducted on two multilingual image-text datasets and one video-text dataset, and the results demonstrate the effectiveness and robustness of the proposed method. Besides, our proposed method also shows a good expansibility to cross-lingual image-text baselines and a decent generalization on out-of-domain data.
摘要
当前的跨模态检索研究大多是英语 oriented,因为有大量的英语人工标注的视觉语言数据。为了突破非英语标注数据的限制,跨语言跨模态检索(CCR)已经吸引了越来越多的关注。大多数 CCR 方法使用机器翻译(MT)construct pseudo-parallel vision-language corpora以实现跨语言传递。然而,由 MT 翻译的句子通常不能准确描述相应的视觉内容。如果不正确地假设 pseudo-parallel 数据是正确相关的,那么网络会适应到噪音相关性。因此,我们提出了双视角最优运输(DCOT),以学习噪音相关性在 CCR 中。具体来说,我们使用optimal transport理论从双视角量度来衡量样本对的相关性,并设计了双视角课程学习来动态模型运输成本 According to the learning stage of the two views。我们在两个多语言图像文本数据集和一个视频文本数据集上进行了广泛的实验,并得到了我们提posed方法的效果和稳定性。此外,我们的提出方法还显示了跨语言图像文本基elines的好适用性和out-of-domain数据的 descent generalization。
A Localization-to-Segmentation Framework for Automatic Tumor Segmentation in Whole-Body PET/CT Images
results: 在MICCAI 2023 Automated Lesion Segmentation in Whole-Body FDG-PET/CT challenge dataset上进行实验,我们的方法在预选测试集上取得了竞争性的结果,排名在前7名之间。Abstract
Fluorodeoxyglucose (FDG) positron emission tomography (PET) combined with computed tomography (CT) is considered the primary solution for detecting some cancers, such as lung cancer and melanoma. Automatic segmentation of tumors in PET/CT images can help reduce doctors' workload, thereby improving diagnostic quality. However, precise tumor segmentation is challenging due to the small size of many tumors and the similarity of high-uptake normal areas to the tumor regions. To address these issues, this paper proposes a localization-to-segmentation framework (L2SNet) for precise tumor segmentation. L2SNet first localizes the possible lesions in the lesion localization phase and then uses the location cues to shape the segmentation results in the lesion segmentation phase. To further improve the segmentation performance of L2SNet, we design an adaptive threshold scheme that takes the segmentation results of the two phases into consideration. The experiments with the MICCAI 2023 Automated Lesion Segmentation in Whole-Body FDG-PET/CT challenge dataset show that our method achieved a competitive result and was ranked in the top 7 methods on the preliminary test set. Our work is available at: https://github.com/MedCAI/L2SNet.
摘要
富含氟代谐糖蛋白 (FDG) позиトрон辐射Tomography (PET) 与计算机断层成像 (CT) 被视为检测一些肿瘤的首选方法,如肺癌和黑色素瘤。自动将肿瘤 segmented 出 PET/CT 图像中可以减轻医生的工作负担,从而提高诊断质量。然而,准确地 segmenting 肿瘤是困难的,因为许多肿瘤的体积很小,而且高吸收的正常区域与肿瘤区域相似。为解决这些问题,本文提出了一个 localization-to-segmentation 框架 (L2SNet),用于准确地 segmenting 肿瘤。L2SNet 首先在肿瘤localization阶段确定可能的肿瘤,然后使用位置提示来形成 segmentation 结果。为了进一步提高 L2SNet 的 segmentation 性能,我们设计了一种适应reshold scheme,该 scheme 根据 segmentation 结果来调整阈值。我们的实验表明,使用 MICCAI 2023 自动肿瘤 segmentation in Whole-Body FDG-PET/CT 挑战数据集,我们的方法在预liminary test set 上获得了竞争性的结果,并列在前 7 名。我们的工作可以在 GitHub 上找到:https://github.com/MedCAI/L2SNet。
Towards Content-based Pixel Retrieval in Revisited Oxford and Paris
results: 研究结果表明,像素检索任务对现有方法来说是一个挑战,与现有问题不同,这表明进一步研究可以提高内容基于像素检索的用户搜索体验。Abstract
This paper introduces the first two pixel retrieval benchmarks. Pixel retrieval is segmented instance retrieval. Like semantic segmentation extends classification to the pixel level, pixel retrieval is an extension of image retrieval and offers information about which pixels are related to the query object. In addition to retrieving images for the given query, it helps users quickly identify the query object in true positive images and exclude false positive images by denoting the correlated pixels. Our user study results show pixel-level annotation can significantly improve the user experience. Compared with semantic and instance segmentation, pixel retrieval requires a fine-grained recognition capability for variable-granularity targets. To this end, we propose pixel retrieval benchmarks named PROxford and PRParis, which are based on the widely used image retrieval datasets, ROxford and RParis. Three professional annotators label 5,942 images with two rounds of double-checking and refinement. Furthermore, we conduct extensive experiments and analysis on the SOTA methods in image search, image matching, detection, segmentation, and dense matching using our pixel retrieval benchmarks. Results show that the pixel retrieval task is challenging to these approaches and distinctive from existing problems, suggesting that further research can advance the content-based pixel-retrieval and thus user search experience. The datasets can be downloaded from \href{https://github.com/anguoyuan/Pixel_retrieval-Segmented_instance_retrieval}{this link}.
摘要
FlowIBR: Leveraging Pre-Training for Efficient Neural Image-Based Rendering of Dynamic Scenes
paper_authors: Marcel Büsching, Josef Bengtson, David Nilsson, Mårten Björkman
for: 这个论文的目的是为了实现单目视图合成动态场景。
methods: 这个方法使用神经网络基于图像渲染方法,在大量可用的静止场景数据集上进行预训练,然后使用每个场景的优化的场景流场谱来抵消场景动力,使摄像机辐射线与场景动力相抵消,以present the dynamic scene as if it were static to the rendering network。
results: 该方法可以在单个消费级GPU上实现near-optimal results,并且可以减少每个场景优化时间量by an order of magnitude。Abstract
We introduce a novel approach for monocular novel view synthesis of dynamic scenes. Existing techniques already show impressive rendering quality but tend to focus on optimization within a single scene without leveraging prior knowledge. This limitation has been primarily attributed to the lack of datasets of dynamic scenes available for training and the diversity of scene dynamics. Our method FlowIBR circumvents these issues by integrating a neural image-based rendering method, pre-trained on a large corpus of widely available static scenes, with a per-scene optimized scene flow field. Utilizing this flow field, we bend the camera rays to counteract the scene dynamics, thereby presenting the dynamic scene as if it were static to the rendering network. The proposed method reduces per-scene optimization time by an order of magnitude, achieving comparable results to existing methods - all on a single consumer-grade GPU.
摘要
我们介绍了一种新的笔脚渲染方法,用于单视图动态场景的synthesis。现有技术已经达到了非常出色的渲染质量,但它们通常会在单个场景中优化,而不是利用先前的知识。这种限制主要归结于动态场景的数据集不足以进行训练,以及场景动态的多样性。我们的方法FlowIBR通过将神经网络图像基于渲染方法与每个场景优化的场景流场融合,使用这个流场场融合了摄像机杆的弯曲,使动态场景被渲染为如果是静止的,并且通过这种方法可以大幅降低每个场景优化时间,达到与现有方法相同的结果,全部在单个Consumer-grade GPU上进行。
Treatment-aware Diffusion Probabilistic Model for Longitudinal MRI Generation and Diffuse Glioma Growth Prediction
paper_authors: Qinghui Liu, Elies Fuster-Garcia, Ivar Thokle Hovden, Donatas Sederevicius, Karoline Skogen, Bradley J MacIntosh, Edvard Grødem, Till Schellhorn, Petter Brandal, Atle Bjørnerud, Kyrre Eeg Emblem
For: This paper presents a novel end-to-end network for generating future tumor masks and realistic MRIs of how the tumor will look at any future time points for different treatment plans.* Methods: The approach is based on cutting-edge diffusion probabilistic models and deep-segmentation neural networks, using sequential multi-parametric magnetic resonance images (MRI) and treatment information as conditioning inputs to guide the generative diffusion process.* Results: The model has demonstrated promising performance across a range of tasks, including the generation of high-quality synthetic MRIs with tumor masks, time-series tumor segmentations, and uncertainty estimates.Abstract
Diffuse gliomas are malignant brain tumors that grow widespread through the brain. The complex interactions between neoplastic cells and normal tissue, as well as the treatment-induced changes often encountered, make glioma tumor growth modeling challenging. In this paper, we present a novel end-to-end network capable of generating future tumor masks and realistic MRIs of how the tumor will look at any future time points for different treatment plans. Our approach is based on cutting-edge diffusion probabilistic models and deep-segmentation neural networks. We included sequential multi-parametric magnetic resonance images (MRI) and treatment information as conditioning inputs to guide the generative diffusion process. This allows for tumor growth estimates at any given time point. We trained the model using real-world postoperative longitudinal MRI data with glioma tumor growth trajectories represented as tumor segmentation maps over time. The model has demonstrated promising performance across a range of tasks, including the generation of high-quality synthetic MRIs with tumor masks, time-series tumor segmentations, and uncertainty estimates. Combined with the treatment-aware generated MRIs, the tumor growth predictions with uncertainty estimates can provide useful information for clinical decision-making.
摘要
Diffuse gliomas 是肿瘤性脑肿,通过脑中扩散生长。这种肿瘤的复杂的Cellular interactions和正常组织之间的互动,以及治疗所引起的变化,使肿瘤增长模型变得具有挑战性。在这篇论文中,我们提出了一种新的端到端网络,能够生成未来肿瘤的面具和真实的MRI图像。我们的方法基于最新的扩散概率模型和深度分割神经网络。我们将继序多parametric磁共振成像(MRI)和治疗信息作为conditioning输入,以引导生成扩散过程。这allow for肿瘤增长估计在任何给定时间点。我们使用了真实世界后期手术 longitudinal MRI数据,其中肿瘤增长轨迹表示为肿瘤分割地图的时间序列。模型在多种任务上表现出色,包括生成高质量的Synthetic MRI面具、时间序列肿瘤分割、和不确定性估计。与治疗意识的生成MRI图像结合,肿瘤增长预测与不确定性估计可以为临床决策提供有用信息。
Two-Stage Hybrid Supervision Framework for Fast, Low-resource, and Accurate Organ and Pan-cancer Segmentation in Abdomen CT
methods: 这个方法基于自带训练和平均教师的混合模型,使用半标示和无标示数据进行分类。它还 introduce a two-stage segmentation pipeline和整个质量量-based input strategy来最大化分类精度。
results: 在FLARE2023的验证集上,这个方法实现了佳的分类性能,同时具有快速和低资源模型测试的能力。它的平均DSC分数为89.79%和45.55%,而执行时间和GPU内存使用率分别为11.25秒和9627.82MB。Abstract
Abdominal organ and tumour segmentation has many important clinical applications, such as organ quantification, surgical planning, and disease diagnosis. However, manual assessment is inherently subjective with considerable inter- and intra-expert variability. In the paper, we propose a hybrid supervised framework, StMt, that integrates self-training and mean teacher for the segmentation of abdominal organs and tumors using partially labeled and unlabeled data. We introduce a two-stage segmentation pipeline and whole-volume-based input strategy to maximize segmentation accuracy while meeting the requirements of inference time and GPU memory usage. Experiments on the validation set of FLARE2023 demonstrate that our method achieves excellent segmentation performance as well as fast and low-resource model inference. Our method achieved an average DSC score of 89.79\% and 45.55 \% for the organs and lesions on the validation set and the average running time and area under GPU memory-time cure are 11.25s and 9627.82MB, respectively.
摘要
腹部器官和肿瘤分割有很多重要的临床应用,如器官量化、手术规划和疾病诊断。然而,手动评估是内在含糊不清,存在较大的内外专家变化。在论文中,我们提议一种混合监督方案,StMt,该方案将自我教学和平均老师约束用于腹部器官和肿瘤分割,使用部分标注和无标注数据。我们提出了两stage分割管道和整体量化输入策略,以便最大化分割准确性,同时满足推理时间和GPU内存使用的要求。在FLARE2023验证集上进行了实验,我们的方法实现了优秀的分割性能,同时具有快速和低资源模型推理。我们的方法在验证集上获得了89.79%的DSC分数和45.55%的lesion DSC分数,平均执行时间为11.25秒,GPU内存使用率为9627.82MB。
for: robust single rotation averaging to handle extremely large fractions of outliers
methods: minimize total truncated least unsquared deviations (TLUD) cost of geodesic distances, consisting of three steps: consider each input rotation as a potential initial solution, obtain the inlier set using the initial solution, and iteratively compute the geodesic $L_1$-mean of the inliers using the Weiszfeld algorithm on $SO(3)$
results: outperform the current state of the art in robustness against up to 99% outliers given a sufficient number of accurate inliersAbstract
In this work, we propose a novel method for robust single rotation averaging that can efficiently handle an extremely large fraction of outliers. Our approach is to minimize the total truncated least unsquared deviations (TLUD) cost of geodesic distances. The proposed algorithm consists of three steps: First, we consider each input rotation as a potential initial solution and choose the one that yields the least sum of truncated chordal deviations. Next, we obtain the inlier set using the initial solution and compute its chordal $L_2$-mean. Finally, starting from this estimate, we iteratively compute the geodesic $L_1$-mean of the inliers using the Weiszfeld algorithm on $SO(3)$. An extensive evaluation shows that our method is robust against up to 99% outliers given a sufficient number of accurate inliers, outperforming the current state of the art.
摘要
在这个研究中,我们提出了一种新的稳定单旋转平均方法,可以高效处理极高比例的异常值。我们的方法是将总 truncated least unsquared deviations(TLUD)成本最小化。我们的算法包括三步:首先,我们考虑每个输入旋转作为可能的初始解,选择它们的总和最小的 truncated chordal deviations 成本。接下来,我们使用初始解来获取准确的集合,并计算其圆柱 $L_2$-平均。最后,我们从这个估计开始,iteratively 使用 Weiszfeld 算法在 $SO(3)$ 上计算 geodesic $L_1$-平均。我们的方法可以快速处理大量数据,并且对于具有足够多准确准确的数据,可以快速地减少异常值的影响。
Collective PV-RCNN: A Novel Fusion Technique using Collective Detections for Enhanced Local LiDAR-Based Perception
results: 提出了一种新的拟合方法,可以将共同探测结果融合到本地LiDAR探测管道中,以提高自动驾驶车辆的环境感知能力。Abstract
Comprehensive perception of the environment is crucial for the safe operation of autonomous vehicles. However, the perception capabilities of autonomous vehicles are limited due to occlusions, limited sensor ranges, or environmental influences. Collective Perception (CP) aims to mitigate these problems by enabling the exchange of information between vehicles. A major challenge in CP is the fusion of the exchanged information. Due to the enormous bandwidth requirement of early fusion approaches and the interchangeability issues of intermediate fusion approaches, only the late fusion of shared detections is practical. Current late fusion approaches neglect valuable information for local detection, this is why we propose a novel fusion method to fuse the detections of cooperative vehicles within the local LiDAR-based detection pipeline. Therefore, we present Collective PV-RCNN (CPV-RCNN), which extends the PV-RCNN++ framework to fuse collective detections. Code is available at https://github.com/ekut-es
摘要
全面的环境感知是自动驾驶车辆安全运行的关键。然而,自动驾驶车辆的感知能力受到遮挡、感器范围有限以及环境因素的限制。集成感知(CP)试图解决这些问题,通过车辆之间的信息交换来提高感知范围和精度。但是,CP中的信息融合带来挑战,特别是在早期的融合方法需要巨大的带宽,而中间融合方法则存在交换信息的问题。为此,我们提出了一种新的融合方法,将在本地LiDAR基本检测管道中融合协作车辆的检测结果。因此,我们提出了集成PV-RCNN(CPV-RCNN),它是基于PV-RCNN++框架,用于融合集成检测结果。代码可以在GitHub上找到:https://github.com/ekut-es。
CNN or ViT? Revisiting Vision Transformers Through the Lens of Convolution
results: 在多个小数据集上实现了无需预训练的ViT表现提高(几乎无额外参数或计算成本增加)Abstract
The success of Vision Transformer (ViT) has been widely reported on a wide range of image recognition tasks. The merit of ViT over CNN has been largely attributed to large training datasets or auxiliary pre-training. Without pre-training, the performance of ViT on small datasets is limited because the global self-attention has limited capacity in local modeling. Towards boosting ViT on small datasets without pre-training, this work improves its local modeling by applying a weight mask on the original self-attention matrix. A straightforward way to locally adapt the self-attention matrix can be realized by an element-wise learnable weight mask (ELM), for which our preliminary results show promising results. However, the element-wise simple learnable weight mask not only induces a non-trivial additional parameter overhead but also increases the optimization complexity. To this end, this work proposes a novel Gaussian mixture mask (GMM) in which one mask only has two learnable parameters and it can be conveniently used in any ViT variants whose attention mechanism allows the use of masks. Experimental results on multiple small datasets demonstrate that the effectiveness of our proposed Gaussian mask for boosting ViTs for free (almost zero additional parameter or computation cost). Our code will be publicly available at \href{https://github.com/CatworldLee/Gaussian-Mixture-Mask-Attention}{https://github.com/CatworldLee/Gaussian-Mixture-Mask-Attention}.
摘要
异形卷积(ViT)的成功在各种图像识别任务上广泛报道。异形卷积比 traditional CNN 有更多的优势,主要归功于大规模的训练数据或auxiliary预训练。然而,在小数据集上不进行预训练时,异形卷积的性能有限,因为全球自注意的能力在本地模型中有限。为了提高异形卷积在小数据集上的性能,这种工作改进了异形卷积的本地模型,通过应用Weight Mask在原始的自注意矩阵上。一种直观的方式是使用元素可学习的权重Mask(ELM),我们的初步结果表明这种方法有承诺的结果。然而,元素可学习的简单权重Mask不仅增加了非常小的额外参数负担,还增加了优化复杂度。为了解决这个问题,这种工作提议一种新的 Gaussian Mixture Mask(GMM),它只有两个可学习参数,可以方便地在任何异形卷积变种中使用。我们的实验结果表明,我们的提议的 Gaussian 面积可以免除额外参数和计算成本,提高异形卷积的性能。我们的代码将在 GitHub 上公开,可以在 \href{https://github.com/CatworldLee/Gaussian-Mixture-Mask-Attention}{https://github.com/CatworldLee/Gaussian-Mixture-Mask-Attention} 上获取。
Learning Geometric Representations of Objects via Interaction
results: 该 paper 提供了一种理论基础和实验证明,证明理想学习者可以准确地提取代理人和对象的位置,并且在下游任务中使用 reinforcement learning 进行有效的解决。Abstract
We address the problem of learning representations from observations of a scene involving an agent and an external object the agent interacts with. To this end, we propose a representation learning framework extracting the location in physical space of both the agent and the object from unstructured observations of arbitrary nature. Our framework relies on the actions performed by the agent as the only source of supervision, while assuming that the object is displaced by the agent via unknown dynamics. We provide a theoretical foundation and formally prove that an ideal learner is guaranteed to infer an isometric representation, disentangling the agent from the object and correctly extracting their locations. We evaluate empirically our framework on a variety of scenarios, showing that it outperforms vision-based approaches such as a state-of-the-art keypoint extractor. We moreover demonstrate how the extracted representations enable the agent to solve downstream tasks via reinforcement learning in an efficient manner.
摘要
我们处理了对一个场景中的代理人和外部物体之间的学习表现的问题。为此,我们提出了一个表现学习框架,从无结构的观察中提取代理人和物体的物理空间位置。我们的框架仅受代理人的动作作为指导,并假设物体是由代理人驱动的隐藏 Dynamics。我们提供了理论基础,正式证明了理想学习者可以从无结构观察中推导出不对称的表现,将代理人和物体分离开来,并正确地提取他们的位置。我们进行了实验评估,证明了我们的框架在多种情况下表现较好,并且可以通过循环学习来解决下游任务。
PAg-NeRF: Towards fast and efficient end-to-end panoptic 3D representations for agricultural robotics
paper_authors: Claus Smitt, Michael Halstead, Patrick Zimmer, Thomas Läbe, Esra Guclu, Cyrill Stachniss, Chris McCool
for: 实现园林 robots 监控和干预任务中的高精度Scene理解
methods: 使用NeRF技术建立3D测量和照片实际化描述
results: 提高 peak signal to noise ratio 和 panoptic quality,并且可以适应不精度的机器人位置资料Abstract
Precise scene understanding is key for most robot monitoring and intervention tasks in agriculture. In this work we present PAg-NeRF which is a novel NeRF-based system that enables 3D panoptic scene understanding. Our representation is trained using an image sequence with noisy robot odometry poses and automatic panoptic predictions with inconsistent IDs between frames. Despite this noisy input, our system is able to output scene geometry, photo-realistic renders and 3D consistent panoptic representations with consistent instance IDs. We evaluate this novel system in a very challenging horticultural scenario and in doing so demonstrate an end-to-end trainable system that can make use of noisy robot poses rather than precise poses that have to be pre-calculated. Compared to a baseline approach the peak signal to noise ratio is improved from 21.34dB to 23.37dB while the panoptic quality improves from 56.65% to 70.08%. Furthermore, our approach is faster and can be tuned to improve inference time by more than a factor of 2 while being memory efficient with approximately 12 times fewer parameters.
摘要
precisions scene understanding 是 agriculture robot monitoring 和 intervening 任务中的关键。在这项工作中,我们介绍了一种基于NeRF的新系统PAg-NeRF,它可以实现3D权威场景理解。我们的表示使用了含有噪声的机器人姿态pose的图像序列,以及自动生成的�anoptic预测结果,其中每帧ID不匹配。尽管输入含噪声,我们的系统仍能输出场景几何学、真实图像和3D一致的�anoptic表示,并且实现了一致的实例ID。我们在一个非常具有挑战性的植物种植场景中评估了这种新系统,并在这之中展示了一个可以使用噪声机器人姿态而不需要先计算精确姿态的端到端可训练系统。相比基线方法,我们的方法可以提高峰峰信号听频比为21.34dB到23.37dB,并提高�anoptic质量从56.65%到70.08%。此外,我们的方法更快,可以通过更 чем2倍的速度调整执行时间,同时具有较少的参数。
results: 我们在两个 Pascal VOC 数据集上进行评估,结果显示我们的方法无需使用复杂的调整和练习,仍能超越许多现有的州Of-The-Art 方法。Abstract
Class-Incremental learning (CIL) is the ability of artificial agents to accommodate new classes as they appear in a stream. It is particularly interesting in evolving environments where agents have limited access to memory and computational resources. The main challenge of class-incremental learning is catastrophic forgetting, the inability of neural networks to retain past knowledge when learning a new one. Unfortunately, most existing class-incremental object detectors are applied to two-stage algorithms such as Faster-RCNN and rely on rehearsal memory to retain past knowledge. We believe that the current benchmarks are not realistic, and more effort should be dedicated to anchor-free and rehearsal-free object detection. In this context, we propose MultIOD, a class-incremental object detector based on CenterNet. Our main contributions are: (1) we propose a multihead feature pyramid and multihead detection architecture to efficiently separate class representations, (2) we employ transfer learning between classes learned initially and those learned incrementally to tackle catastrophic forgetting, and (3) we use a class-wise non-max-suppression as a post-processing technique to remove redundant boxes. Without bells and whistles, our method outperforms a range of state-of-the-art methods on two Pascal VOC datasets.
摘要
《级间学习(CIL)是人工智能代理人的承受新类出现在流中的能力。在发展环境中,代理人具有有限的内存和计算资源。主要挑战是迁移学习,即神经网络忘记过去知识 WHEN learning new one。现有的大多数级间学习对象检测器采用了两stage算法如Faster-RCNN,并且 rely on rehearsal memory来保持过去知识。我们认为现有的标准准则不实际,更应该投入更多的努力于 anchor-free和 rehearsal-free 对象检测。在这个上下文中,我们提出了 MultIOD,基于 CenterNet 的级间学习对象检测器。我们的主要贡献是:1. 我们提出了多头特征层和多头检测架构,以有效地分离类表示。2. 我们使用类初学习和级间学习之间的转移学习来抗衡迁移学习问题。3. 我们使用类别非最大抑制作为后处理技术,以去除重复的框。 ohne 钻石和精雕,我们的方法在 Pascal VOC 两个数据集上超越了一些状态的先进方法。》
Diff-Privacy: Diffusion-based Face Privacy Protection
results: 对比于传统方法,提出的Diff-Privacy方法能够更好地保护人脸隐私,并且可以同时实现人脸隐私和视觉特征隐私。经验表明,Diff-Privacy方法能够减少人脸隐私攻击的风险。Abstract
Privacy protection has become a top priority as the proliferation of AI techniques has led to widespread collection and misuse of personal data. Anonymization and visual identity information hiding are two important facial privacy protection tasks that aim to remove identification characteristics from facial images at the human perception level. However, they have a significant difference in that the former aims to prevent the machine from recognizing correctly, while the latter needs to ensure the accuracy of machine recognition. Therefore, it is difficult to train a model to complete these two tasks simultaneously. In this paper, we unify the task of anonymization and visual identity information hiding and propose a novel face privacy protection method based on diffusion models, dubbed Diff-Privacy. Specifically, we train our proposed multi-scale image inversion module (MSI) to obtain a set of SDM format conditional embeddings of the original image. Based on the conditional embeddings, we design corresponding embedding scheduling strategies and construct different energy functions during the denoising process to achieve anonymization and visual identity information hiding. Extensive experiments have been conducted to validate the effectiveness of our proposed framework in protecting facial privacy.
摘要
隐私保护已成为人工智能技术普及后的首要任务,由于广泛收集和不当使用个人数据,隐私保护变得更加重要。隐身化和视觉特征隐藏是两项重要的面部隐私保护任务,它们的目标是在人类视觉水平上从面部图像中除去识别特征。然而,这两项任务之间存在重要的区别,隐身化旨在防止机器正确地识别,而视觉特征隐藏则需要保证机器识别的准确性。因此,同时完成这两项任务是很困难的。在这篇论文中,我们将隐身化和视觉特征隐藏的任务综合起来,并提出了一种基于扩散模型的面部隐私保护方法,称为Diff-Privacy。具体来说,我们将训练我们提出的多尺度图像反转模块(MSI),以获得原始图像的SDM格式条件嵌入。根据条件嵌入,我们设计了对应的嵌入调度策略,并在减噪过程中构建不同的能量函数,以实现隐身化和视觉特征隐藏。我们对方法进行了广泛的实验,以验证其在保护面部隐私方面的效果。
DeCUR: decoupling common & unique representations for multimodal self-supervision
results: 在radar-optical、RGB-elevation和RGB-depth等常见的多modal场景中,该方法能够准确地进行景物分类和 semantic segmentation下游任务,并且可以 straightforward地提高state-of-the-art超vised多modal方法的性能。Abstract
The increasing availability of multi-sensor data sparks interest in multimodal self-supervised learning. However, most existing approaches learn only common representations across modalities while ignoring intra-modal training and modality-unique representations. We propose Decoupling Common and Unique Representations (DeCUR), a simple yet effective method for multimodal self-supervised learning. By distinguishing inter- and intra-modal embeddings, DeCUR is trained to integrate complementary information across different modalities. We evaluate DeCUR in three common multimodal scenarios (radar-optical, RGB-elevation, and RGB-depth), and demonstrate its consistent benefits on scene classification and semantic segmentation downstream tasks. Notably, we get straightforward improvements by transferring our pretrained backbones to state-of-the-art supervised multimodal methods without any hyperparameter tuning. Furthermore, we conduct a comprehensive explainability analysis to shed light on the interpretation of common and unique features in our multimodal approach. Codes are available at \url{https://github.com/zhu-xlab/DeCUR}.
摘要
随着多感器数据的更加普遍,人们对多模态自我supervised学习表示越来越多的兴趣。然而,大多数现有的方法只学习多modalities中的共同表示,而忽略了INTRA-modal训练和特有表示。我们提议一种简单 yet effective的方法:异步共同和独特表示分离(DeCUR),用于多modal self-supervised learning。通过分辨Inter-和INTRA-modal嵌入,DeCUR可以整合不同modalities中的补充信息。我们在radar-optical、RGB-elevation和RGB-depth三个常见的多modal场景中进行评估,并证明DeCUR在Scene classification和semantic segmentation下渠道任务中具有一致的好处。特别是,我们可以无需任何超参数调整直接将我们预训练的backbone传承到状态 искусственный极点的supervised multimodal方法中,获得直接的改进。此外,我们进行了广泛的解释性分析,以便更好地理解我们的多modal方法中的共同和特有特征的解释。codes可以在\url{https://github.com/zhu-xlab/DeCUR}中找到。
Task-driven Compression for Collision Encoding based on Depth Images
results: 比较这个提案的任务驱动编码方法与传统任务无关的方法, demonstrate superior performance for the task of collision image prediction from extremely low-dimensional latent spaces。一系列的比较研究显示,提案的方法可以对复杂的场景中的细小障碍物进行更好的编码,具有4050:1的压缩比。Abstract
This paper contributes a novel learning-based method for aggressive task-driven compression of depth images and their encoding as images tailored to collision prediction for robotic systems. A novel 3D image processing methodology is proposed that accounts for the robot's size in order to appropriately "inflate" the obstacles represented in the depth image and thus obtain the distance that can be traversed by the robot in a collision-free manner along any given ray within the camera frustum. Such depth-and-collision image pairs are used to train a neural network that follows the architecture of Variational Autoencoders to compress-and-transform the information in the original depth image to derive a latent representation that encodes the collision information for the given depth image. We compare our proposed task-driven encoding method with classical task-agnostic methods and demonstrate superior performance for the task of collision image prediction from extremely low-dimensional latent spaces. A set of comparative studies show that the proposed approach is capable of encoding depth image-and-collision image tuples from complex scenes with thin obstacles at long distances better than the classical methods at compression ratios as high as 4050:1.
摘要
Class-Incremental Grouping Network for Continual Audio-Visual Learning
results: 根据实验结果显示,CIGN 可以在 VGGSound-Instruments、VGGSound-100 和 VGG-Sound Sources 评量上实现类别增量学习的最佳性能。Abstract
Continual learning is a challenging problem in which models need to be trained on non-stationary data across sequential tasks for class-incremental learning. While previous methods have focused on using either regularization or rehearsal-based frameworks to alleviate catastrophic forgetting in image classification, they are limited to a single modality and cannot learn compact class-aware cross-modal representations for continual audio-visual learning. To address this gap, we propose a novel class-incremental grouping network (CIGN) that can learn category-wise semantic features to achieve continual audio-visual learning. Our CIGN leverages learnable audio-visual class tokens and audio-visual grouping to continually aggregate class-aware features. Additionally, it utilizes class tokens distillation and continual grouping to prevent forgetting parameters learned from previous tasks, thereby improving the model's ability to capture discriminative audio-visual categories. We conduct extensive experiments on VGGSound-Instruments, VGGSound-100, and VGG-Sound Sources benchmarks. Our experimental results demonstrate that the CIGN achieves state-of-the-art audio-visual class-incremental learning performance. Code is available at https://github.com/stoneMo/CIGN.
摘要
<> translate into Simplified ChineseContinual learning 是一个具有挑战性的问题, models 需要在非站点数据上进行 sequential 任务中的 class-incremental learning。 在过去的方法中,使用了 either regularization 或 rehearsal-based 框架,以减轻 catastrophic forgetting 在 image classification 中,但这些方法仅适用于单一模式,无法学习 cross-modal 的 class-aware 表示。为了解决这个差异,我们提出了一个 novel class-incremental grouping network (CIGN),可以学习 category-wise semantic features,以实现 continual audio-visual learning。我们的 CIGN 利用可学习的 audio-visual 类别标志和 audio-visual 分组,以 continually 积累 class-aware 特征。此外,它还利用类别标志激发和 continual 分组,以防止遗传 learned 的前一个任务中的知识。我们对 VGGSound-Instruments、VGGSound-100 和 VGG-Sound Sources 标准库进行了广泛的实验。我们的实验结果显示,CIGN 在 audio-visual class-incremental learning 中 achieved state-of-the-art 性能。代码可以在 https://github.com/stoneMo/CIGN 获取。
results: 我们的方法可以在两个普遍性类型的物体计数 benchmark 上减少多种现有 visual counter 的平均绝对误差,比如FSCD-LVIS 和 FSC-147,减少约30%到40%。Abstract
We propose a novel framework for interactive class-agnostic object counting, where a human user can interactively provide feedback to improve the accuracy of a counter. Our framework consists of two main components: a user-friendly visualizer to gather feedback and an efficient mechanism to incorporate it. In each iteration, we produce a density map to show the current prediction result, and we segment it into non-overlapping regions with an easily verifiable number of objects. The user can provide feedback by selecting a region with obvious counting errors and specifying the range for the estimated number of objects within it. To improve the counting result, we develop a novel adaptation loss to force the visual counter to output the predicted count within the user-specified range. For effective and efficient adaptation, we propose a refinement module that can be used with any density-based visual counter, and only the parameters in the refinement module will be updated during adaptation. Our experiments on two challenging class-agnostic object counting benchmarks, FSCD-LVIS and FSC-147, show that our method can reduce the mean absolute error of multiple state-of-the-art visual counters by roughly 30% to 40% with minimal user input. Our project can be found at https://yifehuang97.github.io/ICACountProjectPage/.
摘要
我们提出了一种新的框架,用于互动性的类共享对象计数,其中人类用户可以互动地提供反馈来提高计数的准确性。我们的框架包括两个主要组成部分:一个易于使用的视觉化器来收集反馈,以及一种高效的机制来整合它。在每次迭代中,我们生成一张扩散图来显示当前预测结果,并将其分成不重叠的区域,每个区域可以轻松地验证其中的对象数量。用户可以通过选择具有明显计数错误的区域,并指定该区域内对象数量的范围,来提供反馈。为了改进计数结果,我们开发了一种新的适应损失,使得视觉计数器输出预测的计数在用户指定的范围内。为了有效和高效地适应,我们提议一种修充模块,可以与任何扩散基本的视觉计数器结合使用,只有修充模块中的参数会在适应过程中更新。我们的实验表明,我们的方法可以在两个普遍性类共享对象计数标准 benchmark 上减少多个现状顶尖的视觉计数器的平均绝对误差约30%到40%,具有最小的用户输入。您可以在 查看我们的项目。
Diving into Darkness: A Dual-Modulated Framework for High-Fidelity Super-Resolution in Ultra-Dark Environments
for: This paper is written for the problem of super-resolution in ultra-dark environments, which is a challenging and practical problem that has received little attention.
methods: The paper proposes a specialized dual-modulated learning framework that includes a self-regularized luminance constraint and Illuminance-Semantic Dual Modulation (ISDM) components to enhance feature-level preservation of illumination and color details. Additionally, the paper designs a Resolution-Sensitive Merging Up-sampler (RSMU) module to mitigate the presence of artifacts and halos.
results: The paper shows that its approach outperforms state-of-the-art methods in terms of PSNR, LPIPS, and RMSE score, with a notable improvement of 5% in PSNR and 43% in LPIPS. The paper also demonstrates the generalization of its approach across different darkness levels, with a 19-fold increase in the RMSE score.Abstract
Super-resolution tasks oriented to images captured in ultra-dark environments is a practical yet challenging problem that has received little attention. Due to uneven illumination and low signal-to-noise ratio in dark environments, a multitude of problems such as lack of detail and color distortion may be magnified in the super-resolution process compared to normal-lighting environments. Consequently, conventional low-light enhancement or super-resolution methods, whether applied individually or in a cascaded manner for such problem, often encounter limitations in recovering luminance, color fidelity, and intricate details. To conquer these issues, this paper proposes a specialized dual-modulated learning framework that, for the first time, attempts to deeply dissect the nature of the low-light super-resolution task. Leveraging natural image color characteristics, we introduce a self-regularized luminance constraint as a prior for addressing uneven lighting. Expanding on this, we develop Illuminance-Semantic Dual Modulation (ISDM) components to enhance feature-level preservation of illumination and color details. Besides, instead of deploying naive up-sampling strategies, we design the Resolution-Sensitive Merging Up-sampler (RSMU) module that brings together different sampling modalities as substrates, effectively mitigating the presence of artifacts and halos. Comprehensive experiments showcases the applicability and generalizability of our approach to diverse and challenging ultra-low-light conditions, outperforming state-of-the-art methods with a notable improvement (i.e., $\uparrow$5\% in PSNR, and $\uparrow$43\% in LPIPS). Especially noteworthy is the 19-fold increase in the RMSE score, underscoring our method's exceptional generalization across different darkness levels. The code will be available online upon publication of the paper.
摘要
“低光环境下的超解像任务是一个实际且挑战性的问题,它们获得了少量的注意。由于低光环境中的不均匀照明和讯号与干扰比,它们可能会导致缺乏细节和颜色扭曲。因此,传统的低光照增强或超解像方法, whether applied individually or in a cascaded manner for this problem, often encounter limitations in recovering luminance, color fidelity, and intricate details. To conquer these issues, this paper proposes a specialized dual-modulated learning framework that, for the first time, attempts to deeply dissect the nature of the low-light super-resolution task. Leveraging natural image color characteristics, we introduce a self-regularized luminance constraint as a prior for addressing uneven lighting. Expanding on this, we develop Illuminance-Semantic Dual Modulation (ISDM) components to enhance feature-level preservation of illumination and color details. Besides, instead of deploying naive up-sampling strategies, we design the Resolution-Sensitive Merging Up-sampler (RSMU) module that brings together different sampling modalities as substrates, effectively mitigating the presence of artifacts and halos. Comprehensive experiments showcases the applicability and generalizability of our approach to diverse and challenging ultra-low-light conditions, outperforming state-of-the-art methods with a notable improvement (i.e., $\uparrow$5\% in PSNR, and $\uparrow$43\% in LPIPS). Especially noteworthy is the 19-fold increase in the RMSE score, underscoring our method's exceptional generalization across different darkness levels. The code will be available online upon publication of the paper.”
A horizon line annotation tool for streamlining autonomous sea navigation experiments
paper_authors: Yassir Zardoua, Abdelhamid El Wahabi, Mohammed Boulaala, Abdelali Astito
For: 本研究的目的是为marine autonomous navigation任务提供更加稳定和可靠的海Horizon line检测方法。* Methods: 本研究使用了一种新的公共注释软件,用于快速和容易地注释收集的海洋图像中的海Horizon line。* Results: 本研究通过对多种海condexts进行实验 validate了一种更加Robust的海Horizon line检测方法。Here’s the breakdown of each point in English:* For: The purpose of this research is to provide more stable and reliable horizon line detection methods for marine autonomous navigation tasks.* Methods: This research uses a new public annotation software to quickly and easily annotate collected marine images with the correct position and orientation of the horizon line.* Results: This research experimentally validated a more robust horizon line detection method by collecting and annotating images of various sea conditions.Abstract
Horizon line (or sea line) detection (HLD) is a critical component in multiple marine autonomous navigation tasks, such as identifying the navigation area (i.e., the sea), obstacle detection and geo-localization, and digital video stabilization. A recent survey highlighted several weaknesses of such detectors, particularly on sea conditions lacking from the most extensive dataset currently used by HLD researchers. Experimental validation of more robust HLDs involves collecting an extensive set of these lacking sea conditions and annotating each collected image with the correct position and orientation of the horizon line. The annotation task is daunting without a proper tool. Therefore, we present the first public annotation software with tailored features to make the sea line annotation process fast and easy. The software is available at: https://drive.google.com/drive/folders/1c0ZmvYDckuQCPIWfh_70P7E1A_DWlIvF?usp=sharing
摘要
<>将文本翻译为简化字符串。<>水平线(或海洋线)检测(HLD)是多种海上自动导航任务的关键组件,包括定位水平线(即海洋)、障碍物检测和地理位置确定、数字视频稳定化等。一项最新的调查指出了HLD检测器的一些弱点,特别是在缺乏当前HLD研究者使用最广泛的数据集的情况下。为验证更加稳定的HLD,需要收集一个广泛的这些缺失的海洋条件,并对每个收集的图像注解正确的水平线位置和方向。 annotation task是无法进行的,因此我们提供了首个公共的注释软件,带有适应的特性,使水平线注释过程快速和容易。软件可以在以下链接中下载:https://drive.google.com/drive/folders/1c0ZmvYDckuQCPIWfh_70P7E1A_DWlIvF?usp=sharing
Gall Bladder Cancer Detection from US Images with Only Image Level Labels
results: 我们的方法在 AP 和检测敏感性上比 SOTA transformer-based 和 CNN-based WSOD 方法更好。Note:
for: 是指研究的目的或目标,即这个研究是为了提高 GBC 的检测精度。
methods: 是指使用的方法或技术,即使用 transformer 模型和 MIL 等方法。
results: 是指研究所得到的结果,即比 SOTA 方法更好的检测精度。Abstract
Automated detection of Gallbladder Cancer (GBC) from Ultrasound (US) images is an important problem, which has drawn increased interest from researchers. However, most of these works use difficult-to-acquire information such as bounding box annotations or additional US videos. In this paper, we focus on GBC detection using only image-level labels. Such annotation is usually available based on the diagnostic report of a patient, and do not require additional annotation effort from the physicians. However, our analysis reveals that it is difficult to train a standard image classification model for GBC detection. This is due to the low inter-class variance (a malignant region usually occupies only a small portion of a US image), high intra-class variance (due to the US sensor capturing a 2D slice of a 3D object leading to large viewpoint variations), and low training data availability. We posit that even when we have only the image level label, still formulating the problem as object detection (with bounding box output) helps a deep neural network (DNN) model focus on the relevant region of interest. Since no bounding box annotations is available for training, we pose the problem as weakly supervised object detection (WSOD). Motivated by the recent success of transformer models in object detection, we train one such model, DETR, using multi-instance-learning (MIL) with self-supervised instance selection to suit the WSOD task. Our proposed method demonstrates an improvement of AP and detection sensitivity over the SOTA transformer-based and CNN-based WSOD methods. Project page is at https://gbc-iitd.github.io/wsod-gbc
摘要
自动检测结肠癌(GBC)从超声画像(US)图像是一个重要的问题,吸引了研究者们的关注。然而,大多数这些工作使用难以获得的信息,如矩形框注释或更多的US视频。在这篇论文中,我们专注于基于只有图像级别的标签进行GBC检测。这些标签通常可以基于病人的诊断报告获得,不需要更多的注释努力。然而,我们的分析表明,使用标准的图像分类模型进行GBC检测是困难的。这是因为肿瘤区域通常占用US图像中的only a small portion,以及US感知器捕捉的2Dslice of a 3D object leading to large viewpoint variations。我们认为,即使只有图像级别的标签,使用深度神经网络(DNN)模型仍然可以帮助模型关注相关的区域。由于没有 bounding box 注释可用于训练,我们将问题定义为弱型Supervised Object Detection(WSOD)问题。驱动于Recent success of transformer models in object detection,我们使用多例学习(MIL)和自我驱动的实例选择来训练我们的模型DETR。我们的提议方法在AP和检测敏感性方面与State-of-the-Art(SOTA)基于 transformer 和 CNN 的 WSOD 方法之上具有进步。相关页面可以在 中找到。
FusionFormer: A Multi-sensory Fusion in Bird’s-Eye-View and Temporal Consistent Transformer for 3D Objection
results: 在 nuScenes dataset 上 Achieve 72.6% mAP 和 75.1% NDS,超越现有方法Abstract
Multi-sensor modal fusion has demonstrated strong advantages in 3D object detection tasks. However, existing methods that fuse multi-modal features through a simple channel concatenation require transformation features into bird's eye view space and may lose the information on Z-axis thus leads to inferior performance. To this end, we propose FusionFormer, an end-to-end multi-modal fusion framework that leverages transformers to fuse multi-modal features and obtain fused BEV features. And based on the flexible adaptability of FusionFormer to the input modality representation, we propose a depth prediction branch that can be added to the framework to improve detection performance in camera-based detection tasks. In addition, we propose a plug-and-play temporal fusion module based on transformers that can fuse historical frame BEV features for more stable and reliable detection results. We evaluate our method on the nuScenes dataset and achieve 72.6% mAP and 75.1% NDS for 3D object detection tasks, outperforming state-of-the-art methods.
摘要
多感器模式融合已经在3D物体探测任务中显示出了强大的优势。然而,现有的方法通过简单的通道拼接来融合多模态特征可能会产生Z轴信息损失,导致性能下降。为此,我们提出了FusionFormer,一个端到端多模态融合框架,利用转换器来融合多模态特征并获得融合BEV特征。此外,基于输入模式表示的灵活适应性,我们提出了一个深度预测分支,可以添加到框架中,以提高摄像头基于探测任务的检测性能。此外,我们还提出了基于转换器的历史帧BEV特征融合模块,可以将历史帧特征融合以获得更稳定和可靠的检测结果。我们在nuScenes dataset上评估了我们的方法,并实现了3D物体探测任务中的72.6% mAP和75.1% NDS,超过了当前state-of-the-art方法。
Towards Better Data Exploitation In Self-Supervised Monocular Depth Estimation
paper_authors: Jinfeng Liu, Lingtong Kong, Jie Yang, Wei Liu
for: 本研究旨在提高自助学习独眼视觉系统中的深度估计能力,以增强机器人视觉系统的泛化能力。
methods: 本研究使用了两种数据增强技术:Resizing-Cropping和Splitting-Permuting,以充分利用训练数据的潜在能力。具体来说,我们将原始图像和生成的两个增强图像同时 feed into 训练管道,并通过自我采样来进行自采样。此外,我们还提出了细节优化的 DepthNet,其包括一个全规模分支在编码器中和一个网格解码器,以提高depth图中的细节Restoration。
results: 实验结果表明,我们的方法可以在KITTI标准测试集上达到状态机器人视觉系统中的最佳性能,并且我们的模型还能够在Make3D和NYUv2测试集上 Transfer learning 表现出色。Abstract
Depth estimation plays an important role in the robotic perception system. Self-supervised monocular paradigm has gained significant attention since it can free training from the reliance on depth annotations. Despite recent advancements, existing self-supervised methods still underutilize the available training data, limiting their generalization ability. In this paper, we take two data augmentation techniques, namely Resizing-Cropping and Splitting-Permuting, to fully exploit the potential of training datasets. Specifically, the original image and the generated two augmented images are fed into the training pipeline simultaneously and we leverage them to conduct self-distillation. Additionally, we introduce the detail-enhanced DepthNet with an extra full-scale branch in the encoder and a grid decoder to enhance the restoration of fine details in depth maps. Experimental results demonstrate our method can achieve state-of-the-art performance on the KITTI benchmark, with both raw ground truth and improved ground truth. Moreover, our models also show superior generalization performance when transferring to Make3D and NYUv2 datasets. Our codes are available at https://github.com/Sauf4896/BDEdepth.
摘要
depth estimation 在 robotic perception system 中扮演着重要的角色。无监督单目法固有了广泛的关注,因为它可以免除depth注释的依赖。Despite recent advancements, existing self-supervised methods still underutilize the available training data, limiting their generalization ability. In this paper, we take two data augmentation techniques, namely Resizing-Cropping and Splitting-Permuting, to fully exploit the potential of training datasets. Specifically, the original image and the generated two augmented images are fed into the training pipeline simultaneously and we leverage them to conduct self-distillation. Additionally, we introduce the detail-enhanced DepthNet with an extra full-scale branch in the encoder and a grid decoder to enhance the restoration of fine details in depth maps. Experimental results demonstrate our method can achieve state-of-the-art performance on the KITTI benchmark, with both raw ground truth and improved ground truth. Moreover, our models also show superior generalization performance when transferring to Make3D and NYUv2 datasets. Our codes are available at https://github.com/Sauf4896/BDEdepth.Note that the translation is in Simplified Chinese, which is the standard form of Chinese used in mainland China and Singapore. If you prefer Traditional Chinese, I can provide that as well.
Multi3DRefer: Grounding Text Description to Multiple 3D Objects
paper_authors: Yiming Zhang, ZeMing Gong, Angel X. Chang for: 本研究旨在用自然语言描述Localize多个3D场景中的灵活数量的物体。现有的3D视觉固定任务都是基于唯一的目标物体描述,但这种约束是不自然的,因为在真实世界情况下,我们经常需要Localize多个物体。methods: 我们提出了Multi3DRefer,一种扩展ScanRefer数据集和任务,以解决这种情况。我们的数据集包含11609个 объек的11609个描述,其中每个描述可能有一个、多个或zero个目标物体。我们还引入了一种新的评价指标和优化方法,以便进一步研究多模式3D场景理解。results: 我们开发了一种新的基线方法,利用CLIP的2D特征进行对比学习,可以在线渲染出对象提案,并超越当前状态的最佳性能在ScanReferbenchmark上。Abstract
We introduce the task of localizing a flexible number of objects in real-world 3D scenes using natural language descriptions. Existing 3D visual grounding tasks focus on localizing a unique object given a text description. However, such a strict setting is unnatural as localizing potentially multiple objects is a common need in real-world scenarios and robotic tasks (e.g., visual navigation and object rearrangement). To address this setting we propose Multi3DRefer, generalizing the ScanRefer dataset and task. Our dataset contains 61926 descriptions of 11609 objects, where zero, single or multiple target objects are referenced by each description. We also introduce a new evaluation metric and benchmark methods from prior work to enable further investigation of multi-modal 3D scene understanding. Furthermore, we develop a better baseline leveraging 2D features from CLIP by rendering object proposals online with contrastive learning, which outperforms the state of the art on the ScanRefer benchmark.
摘要
我们介绍了在实际世界3D场景中自然语言描述中Localizing多个灵活对象的任务。现有的3D视觉定位任务都是基于固定对象的描述,但这种 Setting是不自然的,因为在实际场景中Localizing多个对象是常见的需求,例如视觉导航和对象重新排序。为解决这种设定,我们提出了Multi3DRefer,它扩展了ScanRefer数据集和任务。我们的数据集包括11609个对象的11609个描述,其中每个描述可能引用零、单个或多个目标对象。我们还引入了一个新的评价指标和优秀方法,从优秀的多模式3D场景理解研究中继承来。此外,我们开发了一个更好的基线,通过在线渲染对象提案,使用对比学习,以获得更高的性能在ScanRefer bencmark上。
HAT: Hybrid Attention Transformer for Image Restoration
results: 对比于基eline方法,HAT可以更好地恢复图像,并且可以扩展到更多的图像修复应用,如真实世界图像超解、Gaussian图像噪声除除和图像压缩 artefacts 除除。实验表明,HAT可以达到状态之前的最佳性能。Abstract
Transformer-based methods have shown impressive performance in image restoration tasks, such as image super-resolution and denoising. However, we find that these networks can only utilize a limited spatial range of input information through attribution analysis. This implies that the potential of Transformer is still not fully exploited in existing networks. In order to activate more input pixels for better restoration, we propose a new Hybrid Attention Transformer (HAT). It combines both channel attention and window-based self-attention schemes, thus making use of their complementary advantages. Moreover, to better aggregate the cross-window information, we introduce an overlapping cross-attention module to enhance the interaction between neighboring window features. In the training stage, we additionally adopt a same-task pre-training strategy to further exploit the potential of the model for further improvement. Extensive experiments have demonstrated the effectiveness of the proposed modules. We further scale up the model to show that the performance of the SR task can be greatly improved. Besides, we extend HAT to more image restoration applications, including real-world image super-resolution, Gaussian image denoising and image compression artifacts reduction. Experiments on benchmark and real-world datasets demonstrate that our HAT achieves state-of-the-art performance both quantitatively and qualitatively. Codes and models are publicly available at https://github.com/XPixelGroup/HAT.
摘要
<>Transformer-based methods have shown impressive performance in image restoration tasks, such as image super-resolution and denoising. However, we find that these networks can only utilize a limited spatial range of input information through attribution analysis. This implies that the potential of Transformer is still not fully exploited in existing networks. In order to activate more input pixels for better restoration, we propose a new Hybrid Attention Transformer (HAT). It combines both channel attention and window-based self-attention schemes, thus making use of their complementary advantages. Moreover, to better aggregate the cross-window information, we introduce an overlapping cross-attention module to enhance the interaction between neighboring window features. In the training stage, we additionally adopt a same-task pre-training strategy to further exploit the potential of the model for further improvement. Extensive experiments have demonstrated the effectiveness of the proposed modules. We further scale up the model to show that the performance of the SR task can be greatly improved. Besides, we extend HAT to more image restoration applications, including real-world image super-resolution, Gaussian image denoising and image compression artifacts reduction. Experiments on benchmark and real-world datasets demonstrate that our HAT achieves state-of-the-art performance both quantitatively and qualitatively. Codes and models are publicly available at .中文简体版:Transformer基于方法在图像修复任务中表现出色,如图像超解和噪声除除。然而,我们发现这些网络只能利用输入图像的有限范围信息,通过属性分析来进行评估。这意味着Transformer的潜力还未被完全利用。为了激活更多的输入像素,提高修复效果,我们提议一种新的混合注意力Transformer(HAT)。它结合了通道注意力和窗口基于自注意力的方法,并且利用了它们的优势。此外,为了更好地聚合跨窗信息,我们引入了重叠cross-attention模块,以增强邻近窗口特征之间的互动。在训练阶段,我们还采用了同任务预训练策略,以进一步利用模型的潜力。广泛的实验证明了我们提议的模块的效iveness。此外,我们还将HAT扩展到更多的图像修复应用,包括实际世界图像超解、Gaussian图像噪声除除和图像压缩 artifacts reduction。实验表明,我们的HAT在量化和质量上都达到了领先的性能。代码和模型在公共可用。
Angle Range and Identity Similarity Enhanced Gaze and Head Redirection based on Synthetic data
results: 实现了大角度重定向的改进表现,同时保持高图像质量和人脸认知稳定性Abstract
In this paper, we propose a method for improving the angular accuracy and photo-reality of gaze and head redirection in full-face images. The problem with current models is that they cannot handle redirection at large angles, and this limitation mainly comes from the lack of training data. To resolve this problem, we create data augmentation by monocular 3D face reconstruction to extend the head pose and gaze range of the real data, which allows the model to handle a wider redirection range. In addition to the main focus on data augmentation, we also propose a framework with better image quality and identity preservation of unseen subjects even training with synthetic data. Experiments show that our method significantly improves redirection performance in terms of redirection angular accuracy while maintaining high image quality, especially when redirecting to large angles.
摘要
在这篇论文中,我们提出了一种方法来提高全面图像中的视线和头部重定向精度。现有模型无法处理大角度的重定向,主要是因为数据训练的限制。为解决这个问题,我们创造了一种数据扩展方法,通过单目3D脸部重建来扩展实际数据中的头部姿态和视线范围,这allow the model可以处理更广泛的重定向范围。除了主要关注数据扩展之外,我们还提出了一个框架,可以保持不seen subjects的图像质量和身份,即使使用 synthetic data 进行训练。实验表明,我们的方法可以明显提高重定向性能,特别是在重定向大角度时。
Phase-Specific Augmented Reality Guidance for Microscopic Cataract Surgery Using Long-Short Spatiotemporal Aggregation Transformer
paper_authors: Puxun Tu, Hongfei Ye, Haochen Shi, Jeff Young, Meng Xie, Peiquan Zhao, Ce Zheng, Xiaoyi Jiang, Xiaojun Chen
For: The paper is focused on developing a novel phase-specific augmented reality (AR) guidance system for phacoemulsification cataract surgery (PCS) to enhance intraoperative proficiency.* Methods: The proposed system utilizes a two-stage surgical microscopic video recognition network, including a multi-task learning structure to segment the surgical limbus region and extract spatial features, and a long-short spatiotemporal aggregation transformer (LS-SAT) network to recognize the current surgical phase. The system also incorporates AR visual cues designed in collaboration with ophthalmologists.* Results: The proposed system was evaluated on publicly available and in-house datasets, with comparison results demonstrating its superior performance compared to related works. The system was also evaluated in a clinical setup, with results indicating remarkable accuracy and real-time performance, highlighting its potential for clinical applications.Abstract
Phacoemulsification cataract surgery (PCS) is a routine procedure conducted using a surgical microscope, heavily reliant on the skill of the ophthalmologist. While existing PCS guidance systems extract valuable information from surgical microscopic videos to enhance intraoperative proficiency, they suffer from non-phasespecific guidance, leading to redundant visual information. In this study, our major contribution is the development of a novel phase-specific augmented reality (AR) guidance system, which offers tailored AR information corresponding to the recognized surgical phase. Leveraging the inherent quasi-standardized nature of PCS procedures, we propose a two-stage surgical microscopic video recognition network. In the first stage, we implement a multi-task learning structure to segment the surgical limbus region and extract limbus region-focused spatial feature for each frame. In the second stage, we propose the long-short spatiotemporal aggregation transformer (LS-SAT) network to model local fine-grained and global temporal relationships, and combine the extracted spatial features to recognize the current surgical phase. Additionally, we collaborate closely with ophthalmologists to design AR visual cues by utilizing techniques such as limbus ellipse fitting and regional restricted normal cross-correlation rotation computation. We evaluated the network on publicly available and in-house datasets, with comparison results demonstrating its superior performance compared to related works. Ablation results further validated the effectiveness of the limbus region-focused spatial feature extractor and the combination of temporal features. Furthermore, the developed system was evaluated in a clinical setup, with results indicating remarkable accuracy and real-time performance. underscoring its potential for clinical applications.
摘要
喷洗cataract手术(PCS)是一种常见的手术程序,需要医生高度精准的技巧。现有的PCS导航系统可以从手术微镜视频中提取有价值信息,以提高手术过程中的灵活性,但是这些导航系统却受到非相对阶段的导航,从而导致多余的视觉信息。在本研究中,我们的主要贡献是开发了一种新的相对阶段增强现实(AR)导航系统,该系统可以根据识别到的当前手术阶段提供适应的AR信息。利用手术微镜程序的自然固有标准化特性,我们提议一种两阶段的手术微镜视频识别网络。在第一阶段,我们实施了多任务学习结构,将手术边缘区域分割出来,并对每帧图像提取边缘区域专门的空间特征。在第二阶段,我们提议使用长短距离空间时间汇聚变换(LS-SAT)网络,模型局部细腻和全局时间关系,并将提取的空间特征组合以识别当前手术阶段。此外,我们与眼科医生合作,通过利用技术such as镜面轮廓适应和区域限制正常垂直扩散计算来设计AR视觉提示。我们对公共可用和自有数据集进行评估,与相关工作进行比较,结果显示其性能更高。剥离结果进一步验证了镜边区域专门的空间特征提取器和时间特征的组合的效果。此外,我们开发的系统在临床设置中进行了评估,结果表明其精度和实时性具有很高的潜力。
Learning Sequential Acquisition Policies for Robot-Assisted Feeding
results: 在实际杯子上进行了38个食物获取任务中,VAPORS 比基eline高效得多,可以普遍应对实际杯子上的变化(如顶部和酱料),并在49名用户中的调查中得到了较高的用户满意度。Abstract
A robot providing mealtime assistance must perform specialized maneuvers with various utensils in order to pick up and feed a range of food items. Beyond these dexterous low-level skills, an assistive robot must also plan these strategies in sequence over a long horizon to clear a plate and complete a meal. Previous methods in robot-assisted feeding introduce highly specialized primitives for food handling without a means to compose them together. Meanwhile, existing approaches to long-horizon manipulation lack the flexibility to embed highly specialized primitives into their frameworks. We propose Visual Action Planning OveR Sequences (VAPORS), a framework for long-horizon food acquisition. VAPORS learns a policy for high-level action selection by leveraging learned latent plate dynamics in simulation. To carry out sequential plans in the real world, VAPORS delegates action execution to visually parameterized primitives. We validate our approach on complex real-world acquisition trials involving noodle acquisition and bimanual scooping of jelly beans. Across 38 plates, VAPORS acquires much more efficiently than baselines, generalizes across realistic plate variations such as toppings and sauces, and qualitatively appeals to user feeding preferences in a survey conducted across 49 individuals. Code, datasets, videos, and supplementary materials can be found on our website: https://sites.google.com/view/vaporsbot.
摘要
robot提供协助时需要执行特殊的机械操作,使用各种工具来拾取和给食各种食品。除了灵活的低级技能外,协助 robot还需要规划这些策略,在较长的时间距离内完成整个餐点。现有的机器人协助食物投入方法 introduce highly specialized primitives for food handling without a means to compose them together,而现有的长期机械 manipulate方法缺乏灵活性来嵌入高级特殊 primitives。我们提出了Visual Action Planning OveR Sequences(VAPORS),一种长期食物获取框架。VAPORS学习一个高级行为选择策略,通过在模拟中学习latent plate dynamics来执行。为了在实际世界中执行sequential plans,VAPORS委托action execution给视觉参数化的基本 primitives。我们验证了我们的方法在复杂的实际食物获取任务中,比baseline效果更高,可以 generale across realistic plate variations such as toppings and sauces,并且在49名用户中的调查中获得了负面feeding preferences的评价。代码、数据集、视频和补充材料可以在我们的网站上找到:https://sites.google.com/view/vaporsbot。
Towards Viewpoint Robustness in Bird’s Eye View Segmentation
paper_authors: Tzofi Klinghoffer, Jonah Philion, Wenzheng Chen, Or Litany, Zan Gojcic, Jungseock Joo, Ramesh Raskar, Sanja Fidler, Jose M. Alvarez
For: 这个论文旨在解决自动驾驶车辆(AV)中神经网络的视角不一致问题,以便在多种车辆上部署神经网络模型 без重复收集和标注数据。* Methods: 作者们提出了一种基于鸟瞰视(BEV)分割任务的方法,使用novel view synthesis技术将收集的数据转换到目标摄像头配置的视角下,以便在不同的摄像头配置上训练BEV分割模型。* Results: 作者们通过广泛的实验发现,现有的感知模型具有较大的视角不一致敏感度,当训练数据来自特定的摄像头配置时,小量的视角变化会导致大幅下降在性能。作者们的方法能够恢复约14.7%的 intersectioin over union(IoU),即使在新的摄像头配置上部署模型。Abstract
Autonomous vehicles (AV) require that neural networks used for perception be robust to different viewpoints if they are to be deployed across many types of vehicles without the repeated cost of data collection and labeling for each. AV companies typically focus on collecting data from diverse scenarios and locations, but not camera rig configurations, due to cost. As a result, only a small number of rig variations exist across most fleets. In this paper, we study how AV perception models are affected by changes in camera viewpoint and propose a way to scale them across vehicle types without repeated data collection and labeling. Using bird's eye view (BEV) segmentation as a motivating task, we find through extensive experiments that existing perception models are surprisingly sensitive to changes in camera viewpoint. When trained with data from one camera rig, small changes to pitch, yaw, depth, or height of the camera at inference time lead to large drops in performance. We introduce a technique for novel view synthesis and use it to transform collected data to the viewpoint of target rigs, allowing us to train BEV segmentation models for diverse target rigs without any additional data collection or labeling cost. To analyze the impact of viewpoint changes, we leverage synthetic data to mitigate other gaps (content, ISP, etc). Our approach is then trained on real data and evaluated on synthetic data, enabling evaluation on diverse target rigs. We release all data for use in future work. Our method is able to recover an average of 14.7% of the IoU that is otherwise lost when deploying to new rigs.
摘要
HiLM-D: Towards High-Resolution Understanding in Multimodal Large Language Models for Autonomous Driving
paper_authors: Xinpeng Ding, Jianhua Han, Hang Xu, Wei Zhang, Xiaomeng Li
for: 这 paper 是为了推动自动驾驶系统的多任务合一化,使用单一的语言模型来把多个自动驾驶任务从视频中提取出来。
methods: 这 paper 使用了 singular multimodal large language models (MLLMs) 来把多个自动驾驶任务从视频中提取出来,并且提出了一种 efficient method 来 incorporate high-resolution (HR) information into MLLMs。
results: 实验结果显示,与现有的 MLLMs 相比,HiLM-D 在 ROLISP 任务上表现出色,提高了 4.8% 的 BLEU-4 和 17.2% 的 mIoU。Abstract
Autonomous driving systems generally employ separate models for different tasks resulting in intricate designs. For the first time, we leverage singular multimodal large language models (MLLMs) to consolidate multiple autonomous driving tasks from videos, i.e., the Risk Object Localization and Intention and Suggestion Prediction (ROLISP) task. ROLISP uses natural language to simultaneously identify and interpret risk objects, understand ego-vehicle intentions, and provide motion suggestions, eliminating the necessity for task-specific architectures. However, lacking high-resolution (HR) information, existing MLLMs often miss small objects (e.g., traffic cones) and overly focus on salient ones (e.g., large trucks) when applied to ROLISP. We propose HiLM-D (Towards High-Resolution Understanding in MLLMs for Autonomous Driving), an efficient method to incorporate HR information into MLLMs for the ROLISP task. Especially, HiLM-D integrates two branches: (i) the low-resolution reasoning branch, can be any MLLMs, processes low-resolution videos to caption risk objects and discern ego-vehicle intentions/suggestions; (ii) the high-resolution perception branch (HR-PB), prominent to HiLM-D,, ingests HR images to enhance detection by capturing vision-specific HR feature maps and prioritizing all potential risks over merely salient objects. Our HR-PB serves as a plug-and-play module, seamlessly fitting into current MLLMs. Experiments on the ROLISP benchmark reveal HiLM-D's notable advantage over leading MLLMs, with improvements of 4.8% in BLEU-4 for captioning and 17.2% in mIoU for detection.
摘要
自适应驾驶系统通常采用分离的模型来处理不同任务,导致设计变得复杂。我们是第一个使用单一多Modal大语言模型(MLLMs)将多个自适应驾驶任务从视频中整合,即风险对象本地化和建议预测(ROLISP)任务。ROLISP使用自然语言同时识别和解释风险对象,理解ego汽车的意图,并提供动作建议,从而消除任务特定的建筑。然而,由于缺乏高分辨率(HR)信息,现有的MLLMs经常会遗弃小对象(例如交通标志),而偏重于突出对象(例如大卡车)。我们提出了HiLM-D(推向高分辨率理解在MLLMs中的自适应驾驶),一种高效的方法,将HR信息 integrate到MLLMs中。尤其是HiLM-D具有两个分支:(i)低分辨率理解分支,可以是任何MLLMs,处理低分辨率视频,描述风险对象和ego汽车的意图/建议;(ii)高分辨率感知分支(HR-PB),特点在HiLM-D中,通过捕捉视觉特定的HR特征图和优先处理所有风险对象,提高检测精度。我们的HR-PB作为插件模块,顺利地适配现有MLLMs。实验表明HiLM-D在ROLISP数据集上表现出色,与领先MLLMs相比,提高了4.8%的BLEU-4措词率和17.2%的mIoU检测精度。