results: The model is validated on two attributes (sex and age), with experiments under various attackers and datasets. Abstract
Speaker embeddings are ubiquitous, with applications ranging from speaker recognition and diarization to speech synthesis and voice anonymisation. The amount of information held by these embeddings lends them versatility, but also raises privacy concerns. Speaker embeddings have been shown to contain information on age, sex, health and more, which speakers may want to keep private, especially when this information is not required for the target task. In this work, we propose a method for removing and manipulating private attributes from speaker embeddings that leverages a Vector-Quantized Variational Autoencoder architecture, combined with an adversarial classifier and a novel mutual information loss. We validate our model on two attributes, sex and age, and perform experiments with ignorant and fully-informed attackers, and with in-domain and out-of-domain data.
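The abstract does not spell out the paper's novel mutual information loss, so the sketch below pairs a VQ-VAE bottleneck with a generic gradient-reversal adversary as a stand-in for the attribute-removal objective; the `vqvae` and `adv_clf` interfaces are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negated gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def privacy_loss(vqvae, adv_clf, embedding, attr_label, lam=1.0, beta=0.25):
    # assumed interface: vqvae returns (reconstruction, quantised code, commitment loss)
    recon, z_q, commit = vqvae(embedding)
    rec_loss = F.mse_loss(recon, embedding)
    # the adversary learns to predict the private attribute from the code,
    # while the reversed gradient pushes the encoder to remove that information
    logits = adv_clf(GradReverse.apply(z_q, lam))
    adv_loss = F.cross_entropy(logits, attr_label)
    return rec_loss + beta * commit + adv_loss
```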
Discriminative Speech Recognition Rescoring with Pre-trained Language Models
paper_authors: Prashanth Gurunath Shivakumar, Jari Kolehmainen, Yile Gu, Ankur Gandhe, Ariya Rastrow, Ivan Bulyko
for: Improving the competitiveness of automatic speech recognition (ASR) systems
methods: Discriminative fine-tuning schemes for pre-trained language models (LMs)
results: On LibriSpeech, all MWER training schemes are beneficial, giving additional gains of up to 8.5% WER; the pooling variants achieve lower latency while retaining most of the improvements; bidirectional LMs make better use of discriminative training. Abstract
Second pass rescoring is a critical component of competitive automatic speech recognition (ASR) systems. Large language models have demonstrated their ability to use pre-trained information for better rescoring of ASR hypotheses. Discriminative training, directly optimizing the minimum word-error-rate (MWER) criterion, typically improves rescoring. In this study, we propose and explore several discriminative fine-tuning schemes for pre-trained LMs. We propose two architectures based on different pooling strategies of output embeddings and compare with probability based MWER. We conduct detailed comparisons between pre-trained causal and bidirectional LMs in discriminative settings. Experiments on LibriSpeech demonstrate that all MWER training schemes are beneficial, giving additional gains of up to 8.5\% WER. Proposed pooling variants achieve lower latency while retaining most improvements. Finally, our study concludes that bidirectionality is better utilized with discriminative training.
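For reference, a minimal sketch of the probability-based MWER objective over an n-best list follows; the baseline-subtracted form is the common formulation, and the paper's exact schemes (including the pooling variants) may differ.

```python
import torch

def mwer_loss(scores, word_errors):
    """Minimum word-error-rate loss over an n-best list.
    scores:      (batch, n_best) rescoring scores, higher = better hypothesis
    word_errors: (batch, n_best) word-error counts of each hypothesis
    """
    probs = torch.softmax(scores, dim=-1)              # hypothesis posteriors
    baseline = word_errors.mean(dim=-1, keepdim=True)  # variance-reduction baseline
    return (probs * (word_errors - baseline)).sum(dim=-1).mean()
```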
results: The transformer-based 2D model achieves the highest Dice score of 0.583, and the residual U-Net 3D model achieves 0.504; a Wilcoxon test relates predicted and actual stroke volumes. Abstract
Brain stroke has become a significant burden on global health and thus we need remedies and prevention strategies to overcome this challenge. For this, the immediate identification of stroke and risk stratification is the primary task for clinicians. To aid expert clinicians, automated segmentation models are crucial. In this work, we consider the publicly available dataset ATLAS $v2.0$ to benchmark various end-to-end supervised U-Net style models. Specifically, we have benchmarked models on both 2D and 3D brain images and evaluated them using standard metrics. We have achieved the highest Dice score of 0.583 on the 2D transformer-based model and 0.504 on the 3D residual U-Net respectively. We have conducted the Wilcoxon test for 3D models to correlate the relationship between predicted and actual stroke volume. For reproducibility, the code and model weights are made publicly available: https://github.com/prantik-pdeb/BeSt-LeS.
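For reference, the Dice score used to report the 0.583 / 0.504 results is the standard overlap measure; a minimal implementation for binary lesion masks:

```python
import numpy as np

def dice_score(pred, target, eps=1e-7):
    """Dice coefficient between binary masks (lesion vs. background)."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
```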
TextPSG: Panoptic Scene Graph Generation from Textual Descriptions
results: Our method significantly outperforms the baselines and achieves strong out-of-distribution robustness. We perform comprehensive ablation studies to corroborate the effectiveness of our design choices and provide an in-depth analysis to guide future research. Abstract
Panoptic Scene Graph has recently been proposed for comprehensive scene understanding. However, previous works adopt a fully-supervised learning manner, requiring large amounts of pixel-wise densely-annotated data, which is always tedious and expensive to obtain. To address this limitation, we study a new problem of Panoptic Scene Graph Generation from Purely Textual Descriptions (Caption-to-PSG). The key idea is to leverage the large collection of free image-caption data on the Web alone to generate panoptic scene graphs. The problem is very challenging for three constraints: 1) no location priors; 2) no explicit links between visual regions and textual entities; and 3) no pre-defined concept sets. To tackle this problem, we propose a new framework TextPSG consisting of four modules, i.e., a region grouper, an entity grounder, a segment merger, and a label generator, with several novel techniques. The region grouper first groups image pixels into different segments and the entity grounder then aligns visual segments with language entities based on the textual description of the segment being referred to. The grounding results can thus serve as pseudo labels enabling the segment merger to learn the segment similarity as well as guiding the label generator to learn object semantics and relation predicates, resulting in a fine-grained structured scene understanding. Our framework is effective, significantly outperforming the baselines and achieving strong out-of-distribution robustness. We perform comprehensive ablation studies to corroborate the effectiveness of our design choices and provide an in-depth analysis to highlight future directions. Our code, data, and results are available on our project page: https://vis-www.cs.umass.edu/TextPSG.
Utilizing Synthetic Data for Medical Vision-Language Pre-training: Bypassing the Need for Real Images
results: Our experimental results show that synthetic medical images achieve performance on par with, or even exceeding, that of real images. We also introduce a large-scale synthetic medical image dataset paired with anonymized real radiology reports. Abstract
Medical Vision-Language Pre-training (VLP) learns representations jointly from medical images and paired radiology reports. It typically requires large-scale paired image-text datasets to achieve effective pre-training for both the image encoder and text encoder. The advent of text-guided generative models raises a compelling question: Can VLP be implemented solely with synthetic images generated from genuine radiology reports, thereby mitigating the need for extensively pairing and curating image-text datasets? In this work, we scrutinize this very question by examining the feasibility and effectiveness of employing synthetic images for medical VLP. We replace real medical images with their synthetic equivalents, generated from authentic medical reports. Utilizing three state-of-the-art VLP algorithms, we exclusively train on these synthetic samples. Our empirical evaluation across three subsequent tasks, namely image classification, semantic segmentation and object detection, reveals that the performance achieved through synthetic data is on par with or even exceeds that obtained with real images. As a pioneering contribution to this domain, we introduce a large-scale synthetic medical image dataset, paired with anonymized real radiology reports. This alleviates the need of sharing medical images, which are not easy to curate and share in practice. The code and the dataset will be made publicly available upon paper acceptance.
Pre-Trained Masked Image Model for Mobile Robot Navigation
results: The paper shows that foundational vision models generalize across input modalities without fine-tuning and can reduce task time when training data is scarce. For more details see https://raaslab.org/projects/MIM4Robots. Abstract
2D top-down maps are commonly used for the navigation and exploration of mobile robots through unknown areas. Typically, the robot builds the navigation maps incrementally from local observations using onboard sensors. Recent works have shown that predicting the structural patterns in the environment through learning-based approaches can greatly enhance task efficiency. While many such works build task-specific networks using limited datasets, we show that the existing foundational vision networks can accomplish the same without any fine-tuning. Specifically, we use Masked Autoencoders, pre-trained on street images, to present novel applications for field-of-view expansion, single-agent topological exploration, and multi-agent exploration for indoor mapping, across different input modalities. Our work motivates the use of foundational vision models for generalized structure prediction-driven applications, especially in the dearth of training data. For more qualitative results see https://raaslab.org/projects/MIM4Robots.
Zero-Shot Open-Vocabulary Tracking with Large Pre-Trained Models
results: The method achieves strong performance on multiple established video object segmentation and tracking benchmarks and can re-identify objects across occlusions. Abstract
Object tracking is central to robot perception and scene understanding. Tracking-by-detection has long been a dominant paradigm for object tracking of specific object categories. Recently, large-scale pre-trained models have shown promising advances in detecting and segmenting objects and parts in 2D static images in the wild. This begs the question: can we re-purpose these large-scale pre-trained static image models for open-vocabulary video tracking? In this paper, we re-purpose an open-vocabulary detector, segmenter, and dense optical flow estimator, into a model that tracks and segments objects of any category in 2D videos. Our method predicts object and part tracks with associated language descriptions in monocular videos, rebuilding the pipeline of Tractor with modern large pre-trained models for static image detection and segmentation: we detect open-vocabulary object instances and propagate their boxes from frame to frame using a flow-based motion model, refine the propagated boxes with the box regression module of the visual detector, and prompt an open-world segmenter with the refined box to segment the objects. We decide the termination of an object track based on the objectness score of the propagated boxes, as well as forward-backward optical flow consistency. We re-identify objects across occlusions using deep feature matching. We show that our model achieves strong performance on multiple established video object segmentation and tracking benchmarks, and can produce reasonable tracks in manipulation data. In particular, our model outperforms previous state-of-the-art in UVO and BURST, benchmarks for open-world object tracking and segmentation, despite never being explicitly trained for tracking. We hope that our approach can serve as a simple and extensible framework for future research.
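A rough sketch of the tracking loop described above; the `detector`, `segmenter`, and `flow_model` objects stand in for the actual pre-trained models, and the forward-backward flow consistency check and re-identification step are omitted for brevity:

```python
import numpy as np

def propagate_box(box, flow):
    """Shift a box by the mean optical flow inside it (flow-based motion model)."""
    x0, y0, x1, y1 = (int(v) for v in box)
    dx = float(flow[y0:y1, x0:x1, 0].mean())
    dy = float(flow[y0:y1, x0:x1, 1].mean())
    return [x0 + dx, y0 + dy, x1 + dx, y1 + dy]

def track(frames, detector, segmenter, flow_model, objectness_thresh=0.5):
    tracks = detector.detect(frames[0])                 # open-vocabulary instances
    for prev, cur in zip(frames, frames[1:]):
        flow = flow_model(prev, cur)                    # dense optical flow, (H, W, 2)
        for t in [t for t in tracks if t.active]:
            box = propagate_box(t.box, flow)            # propagate frame to frame
            box, objectness = detector.refine(cur, box) # box-regression refinement
            if objectness < objectness_thresh:          # terminate weak tracks
                t.active = False
                continue
            t.box = box
            t.mask = segmenter.segment(cur, box)        # prompt segmenter with the box
    return tracks
```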
Leveraging Neural Radiance Fields for Uncertainty-Aware Visual Localization
results: Experiments on public datasets show that our method selects the samples with the highest information gain and improves the data efficiency and accuracy of SCR. Abstract
As a promising fashion for visual localization, scene coordinate regression (SCR) has seen tremendous progress in the past decade. Most recent methods usually adopt neural networks to learn the mapping from image pixels to 3D scene coordinates, which requires a vast amount of annotated training data. We propose to leverage Neural Radiance Fields (NeRF) to generate training samples for SCR. Despite NeRF's efficiency in rendering, many of the rendered data are polluted by artifacts or only contain minimal information gain, which can hinder the regression accuracy or bring unnecessary computational costs with redundant data. These challenges are addressed in three folds in this paper: (1) A NeRF is designed to separately predict uncertainties for the rendered color and depth images, which reveal data reliability at the pixel level. (2) SCR is formulated as deep evidential learning with epistemic uncertainty, which is used to evaluate information gain and scene coordinate quality. (3) Based on the three arts of uncertainties, a novel view selection policy is formed that significantly improves data efficiency. Experiments on public datasets demonstrate that our method could select the samples that bring the most information gain and promote the performance with the highest efficiency.
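The abstract does not give the exact view-selection rule, so the sketch below is one plausible scoring scheme under the stated ingredients: rendered views with low color/depth uncertainty are trusted more, and among those, high epistemic (SCR) uncertainty signals high information gain. The candidate attributes are illustrative assumptions.

```python
def select_views(candidates, k=64, w_color=1.0, w_depth=1.0):
    """Rank NeRF-rendered candidate views by an assumed information-gain score."""
    def score(view):
        reliability = 1.0 / (1.0 + w_color * view.color_uncertainty
                                 + w_depth * view.depth_uncertainty)
        return reliability * view.epistemic_uncertainty  # reliable AND informative
    return sorted(candidates, key=score, reverse=True)[:k]
```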
Data Distillation Can Be Like Vodka: Distilling More Times For Better Quality
results: Our experiments show that PDD effectively improves the performance of existing dataset distillation methods by up to 4.3%; in addition, our method enables the generation of considerably larger synthetic datasets. Abstract
Dataset distillation aims to minimize the time and memory needed for training deep networks on large datasets, by creating a small set of synthetic images that has a similar generalization performance to that of the full dataset. However, current dataset distillation techniques fall short, showing a notable performance gap when compared to training on the original data. In this work, we are the first to argue that using just one synthetic subset for distillation will not yield optimal generalization performance. This is because the training dynamics of deep networks drastically change during the training. Hence, multiple synthetic subsets are required to capture the training dynamics at different phases of training. To address this issue, we propose Progressive Dataset Distillation (PDD). PDD synthesizes multiple small sets of synthetic images, each conditioned on the previous sets, and trains the model on the cumulative union of these subsets without requiring additional training time. Our extensive experiments show that PDD can effectively improve the performance of existing dataset distillation methods by up to 4.3%. In addition, our method for the first time enables the generation of considerably larger synthetic datasets.
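The progressive structure reduces to a short outer loop; the sketch below treats synthesis and training as black-box callables, which is an illustrative simplification of PDD:

```python
def progressive_dataset_distillation(real_data, synthesize, train, model, n_stages=3):
    """Each stage distills a new small subset conditioned on the subsets so far;
    the model trains on the cumulative union of all stages."""
    subsets = []
    for stage in range(n_stages):
        new_subset = synthesize(real_data, condition=subsets, model=model)
        subsets.append(new_subset)
        cumulative = [sample for subset in subsets for sample in subset]
        model = train(model, cumulative)   # train on the union of all stages
    return subsets, model
```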
ObjectComposer: Consistent Generation of Multiple Objects Without Fine-tuning
results: ObjectComposer generates high-quality compositions of multiple objects while preserving the user-specified appearance and configuration of each object. The approach requires no modification of the underlying models' weights and can be used at scale and in real time. Abstract
Recent text-to-image generative models can generate high-fidelity images from text prompts. However, these models struggle to consistently generate the same objects in different contexts with the same appearance. Consistent object generation is important to many downstream tasks like generating comic book illustrations with consistent characters and setting. Numerous approaches attempt to solve this problem by extending the vocabulary of diffusion models through fine-tuning. However, even lightweight fine-tuning approaches can be prohibitively expensive to run at scale and in real-time. We introduce a method called ObjectComposer for generating compositions of multiple objects that resemble user-specified images. Our approach is training-free, leveraging the abilities of preexisting models. We build upon the recent BLIP-Diffusion model, which can generate images of single objects specified by reference images. ObjectComposer enables the consistent generation of compositions containing multiple specific objects simultaneously, all without modifying the weights of the underlying models.
Comparing the robustness of modern no-reference image- and video-quality metrics to adversarial attacks
results: The study finds that some metrics show high resistance to adversarial attacks, making them safer to use in benchmarks. The benchmark accepts new metric submissions, inviting researchers to make their metrics more robust to attacks or to find safer metrics for their needs. Users can try it via pip install robustness-benchmark. Abstract
Nowadays neural-network-based image- and video-quality metrics show better performance compared to traditional methods. However, they also became more vulnerable to adversarial attacks that increase metrics' scores without improving visual quality. The existing benchmarks of quality metrics compare their performance in terms of correlation with subjective quality and calculation time. However, the adversarial robustness of image-quality metrics is also an area worth researching. In this paper, we analyse modern metrics' robustness to different adversarial attacks. We adopted adversarial attacks from computer vision tasks and compared attacks' efficiency against 15 no-reference image/video-quality metrics. Some metrics showed high resistance to adversarial attacks which makes their usage in benchmarks safer than vulnerable metrics. The benchmark accepts new metrics submissions for researchers who want to make their metrics more robust to attacks or to find such metrics for their needs. Try our benchmark using pip install robustness-benchmark.
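The paper adopts its attacks from computer vision tasks; as a generic illustration of the threat model, the PGD-style sketch below perturbs an image within a small eps-ball to inflate a differentiable no-reference metric's score without improving visual quality. The step sizes and the `metric` callable are illustrative, not the benchmark's exact attack.

```python
import torch

def attack_metric(metric, image, steps=10, eps=4/255, alpha=1/255):
    """Gradient-ascent attack that inflates a differentiable NR quality score."""
    adv = image.clone()
    for _ in range(steps):
        adv = adv.detach().requires_grad_(True)
        score = metric(adv).mean()                    # higher = "better quality"
        grad, = torch.autograd.grad(score, adv)
        adv = adv + alpha * grad.sign()               # ascend the metric's score
        adv = image + (adv - image).clamp(-eps, eps)  # stay within the eps-ball
        adv = adv.clamp(0, 1)
    return adv.detach()
```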
End-to-end Evaluation of Practical Video Analytics Systems for Face Detection and Recognition
results: Through a thorough end-to-end evaluation on a driving-specific dataset, the paper shows that independent module evaluations, dataset imbalances, and inconsistent annotations can lead to incorrect estimates of system performance. It proposes strategies for creating balanced evaluation subsets and for making annotations consistent across datasets and analytics tasks, then evaluates end-to-end system performance sequentially to account for task interdependencies. Experiments show that the approach provides consistent, accurate, and interpretable estimates of system performance, which is critical for real-world applications. Abstract
Practical video analytics systems that are deployed in bandwidth constrained environments like autonomous vehicles perform computer vision tasks such as face detection and recognition. In an end-to-end face analytics system, inputs are first compressed using popular video codecs like HEVC and then passed onto modules that perform face detection, alignment, and recognition sequentially. Typically, the modules of these systems are evaluated independently using task-specific imbalanced datasets that can misconstrue performance estimates. In this paper, we perform a thorough end-to-end evaluation of a face analytics system using a driving-specific dataset, which enables meaningful interpretations. We demonstrate how independent task evaluations, dataset imbalances, and inconsistent annotations can lead to incorrect system performance estimates. We propose strategies to create balanced evaluation subsets of our dataset and to make its annotations consistent across multiple analytics tasks and scenarios. We then evaluate the end-to-end system performance sequentially to account for task interdependencies. Our experiments show that our approach provides consistent, accurate, and interpretable estimates of the system's performance which is critical for real-world applications.
Self-supervised Object-Centric Learning for Videos
results: Our method successfully segments multiple instances of complex and high-variety classes in YouTube video sequences and delivers high-quality segmentations. Abstract
Unsupervised multi-object segmentation has shown impressive results on images by utilizing powerful semantics learned from self-supervised pretraining. An additional modality such as depth or motion is often used to facilitate the segmentation in video sequences. However, the performance improvements observed in synthetic sequences, which rely on the robustness of an additional cue, do not translate to more challenging real-world scenarios. In this paper, we propose the first fully unsupervised method for segmenting multiple objects in real-world sequences. Our object-centric learning framework spatially binds objects to slots on each frame and then relates these slots across frames. From these temporally-aware slots, the training objective is to reconstruct the middle frame in a high-level semantic feature space. We propose a masking strategy by dropping a significant portion of tokens in the feature space for efficiency and regularization. Additionally, we address over-clustering by merging slots based on similarity. Our method can successfully segment multiple instances of complex and high-variety classes in YouTube videos.
Distillation Improves Visual Place Recognition for Low-Quality Queries
results: On the Pittsburgh 250k dataset and our own indoor dataset, fine-tuning NetVLAD parameters with our distillation-augmented losses yields notable improvements in visual place recognition for low-quality query images. Abstract
The shift to online computing for real-time visual localization often requires streaming query images/videos to a server for visual place recognition (VPR), where fast video transmission may result in reduced resolution or increased quantization. This compromises the quality of global image descriptors, leading to decreased VPR performance. To improve the low recall rate for low-quality query images, we present a simple yet effective method that uses high-quality queries only during training to distill better feature representations for deep-learning-based VPR, such as NetVLAD. Specifically, we use mean squared error (MSE) loss between the global descriptors of queries with different qualities, and inter-channel correlation knowledge distillation (ICKD) loss over their corresponding intermediate features. We validate our approach using the both Pittsburgh 250k dataset and our own indoor dataset with varying quantization levels. By fine-tuning NetVLAD parameters with our distillation-augmented losses, we achieve notable VPR recall-rate improvements over low-quality queries, as demonstrated in our extensive experimental results. We believe this work not only pushes forward the VPR research but also provides valuable insights for applications needing dependable place recognition under resource-limited conditions.
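A minimal sketch of the two distillation terms named above, assuming paired high-/low-quality versions of each query; the inter-channel correlation term follows the general ICKD idea (Gram-style channel correlations) and may differ in detail from the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def distill_losses(desc_hq, desc_lq, feat_hq, feat_lq, w_ickd=1.0):
    """MSE between global descriptors plus ICKD-style correlation matching.
    desc_*: (B, D) global descriptors; feat_*: (B, C, H, W) intermediate features."""
    mse = F.mse_loss(desc_lq, desc_hq.detach())

    def channel_corr(f):
        b, c = f.shape[:2]
        f = F.normalize(f.reshape(b, c, -1), dim=-1)
        return f @ f.transpose(1, 2)      # (B, C, C) inter-channel correlations

    ickd = F.mse_loss(channel_corr(feat_lq), channel_corr(feat_hq).detach())
    return mse + w_ickd * ickd
```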
Mitigating stereotypical biases in text to image generative systems
results: Compared to baselines, the DFT model improves the group fairness metric by 150% for perceived skin tone and 97.7% for perceived gender, and generates more people with perceived darker skin tones and more women. The authors will release all text prompts and code to foster open research. Abstract
State-of-the-art generative text-to-image models are known to exhibit social biases and over-represent certain groups like people of perceived lighter skin tones and men in their outcomes. In this work, we propose a method to mitigate such biases and ensure that the outcomes are fair across different groups of people. We do this by finetuning text-to-image models on synthetic data that varies in perceived skin tones and genders constructed from diverse text prompts. These text prompts are constructed from multiplicative combinations of ethnicities, genders, professions, age groups, and so on, resulting in diverse synthetic data. Our diversity finetuned (DFT) model improves the group fairness metric by 150% for perceived skin tone and 97.7% for perceived gender. Compared to baselines, DFT models generate more people with perceived darker skin tone and more women. To foster open research, we will release all text prompts and code to generate training images.
AutoAD II: The Sequel – Who, When, and What in Movie Audio Description
paper_authors: Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman
for: Automatically generating movie Audio Description (AD) to help visually impaired audiences follow the storyline.
methods: The model generates AD from CLIP visual features of the frames, the cast list, and the temporal locations of the speech, addressing the 'who', 'when', and 'what' questions: (i) who -- a character bank of names, actors, and CLIP face features for the principal cast improves naming in the generated AD; (ii) when -- models based on the visual content of a time interval and its neighbours decide whether an AD should be generated and at what point; (iii) what -- a vision-language model ingests proposals from the character bank while conditioning on the visual features with cross-attention.
results: The proposed model improves AD text generation quality, outperforming previous architectures in an apples-to-apples comparison. Abstract
Audio Description (AD) is the task of generating descriptions of visual content, at suitable time intervals, for the benefit of visually impaired audiences. For movies, this presents notable challenges -- AD must occur only during existing pauses in dialogue, should refer to characters by name, and ought to aid understanding of the storyline as a whole. To this end, we develop a new model for automatically generating movie AD, given CLIP visual features of the frames, the cast list, and the temporal locations of the speech; addressing all three of the 'who', 'when', and 'what' questions: (i) who -- we introduce a character bank consisting of the character's name, the actor that played the part, and a CLIP feature of their face, for the principal cast of each movie, and demonstrate how this can be used to improve naming in the generated AD; (ii) when -- we investigate several models for determining whether an AD should be generated for a time interval or not, based on the visual content of the interval and its neighbours; and (iii) what -- we implement a new vision-language model for this task, that can ingest the proposals from the character bank, whilst conditioning on the visual features using cross-attention, and demonstrate how this improves over previous architectures for AD text generation in an apples-to-apples comparison.
What Does Stable Diffusion Know about the 3D Scene?
results: The authors find that Stable Diffusion is good at a number of properties, including scene geometry, support relations, shadows, and depth, but less performant for occlusion. Probing other large-scale models such as DINO and CLIP shows their performance to be inferior to Stable Diffusion's. Abstract
Recent advances in generative models like Stable Diffusion enable the generation of highly photo-realistic images. Our objective in this paper is to probe the diffusion network to determine to what extent it 'understands' different properties of the 3D scene depicted in an image. To this end, we make the following contributions: (i) We introduce a protocol to evaluate whether a network models a number of physical 'properties' of the 3D scene by probing for explicit features that represent these properties. The probes are applied on datasets of real images with annotations for the property. (ii) We apply this protocol to properties covering scene geometry, scene material, support relations, lighting, and view dependent measures. (iii) We find that Stable Diffusion is good at a number of properties including scene geometry, support relations, shadows and depth, but less performant for occlusion. (iv) We also apply the probes to other models trained at large-scale, including DINO and CLIP, and find their performance inferior to that of Stable Diffusion.
results: The proposed neural bounding produces up to an order of magnitude fewer false positives than traditional methods. Abstract
Bounding volumes are an established concept in computer graphics and vision tasks but have seen little change since their early inception. In this work, we study the use of neural networks as bounding volumes. Our key observation is that bounding, which so far has primarily been considered a problem of computational geometry, can be redefined as a problem of learning to classify space into free or occupied. This learning-based approach is particularly advantageous in high-dimensional spaces, such as animated scenes with complex queries, where neural networks are known to excel. However, unlocking neural bounding requires a twist: allowing -- but also limiting -- false positives, while ensuring that the number of false negatives is strictly zero. We enable such tight and conservative results using a dynamically-weighted asymmetric loss function. Our results show that our neural bounding produces up to an order of magnitude fewer false positives than traditional methods.
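A minimal sketch of the dynamically-weighted asymmetric objective: false negatives (calling occupied space free) are weighted far more heavily than false positives, and the weight can be scheduled upward during training until false negatives vanish. The exact weighting schedule in the paper may differ.

```python
import torch
import torch.nn.functional as F

def asymmetric_bounding_loss(logits, occupied, fn_weight=100.0):
    """logits: network scores for 'occupied'; occupied: 0/1 ground-truth floats."""
    bce = F.binary_cross_entropy_with_logits(logits, occupied, reduction="none")
    weights = torch.ones_like(bce)
    weights[occupied > 0.5] = fn_weight   # missing occupied space is far costlier
    return (weights * bce).mean()
```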
TopoMLP: A Simple yet Strong Pipeline for Driving Topology Reasoning
results: Building on the strong detection performance, we develop two simple MLP-based heads for topology generation. TopoMLP achieves state-of-the-art performance on the OpenLane-V2 benchmark, i.e., 41.2% OLS with a ResNet-50 backbone. It is also the winning solution of the 1st OpenLane Topology in Autonomous Driving Challenge. We hope this simple yet strong pipeline can offer the community some new insights. Code is available at https://github.com/wudongming97/TopoMLP. Abstract
Topology reasoning aims to comprehensively understand road scenes and present drivable routes in autonomous driving. It requires detecting road centerlines (lane) and traffic elements, further reasoning their topology relationship, i.e., lane-lane topology, and lane-traffic topology. In this work, we first present that the topology score relies heavily on detection performance on lane and traffic elements. Therefore, we introduce a powerful 3D lane detector and an improved 2D traffic element detector to extend the upper limit of topology performance. Further, we propose TopoMLP, a simple yet high-performance pipeline for driving topology reasoning. Based on the impressive detection performance, we develop two simple MLP-based heads for topology generation. TopoMLP achieves state-of-the-art performance on OpenLane-V2 benchmark, i.e., 41.2% OLS with ResNet-50 backbone. It is also the 1st solution for 1st OpenLane Topology in Autonomous Driving Challenge. We hope such simple and strong pipeline can provide some new insights to the community. Code is at https://github.com/wudongming97/TopoMLP.
HiFi-123: Towards High-fidelity One Image to 3D Content Generation
paper_authors: Wangbo Yu, Li Yuan, Yan-Pei Cao, Xiangjun Gao, Xiaoyu Li, Long Quan, Ying Shan, Yonghong Tian
for: High-fidelity, multi-view consistent image-to-3D content generation
methods: A reference-guided novel view enhancement technique and a reference-guided state distillation loss
results: Compared with existing methods, the approach achieves high-fidelity, multi-view consistent image-to-3D generation, reaching state-of-the-art performance. Abstract
Recent advances in text-to-image diffusion models have enabled 3D generation from a single image. However, current image-to-3D methods often produce suboptimal results for novel views, with blurred textures and deviations from the reference image, limiting their practical applications. In this paper, we introduce HiFi-123, a method designed for high-fidelity and multi-view consistent 3D generation. Our contributions are twofold: First, we propose a reference-guided novel view enhancement technique that substantially reduces the quality gap between synthesized and reference views. Second, capitalizing on the novel view enhancement, we present a novel reference-guided state distillation loss. When incorporated into the optimization-based image-to-3D pipeline, our method significantly improves 3D generation quality, achieving state-of-the-art performance. Comprehensive evaluations demonstrate the effectiveness of our approach over existing methods, both qualitatively and quantitatively.
Multi-domain improves out-of-distribution and data-limited scenarios for medical image analysis
results: Compared to specialized models, multi-domain models perform better in data-limited and out-of-distribution settings, which are frequently encountered in healthcare applications. Multi-domain models better exploit information shared across domains, improving overall outcomes; for organ recognition, for example, accuracy improves by up to 10%. Abstract
Current machine learning methods for medical image analysis primarily focus on developing models tailored for their specific tasks, utilizing data within their target domain. These specialized models tend to be data-hungry and often exhibit limitations in generalizing to out-of-distribution samples. Recently, foundation models have been proposed, which combine data from various domains and demonstrate excellent generalization capabilities. Building upon this, this work introduces the incorporation of diverse medical image domains, including different imaging modalities like X-ray, MRI, CT, and ultrasound images, as well as various viewpoints such as axial, coronal, and sagittal views. We refer to this approach as multi-domain model and compare its performance to that of specialized models. Our findings underscore the superior generalization capabilities of multi-domain models, particularly in scenarios characterized by limited data availability and out-of-distribution, frequently encountered in healthcare applications. The integration of diverse data allows multi-domain models to utilize shared information across domains, enhancing the overall outcomes significantly. To illustrate, for organ recognition, multi-domain model can enhance accuracy by up to 10% compared to conventional specialized models.
Domain Generalization by Rejecting Extreme Augmentations
paper_authors: Masih Aminbeidokhti, Fidel A. Guerrero Peña, Heitor Rapela Medeiros, Thomas Dubail, Eric Granger, Marco Pedersoli
for: Studying the effect of data augmentation on deep learning models and how to improve recognition performance across domains and environments.
methods: A simple training procedure: (i) uniform sampling over standard data augmentation transformations; (ii) increased transformation strength to account for the higher data variance expected when working out-of-domain; (iii) a new reward function that rejects extreme transformations harmful to training.
results: The data augmentation scheme achieves accuracy comparable to or better than state-of-the-art methods on benchmark domain generalization datasets. Code: \url{https://github.com/Masseeh/DCAug}. Abstract
Data augmentation is one of the most effective techniques for regularizing deep learning models and improving their recognition performance in a variety of tasks and domains. However, this holds for standard in-domain settings, in which the training and test data follow the same distribution. For the out-of-domain case, where the test data follow a different and unknown distribution, the best recipe for data augmentation is unclear. In this paper, we show that for out-of-domain and domain generalization settings, data augmentation can provide a conspicuous and robust improvement in performance. To do that, we propose a simple training procedure: (i) use uniform sampling on standard data augmentation transformations; (ii) increase the strength transformations to account for the higher data variance expected when working out-of-domain, and (iii) devise a new reward function to reject extreme transformations that can harm the training. With this procedure, our data augmentation scheme achieves a level of accuracy that is comparable to or better than state-of-the-art methods on benchmark domain generalization datasets. Code: \url{https://github.com/Masseeh/DCAug}
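A compact sketch of the three-step recipe above, with an illustrative reward (the model's confidence in the true label on the augmented sample) standing in for the paper's reward function; the transform interface is an assumption.

```python
import random
import torch

def sample_safe_augmentation(x, y, model, transforms,
                             strength=2.0, thresh=0.1, max_tries=5):
    """Uniformly sample a transform, apply it at increased strength, and reject
    'extreme' versions whose reward falls below a threshold."""
    for _ in range(max_tries):
        t = random.choice(transforms)             # (i) uniform sampling
        x_aug = t(x, magnitude=strength)          # (ii) stronger transform (assumed API)
        with torch.no_grad():
            p = torch.softmax(model(x_aug.unsqueeze(0)), dim=-1)[0, y]
        if p.item() >= thresh:                    # (iii) reject extreme transforms
            return x_aug
    return x                                      # fall back to the clean sample
```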
results: The method quickly generates counterfactual explanations without requiring auxiliary adversarially robust models and focuses on the important, semantic parts of the data. It applies across different learning paradigms and datasets and can provide insights into model errors. Abstract
Counterfactual explanations have emerged as a promising method for elucidating the behavior of opaque black-box models. Recently, several works leveraged pixel-space diffusion models for counterfactual generation. To handle noisy, adversarial gradients during counterfactual generation -- causing unrealistic artifacts or mere adversarial perturbations -- they required either auxiliary adversarially robust models or computationally intensive guidance schemes. However, such requirements limit their applicability, e.g., in scenarios with restricted access to the model's training data. To address these limitations, we introduce Latent Diffusion Counterfactual Explanations (LDCE). LDCE harnesses the capabilities of recent class- or text-conditional foundation latent diffusion models to expedite counterfactual generation and focus on the important, semantic parts of the data. Furthermore, we propose a novel consensus guidance mechanism to filter out noisy, adversarial gradients that are misaligned with the diffusion model's implicit classifier. We demonstrate the versatility of LDCE across a wide spectrum of models trained on diverse datasets with different learning paradigms. Finally, we showcase how LDCE can provide insights into model errors, enhancing our understanding of black-box model behavior.
SC2GAN: Rethinking Entanglement by Self-correcting Correlated GAN Space
results: The study shows that SC$^2$GAN mitigates spurious correlations in GANs and can generate images with infrequent attribute combinations using only small amounts of low-density region samples. Abstract
Generative Adversarial Networks (GANs) can synthesize realistic images, with the learned latent space shown to encode rich semantic information with various interpretable directions. However, due to the unstructured nature of the learned latent space, it inherits the bias from the training data where specific groups of visual attributes that are not causally related tend to appear together, a phenomenon also known as spurious correlations, e.g., age and eyeglasses or women and lipsticks. Consequently, the learned distribution often lacks the proper modelling of the missing examples. The interpolation following editing directions for one attribute could result in entangled changes with other attributes. To address this problem, previous works typically adjust the learned directions to minimize the changes in other attributes, yet they still fail on strongly correlated features. In this work, we study the entanglement issue in both the training data and the learned latent space for the StyleGAN2-FFHQ model. We propose a novel framework SC$^2$GAN that achieves disentanglement by re-projecting low-density latent code samples in the original latent space and correcting the editing directions based on both the high-density and low-density regions. By leveraging the original meaningful directions and semantic region-specific layers, our framework interpolates the original latent codes to generate images with attribute combination that appears infrequently, then inverts these samples back to the original latent space. We apply our framework to pre-existing methods that learn meaningful latent directions and showcase its strong capability to disentangle the attributes with small amounts of low-density region samples added.
Evaluating Explanation Methods for Vision-and-Language Navigation
results: Our experiments yield several valuable findings, including: 1) different explanation methods perform differently across models and datasets; and 2) some explanation methods better capture the information models actually use in their decision-making. Abstract
The ability to navigate robots with natural language instructions in an unknown environment is a crucial step for achieving embodied artificial intelligence (AI). With the improving performance of deep neural models proposed in the field of vision-and-language navigation (VLN), it is equally interesting to know what information the models utilize for their decision-making in the navigation tasks. To understand the inner workings of deep neural models, various explanation methods have been developed for promoting explainable AI (XAI). But they are mostly applied to deep neural models for image or text classification tasks and little work has been done in explaining deep neural models for VLN tasks. In this paper, we address these problems by building quantitative benchmarks to evaluate explanation methods for VLN models in terms of faithfulness. We propose a new erasure-based evaluation pipeline to measure the step-wise textual explanation in the sequential decision-making setting. We evaluate several explanation methods for two representative VLN models on two popular VLN datasets and reveal valuable findings through our experiments.
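The erasure-based pipeline is described only at a high level; as a sketch under assumed interfaces, one can erase instruction tokens in decreasing order of attributed importance and measure how quickly the probability of the originally chosen action decays.

```python
import numpy as np

def erasure_faithfulness(model, instruction_tokens, observation, importances, action):
    """model.action_prob is an assumed interface returning P(action | tokens, obs)."""
    order = np.argsort(importances)[::-1]            # most important tokens first
    base = model.action_prob(instruction_tokens, observation, action)
    tokens, drops = list(instruction_tokens), []
    for idx in order:
        tokens[idx] = "<mask>"                       # erase one more token
        drops.append(base - model.action_prob(tokens, observation, action))
    return float(np.mean(drops))                     # larger drop = more faithful
```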
paper_authors: Lisa Alazraki, Lluis Castrejon, Mostafa Dehghani, Fantine Huot, Jasper Uijlings, Thomas Mensink
for: This paper studies ensembling in the era of Large Vision-Language Models (LVLMs). Ensembling is a classical method for combining different models to increase performance.
methods: The authors consider a wide variety of models for their task: from vanilla LVLMs, to models that include the caption as extra context, to models augmented with Lens-based retrieval of Wikipedia pages. These models are highly complementary, which should make them ideal for ensembling.
results: An oracle experiment shows potential gains from 48.8% accuracy (the best single model) all the way up to 67% (the best possible ensemble), suggesting that creating an ensemble with substantial real gains is a trivial exercise. Abstract
This paper studies ensembling in the era of Large Vision-Language Models (LVLMs). Ensembling is a classical method to combine different models to get increased performance. In the recent work on Encyclopedic-VQA the authors examine a wide variety of models to solve their task: from vanilla LVLMs, to models including the caption as extra context, to models augmented with Lens-based retrieval of Wikipedia pages. Intuitively these models are highly complementary, which should make them ideal for ensembling. Indeed, an oracle experiment shows potential gains from 48.8% accuracy (the best single model) all the way up to 67% (best possible ensemble). So it is a trivial exercise to create an ensemble with substantial real gains. Or is it?
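The oracle number quoted above is straightforward to reproduce: count a question as solved if any ensemble member answers it correctly. A minimal sketch:

```python
def oracle_ensemble_accuracy(per_model_correct):
    """Upper bound used in the oracle experiment: a question counts as solved
    if *any* ensemble member answers it correctly.
    per_model_correct: one boolean list per model, aligned by question."""
    n = len(per_model_correct[0])
    solved = [any(model[i] for model in per_model_correct) for i in range(n)]
    return sum(solved) / n

# e.g. three individually weak but complementary models:
# oracle_ensemble_accuracy([[1,0,1,0], [0,1,0,0], [0,0,0,1]]) -> 1.0
```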
Blind Dates: Examining the Expression of Temporality in Historical Photographs
results: The results show that zero-shot classification is relatively ineffective for image dating, with a bias towards predicting dates in the past. Fine-tuning OpenCLIP improves performance and eliminates the bias. The analysis reveals that images featuring buses, cars, cats, dogs, and people are dated more accurately, suggesting the presence of temporal markers. The study demonstrates the potential of computer vision models for dating images and the importance of fine-tuning; future research should explore applying these findings to color photographs and diverse datasets. Abstract
This paper explores the capacity of computer vision models to discern temporal information in visual content, focusing specifically on historical photographs. We investigate the dating of images using OpenCLIP, an open-source implementation of CLIP, a multi-modal language and vision model. Our experiment consists of three steps: zero-shot classification, fine-tuning, and analysis of visual content. We use the \textit{De Boer Scene Detection} dataset, containing 39,866 gray-scale historical press photographs from 1950 to 1999. The results show that zero-shot classification is relatively ineffective for image dating, with a bias towards predicting dates in the past. Fine-tuning OpenCLIP with a logistic classifier improves performance and eliminates the bias. Additionally, our analysis reveals that images featuring buses, cars, cats, dogs, and people are more accurately dated, suggesting the presence of temporal markers. The study highlights the potential of machine learning models like OpenCLIP in dating images and emphasizes the importance of fine-tuning for accurate temporal analysis. Future research should explore the application of these findings to color photographs and diverse datasets.
EViT: An Eagle Vision Transformer with Bi-Fovea Self-Attention
results: Experimental results show that on various computer vision tasks, including image classification, object detection, and instance segmentation, the proposed Eagle Vision Transformers (EViTs) compare favorably with baselines and exhibit higher speed on graphics processing units at the same model size. Abstract
Thanks to the advancement of deep learning technology, vision transformer has demonstrated competitive performance in various computer vision tasks. Unfortunately, vision transformer still faces some challenges such as high computational complexity and absence of desirable inductive bias. To alleviate these problems, a novel Bi-Fovea Self-Attention (BFSA) is proposed, inspired by the physiological structure and characteristics of bi-fovea vision in eagle eyes. This BFSA can simulate the shallow fovea and deep fovea functions of eagle vision, enable the network to extract feature representations of targets from coarse to fine, facilitate the interaction of multi-scale feature representations. Additionally, a Bionic Eagle Vision (BEV) block based on BFSA is designed in this study. It combines the advantages of CNNs and Vision Transformers to enhance the ability of global and local feature representations of networks. Furthermore, a unified and efficient general pyramid backbone network family is developed by stacking the BEV blocks in this study, called Eagle Vision Transformers (EViTs). Experimental results on various computer vision tasks including image classification, object detection, instance segmentation and other transfer learning tasks show that the proposed EViTs perform effectively by comparing with the baselines under same model size and exhibit higher speed on graphics processing unit than other models. Code is available at https://github.com/nkusyl/EViT.
methods: The method uses the deep-learning inverse-problem solver vSHARP and optimizes in both the image and k-space domains for high reconstruction fidelity.
results: The method improves reconstructed image quality and the accuracy of multi-contrast (T1 and T2) mapping in dynamic cardiac imaging, and generalizes across different undersampling schemes. Abstract
Cardiac magnetic resonance imaging is a valuable non-invasive tool for identifying cardiovascular diseases. For instance, Cine MRI is the benchmark modality for assessing the cardiac function and anatomy. On the other hand, multi-contrast (T1 and T2) mapping has the potential to assess pathologies and abnormalities in the myocardium and interstitium. However, voluntary breath-holding and often arrhythmia, in combination with MRI's slow imaging speed, can lead to motion artifacts, hindering real-time acquisition image quality. Although performing accelerated acquisitions can facilitate dynamic imaging, it induces aliasing, causing low reconstructed image quality in Cine MRI and inaccurate T1 and T2 mapping estimation. In this work, inspired by related work in accelerated MRI reconstruction, we present a deep learning (DL)-based method for accelerated cine and multi-contrast reconstruction in the context of dynamic cardiac imaging. We formulate the reconstruction problem as a least squares regularized optimization task, and employ vSHARP, a state-of-the-art DL-based inverse problem solver, which incorporates half-quadratic variable splitting and the alternating direction method of multipliers with neural networks. We treat the problem in two setups; a 2D reconstruction and a 2D dynamic reconstruction task, and employ 2D and 3D deep learning networks, respectively. Our method optimizes in both the image and k-space domains, allowing for high reconstruction fidelity. Although the target data is undersampled with a Cartesian equispaced scheme, we train our model using both Cartesian and simulated non-Cartesian undersampling schemes to enhance generalization of the model to unseen data. Furthermore, our model adopts a deep neural network to learn and refine the sensitivity maps of multi-coil k-space data. Lastly, our method is jointly trained on both, undersampled cine and multi-contrast data.
Pi-DUAL: Using Privileged Information to Distinguish Clean from Noisy Labels
paper_authors: Ke Wang, Guillermo Ortiz-Jimenez, Rodolphe Jenatton, Mark Collier, Efi Kokiopoulou, Pascal Frossard
for: mitigate the effects of label noise in deep learning models
methods: leveraging privileged information (PI) to distinguish clean from wrong labels
results: significant performance improvements on key PI benchmarks, establishing a new state-of-the-art test set accuracy, and effective at identifying noisy samples post-training
Abstract
Label noise is a pervasive problem in deep learning that often compromises the generalization performance of trained models. Recently, leveraging privileged information (PI) -- information available only during training but not at test time -- has emerged as an effective approach to mitigate this issue. Yet, existing PI-based methods have failed to consistently outperform their no-PI counterparts in terms of preventing overfitting to label noise. To address this deficiency, we introduce Pi-DUAL, an architecture designed to harness PI to distinguish clean from wrong labels. Pi-DUAL decomposes the output logits into a prediction term, based on conventional input features, and a noise-fitting term influenced solely by PI. A gating mechanism steered by PI adaptively shifts focus between these terms, allowing the model to implicitly separate the learning paths of clean and wrong labels. Empirically, Pi-DUAL achieves significant performance improvements on key PI benchmarks (e.g., +6.8% on ImageNet-PI), establishing a new state-of-the-art test set accuracy. Additionally, Pi-DUAL is a potent method for identifying noisy samples post-training, outperforming other strong methods at this task. Overall, Pi-DUAL is a simple, scalable and practical approach for mitigating the effects of label noise in a variety of real-world scenarios with PI.
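The logit decomposition at the heart of Pi-DUAL is easy to sketch. The module below is a minimal illustration with hypothetical layer shapes; the paper's actual architecture and gating details may differ.

```python
import torch
import torch.nn as nn

class PiDualHead(nn.Module):
    """Minimal sketch of the Pi-DUAL output decomposition (layer shapes
    and module names are our assumptions, not the paper's code).

    The logits split into a prediction term computed from conventional
    input features and a noise-fitting term computed only from
    privileged information (PI); a PI-driven gate mixes the two.
    """
    def __init__(self, feat_dim, pi_dim, num_classes):
        super().__init__()
        self.predict = nn.Linear(feat_dim, num_classes)  # clean-label path
        self.noise = nn.Linear(pi_dim, num_classes)      # noise-fitting path
        self.gate = nn.Linear(pi_dim, 1)                 # PI-steered gate

    def forward(self, feats, pi):
        g = torch.sigmoid(self.gate(pi))                 # gate in (0, 1)
        return self.predict(feats) + g * self.noise(pi)

head = PiDualHead(feat_dim=512, pi_dim=32, num_classes=10)
logits = head(torch.randn(4, 512), torch.randn(4, 32))  # shape (4, 10)
```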
REVO-LION: Evaluating and Refining Vision-Language Instruction Tuning Datasets
results: Experiments show that the proposed evaluation method is effective and can guide the construction of a dataset for developing an all-powerful instruction-tuned model. Specifically, using the REVO-LION dataset, which collects higher-quality samples from each source dataset, a model trained on only half of the full data achieves performance comparable to simply adding all VLIT datasets together.
Abstract
There is an emerging line of research on multimodal instruction tuning, and a series of benchmarks have recently been proposed for evaluating these models. Instead of evaluating the models directly, in this paper we evaluate the Vision-Language Instruction-Tuning (VLIT) datasets themselves and further seek a way of building a dataset for developing an all-powerful VLIT model, which we believe could also help establish a grounded protocol for benchmarking VLIT models. Since effective analysis of VLIT datasets remains an open question, we propose a tune-cross-evaluation paradigm: tuning on one dataset and evaluating on the others in turn. For each single tune-evaluation experiment set, we define the Meta Quality (MQ) as the mean score measured by a series of caption metrics, including BLEU, METEOR, and ROUGE-L, to quantify the quality of a certain dataset or sample. On this basis, to evaluate the comprehensiveness of a dataset, we develop the Dataset Quality (DQ) covering all tune-evaluation sets. To lay the foundation for building a comprehensive dataset and developing an all-powerful model for practical applications, we further define the Sample Quality (SQ) to quantify the all-sided quality of each sample. Extensive experiments validate the rationality of the proposed evaluation paradigm. Based on the holistic evaluation, we build a new dataset, REVO-LION (REfining VisiOn-Language InstructiOn tuNing), by collecting samples with higher SQ from each dataset. With only half of the full data, the model trained on REVO-LION achieves performance comparable to simply adding all VLIT datasets up. In addition to supporting the development of an all-powerful model, REVO-LION also includes an evaluation set, which is expected to serve as a convenient benchmark for future research.
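The Meta Quality and Dataset Quality aggregations can be sketched in a few lines. The scorers are pluggable callables here, with a toy unigram-overlap metric standing in for BLEU/METEOR/ROUGE-L, so the exact metric implementations are assumptions.

```python
from statistics import mean

def meta_quality(candidates, references, metrics):
    """MQ of one tune-evaluation experiment: the mean score over a set
    of caption metrics (the paper uses BLEU, METEOR and ROUGE-L; here
    `metrics` is any list of callables (cand, ref) -> float).
    """
    per_metric = [
        mean(m(c, r) for c, r in zip(candidates, references))
        for m in metrics
    ]
    return mean(per_metric)

def dataset_quality(tune_eval_mqs):
    """DQ aggregates MQ over all tune-evaluation sets for one dataset."""
    return mean(tune_eval_mqs)

# toy unigram-overlap metric standing in for a real caption scorer
def overlap(cand, ref):
    c, r = set(cand.split()), set(ref.split())
    return len(c & r) / max(len(r), 1)

mq = meta_quality(["a cat on a mat"], ["a cat sits on the mat"], [overlap])
```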
Hierarchical Mask2Former: Panoptic Segmentation of Crops, Weeds and Leaves
results: The method achieves a PQ{\dag} of 75.99; the paper also explores ways to reduce computation and inference time, making the core algorithm up to 60% faster with a drop in PQ{\dag} of less than 1%.
Abstract
Advancements in machine vision that enable detailed inferences to be made from images have the potential to transform many sectors including agriculture. Precision agriculture, where data analysis enables interventions to be precisely targeted, has many possible applications. Precision spraying, for example, can limit the application of herbicide only to weeds, or limit the application of fertiliser only to undernourished crops, instead of spraying the entire field. The approach promises to maximise yields, whilst minimising resource use and harms to the surrounding environment. To this end, we propose a hierarchical panoptic segmentation method to simultaneously identify indicators of plant growth and locate weeds within an image. We adapt Mask2Former, a state-of-the-art architecture for panoptic segmentation, to predict crop, weed and leaf masks. We achieve a PQ{\dag} of 75.99. Additionally, we explore approaches to make the architecture more compact and therefore more suitable for time and compute constrained applications. With our more compact architecture, inference is up to 60% faster and the reduction in PQ{\dag} is less than 1%.
Energy-Efficient Visual Search by Eye Movement and Low-Latency Spiking Neural Network
results: The model can learn a human-like or a near-optimal fixation strategy, outperforms humans in search speed and accuracy, and achieves high energy efficiency through short saccade-decision latency and sparse activation. These results highlight the value of modeling the human visual system for both neuroscience and machine learning, and point toward more energy-efficient computer vision algorithms.
Abstract
Human vision incorporates non-uniform resolution retina, efficient eye movement strategy, and spiking neural network (SNN) to balance the requirements in visual field size, visual resolution, energy cost, and inference latency. These properties have inspired interest in developing human-like computer vision. However, existing models haven't fully incorporated the three features of human vision, and their learned eye movement strategies haven't been compared with human's strategy, making the models' behavior difficult to interpret. Here, we carry out experiments to examine human visual search behaviors and establish the first SNN-based visual search model. The model combines an artificial retina with spiking feature extraction, memory, and saccade decision modules, and it employs population coding for fast and efficient saccade decisions. The model can learn either a human-like or a near-optimal fixation strategy, outperform humans in search speed and accuracy, and achieve high energy efficiency through short saccade decision latency and sparse activation. It also suggests that the human search strategy is suboptimal in terms of search speed. Our work connects modeling of vision in neuroscience and machine learning and sheds light on developing more energy-efficient computer vision algorithms.
SketchBodyNet: A Sketch-Driven Multi-faceted Decoder Network for 3D Human Reconstruction
results: The method achieves superior performance in reconstructing 3D human meshes from freehand sketches.
Abstract
Reconstructing 3D human shapes from 2D images has received increasing attention recently due to its fundamental support for many high-level 3D applications. Compared with natural images, freehand sketches are much more flexible for depicting various shapes, providing a promising and valuable route for 3D human reconstruction. However, such a task is highly challenging. The sparse, abstract characteristics of sketches add severe difficulties, such as arbitrariness, inaccuracy, and a lack of image detail, to the already badly ill-posed problem of 2D-to-3D reconstruction. Although current methods have achieved great success in reconstructing 3D human bodies from a single-view image, they do not work well on freehand sketches. In this paper, we propose a novel sketch-driven multi-faceted decoder network termed SketchBodyNet to address this task. Specifically, the network consists of a backbone and three separate attention decoder branches, where a multi-head self-attention module is exploited in each decoder to obtain enhanced features, followed by a multi-layer perceptron. The multi-faceted decoders aim to predict the camera, shape, and pose parameters, respectively, which are then associated with the SMPL model to reconstruct the corresponding 3D human mesh. In learning, existing 3D meshes are projected via the camera parameters into 2D synthetic sketches with joints, which are combined with the freehand sketches to optimize the model. To verify our method, we collect a large-scale dataset of about 26k freehand sketches and their corresponding 3D meshes containing various poses of human bodies from 14 different angles. Extensive experimental results demonstrate our SketchBodyNet achieves superior performance in reconstructing 3D human meshes from freehand sketches.
Efficient Retrieval of Images with Irregular Patterns using Morphological Image Analysis: Applications to Industrial and Healthcare datasets
results: The paper evaluates retrieval performance with several feature extraction methods and distance metrics; the combination of DefChars and the Manhattan distance achieves a mean average precision of 80% with a low standard deviation of 0.09, outperforming the alternatives across datasets.
Abstract
Image retrieval is the process of searching and retrieving images from a database based on their visual content and features. Recently, much attention has been directed towards the retrieval of irregular patterns within industrial or medical images by extracting features from the images, such as deep features, colour-based features, shape-based features and local features. This has applications across a spectrum of industries, including fault inspection, disease diagnosis, and maintenance prediction. This paper proposes an image retrieval framework to search for images containing similar irregular patterns by extracting a set of morphological features (DefChars) from images; the datasets employed in this paper contain wind turbine blade images with defects, chest computerised tomography scans with COVID-19 infection, heatsink images with defects, and lake ice images. The proposed framework was evaluated with different feature extraction methods (DefChars, resized raw image, local binary pattern, and scale-invariant feature transforms) and distance metrics to determine the most efficient parameters in terms of retrieval performance across datasets. The retrieval results show that the proposed framework using the DefChars and the Manhattan distance metric achieves a mean average precision of 80% and a low standard deviation of 0.09 across classes of irregular patterns, outperforming alternative feature-metric combinations across all datasets. Furthermore, the low standard deviation between each class highlights DefChars' capability for a reliable image retrieval task, even in the presence of class imbalances or small-sized datasets.
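A minimal sketch of the retrieval-and-scoring loop described above, assuming DefChars are plain feature vectors; the `retrieve` and `average_precision` helpers are ours, not the paper's code.

```python
import numpy as np

def retrieve(query, gallery, k=10):
    """Rank gallery feature vectors (e.g., DefChars) by Manhattan (L1)
    distance to the query and return the indices of the k nearest."""
    d = np.abs(gallery - query).sum(axis=1)   # L1 distance per image
    return np.argsort(d)[:k]

def average_precision(ranked_labels, query_label):
    """AP of one ranked list; averaging over queries gives the mAP
    reported in the paper."""
    hits, precisions = 0, []
    for i, lab in enumerate(ranked_labels, start=1):
        if lab == query_label:
            hits += 1
            precisions.append(hits / i)
    return float(np.mean(precisions)) if precisions else 0.0

gallery = np.random.rand(100, 32)             # 100 images, 32-dim features
labels = np.random.randint(0, 4, size=100)
idx = retrieve(gallery[0], gallery, k=10)
ap = average_precision(labels[idx], labels[0])
```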
Compositional Representation Learning for Brain Tumour Segmentation
paper_authors: Xiao Liu, Antanas Kascenas, Hannah Watson, Sotirios A. Tsaftaris, Alison Q. O’Neil
for: This paper aims to address data scarcity in brain tumour segmentation by learning robust compositional representations with a mixed supervision framework that combines unsupervised learning with weak supervision.
methods: The paper adapts the mixed supervision framework vMFNet, learning robust compositional representations through unsupervised learning and weak supervision alongside non-exhaustive pixel-level pathology labels; it uses the BraTS dataset to simulate 2-point expert pathology annotations marking the top and bottom slice of the tumour in each MRI volume, from which weak image-level labels are constructed.
results: The paper shows that good tumour segmentation performance can be achieved with a large amount of weakly labelled data and only a small amount of fully annotated data, and that anatomical structures emerge in the compositional representation even when supervision relates only to pathology (the tumour).
Abstract
For brain tumour segmentation, deep learning models can achieve human expert-level performance given a large amount of data and pixel-level annotations. However, the expensive exercise of obtaining pixel-level annotations for large amounts of data is not always feasible, and performance is often heavily reduced in a low-annotated data regime. To tackle this challenge, we adapt a mixed supervision framework, vMFNet, to learn robust compositional representations using unsupervised learning and weak supervision alongside non-exhaustive pixel-level pathology labels. In particular, we use the BraTS dataset to simulate a collection of 2-point expert pathology annotations indicating the top and bottom slice of the tumour (or tumour sub-regions: peritumoural edema, GD-enhancing tumour, and the necrotic / non-enhancing tumour) in each MRI volume, from which weak image-level labels that indicate the presence or absence of the tumour (or the tumour sub-regions) in the image are constructed. Then, vMFNet models the encoded image features with von-Mises-Fisher (vMF) distributions, via learnable and compositional vMF kernels which capture information about structures in the images. We show that good tumour segmentation performance can be achieved with a large amount of weakly labelled data but only a small amount of fully-annotated data. Interestingly, emergent learning of anatomical structures occurs in the compositional representation even given only supervision relating to pathology (tumour).
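Constructing the weak image-level labels from the simulated 2-point annotations reduces to marking the slices between the two annotated ones; a minimal sketch (helper name and conventions are ours, not the paper's code):

```python
import numpy as np

def weak_labels_from_two_points(num_slices, top, bottom):
    """Turn a simulated 2-point expert annotation (top and bottom slice
    of the tumour, or a tumour sub-region, in one MRI volume) into weak
    image-level labels: 1 if the slice lies between the two marked
    slices, else 0.
    """
    labels = np.zeros(num_slices, dtype=int)
    lo, hi = sorted((top, bottom))
    labels[lo:hi + 1] = 1        # tumour (or sub-region) present
    return labels

print(weak_labels_from_two_points(10, top=6, bottom=3))
# -> [0 0 0 1 1 1 1 0 0 0]
```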
Data efficient deep learning for medical image analysis: A survey
results: This paper systematically reviews data-efficient deep learning methods commonly used in medical image analysis and discusses future research directions.
Abstract
The rapid evolution of deep learning has significantly advanced the field of medical image analysis. However, despite these achievements, the further enhancement of deep learning models for medical image analysis faces a significant challenge due to the scarcity of large, well-annotated datasets. To address this issue, recent years have witnessed a growing emphasis on the development of data-efficient deep learning methods. This paper conducts a thorough review of data-efficient deep learning methods for medical image analysis. To this end, we categorize these methods based on the level of supervision they rely on, encompassing categories such as no supervision, inexact supervision, incomplete supervision, inaccurate supervision, and only limited supervision. We further divide these categories into finer subcategories. For example, we categorize inexact supervision into multiple instance learning and learning with weak annotations. Similarly, we categorize incomplete supervision into semi-supervised learning, active learning, and domain-adaptive learning and so on. Furthermore, we systematically summarize commonly used datasets for data efficient deep learning in medical image analysis and investigate future research directions to conclude this survey.
Be Careful What You Smooth For: Label Smoothing Can Be a Privacy Shield but Also a Catalyst for Model Inversion Attacks
paper_authors: Lukas Struppek, Dominik Hintersdorf, Kristian Kersting
for: This paper investigates the impact of label smoothing on a model's privacy leakage.
methods: The authors probe model leakage under label smoothing via model inversion attacks and compare the effects of different smoothing variants.
results: The study finds that traditional label smoothing increases a model's privacy leakage, whereas smoothing with negative factors impedes the extraction of class-related information and thereby protects the model's privacy.
Label smoothing -- using softened labels instead of hard ones -- is a widely adopted regularization method for deep learning, showing diverse benefits such as enhanced generalization and calibration. Its implications for preserving model privacy, however, have remained unexplored. To fill this gap, we investigate the impact of label smoothing on model inversion attacks (MIAs), which aim to generate class-representative samples by exploiting the knowledge encoded in a classifier, thereby inferring sensitive information about its training data. Through extensive analyses, we uncover that traditional label smoothing fosters MIAs, thereby increasing a model's privacy leakage. Even more, we reveal that smoothing with negative factors counters this trend, impeding the extraction of class-related information and leading to privacy preservation, beating state-of-the-art defenses. This establishes a practical and powerful novel way for enhancing model resilience against MIAs.
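The positive- vs. negative-factor smoothing contrast is easy to reproduce. Below is a minimal sketch of cross-entropy against smoothed labels where the smoothing factor may be negative; the paper's full training recipe is not reproduced here.

```python
import torch
import torch.nn.functional as F

def smoothed_cross_entropy(logits, targets, alpha):
    """Cross-entropy against smoothed labels. alpha > 0 is standard
    label smoothing (which the paper finds *increases* MIA leakage);
    alpha < 0 is the negative smoothing the paper studies as a defence.
    """
    k = logits.size(-1)
    one_hot = F.one_hot(targets, k).float()
    soft = (1.0 - alpha) * one_hot + alpha / k     # negative alpha allowed
    return -(soft * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

logits = torch.randn(8, 10)
targets = torch.randint(0, 10, (8,))
loss_pos = smoothed_cross_entropy(logits, targets, alpha=0.1)    # standard
loss_neg = smoothed_cross_entropy(logits, targets, alpha=-0.05)  # defence
```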
Perceptual MAE for Image Manipulation Localization: A High-level Vision Learner Focusing on Low-level Features
for: This paper aims to improve the performance of Image Manipulation Localization (IML) tasks by incorporating both high-level and low-level features.
methods: The proposed method, Perceptual MAE (PMAE), enhances the Masked Autoencoder (MAE) with high-resolution inputs and a perceptual loss supervision module to better capture both object semantics and pixel-level features.
results: Extensive experiments show that PMAE outperforms state-of-the-art tampering localization methods on all five publicly available datasets, demonstrating the effectiveness of integrating high-level and low-level features in IML tasks.
Abstract
Nowadays, multimedia forensics faces unprecedented challenges due to the rapid advancement of multimedia generation technology, making Image Manipulation Localization (IML) crucial in the pursuit of truth. The key to IML lies in revealing the artifacts or inconsistencies between tampered and authentic areas, which are evident in pixel-level features. Consequently, existing studies treat IML as a low-level vision task, allocating tampered masks by crafting pixel-level features such as image RGB noise, edge signals, or high-frequency features. However, in practice, tampering commonly occurs at the object level, and different classes of objects have varying likelihoods of becoming targets of tampering. Therefore, object semantics are also vital for identifying tampered areas, in addition to pixel-level features. This requires IML models to carry out a semantic understanding of the entire image. In this paper, we reformulate the IML task as a high-level vision task that greatly benefits from low-level features. Based on this interpretation, we propose a method to enhance the Masked Autoencoder (MAE) by incorporating high-resolution inputs and a perceptual loss supervision module, termed Perceptual MAE (PMAE). While MAE has demonstrated an impressive understanding of object semantics, PMAE can also compensate for low-level semantics with our proposed enhancements. As evidenced by extensive experiments, this paradigm effectively unites the low-level and high-level features of the IML task and outperforms state-of-the-art tampering localization methods on all five publicly available datasets.
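A common way to realize the perceptual supervision described above is an L2 distance between frozen VGG feature maps; the sketch below is that generic formulation (layer choice, weighting, and input normalization are assumptions, not PMAE's exact module).

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class PerceptualLoss(nn.Module):
    """Generic perceptual loss: L2 distance between frozen VGG-16
    feature maps of a reconstruction and its target. Inputs are assumed
    to be (B, 3, H, W) tensors; ImageNet normalization is omitted for
    brevity.
    """
    def __init__(self, layer=16):
        super().__init__()
        self.features = vgg16(weights="DEFAULT").features[:layer].eval()
        for p in self.features.parameters():
            p.requires_grad = False    # keep the feature extractor frozen

    def forward(self, recon, target):
        return torch.mean((self.features(recon) - self.features(target)) ** 2)

loss_fn = PerceptualLoss()
loss = loss_fn(torch.rand(2, 3, 224, 224), torch.rand(2, 3, 224, 224))
```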
Watt For What: Rethinking Deep Learning’s Energy-Performance Relationship
results: The results show that smaller, more energy-efficient models can achieve strong accuracy per unit of electricity consumed while substantially reducing power draw. The study also quantifies the accuracy-electricity trade-off across different GPUs. These findings indicate that optimizing model design and training for efficiency can cut electricity consumption and foster a fairer research landscape.
Abstract
Deep learning models have revolutionized various fields, from image recognition to natural language processing, by achieving unprecedented levels of accuracy. However, their increasing energy consumption has raised concerns about their environmental impact, disadvantaging smaller entities in research and exacerbating global energy consumption. In this paper, we explore the trade-off between model accuracy and electricity consumption, proposing a metric that penalizes large consumption of electricity. We conduct a comprehensive study on the electricity consumption of various deep learning models across different GPUs, presenting a detailed analysis of their accuracy-efficiency trade-offs. By evaluating accuracy per unit of electricity consumed, we demonstrate how smaller, more energy-efficient models can significantly expedite research while mitigating environmental concerns. Our results highlight the potential for a more sustainable approach to deep learning, emphasizing the importance of optimizing models for efficiency. This research also contributes to a more equitable research landscape, where smaller entities can compete effectively with larger counterparts. This advocates for the adoption of efficient deep learning practices to reduce electricity consumption, safeguarding the environment for future generations whilst also helping ensure a fairer competitive landscape.
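The core idea of penalizing large electricity consumption can be captured by an accuracy-per-energy score; the exact functional form the paper uses may differ, so the helper below is only illustrative.

```python
def accuracy_per_kwh(accuracy, kwh):
    """Efficiency score in the spirit of the paper's metric: accuracy
    achieved per unit of electricity consumed, so heavy consumers are
    penalized.
    """
    return accuracy / kwh

# two hypothetical models: a large one and a small efficient one
print(accuracy_per_kwh(accuracy=0.84, kwh=120.0))  # ~0.007
print(accuracy_per_kwh(accuracy=0.81, kwh=15.0))   # ~0.054 -> wins on efficiency
```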
Deep Learning for Automatic Detection and Facial Recognition in Japanese Macaques: Illuminating Social Networks
results: Preliminary results show that deep learning can successfully detect and identify the faces of Japanese macaques and support building a reliable social network representation of the studied population. The researchers also built a co-occurrence-based social network of the population from video footage, against which the automatically generated network can be compared to assess its reliability.
Abstract
Individual identification plays a pivotal role in ecology and ethology, notably as a tool for understanding complex social structures. However, traditional identification methods often involve invasive physical tags and can prove both disruptive for animals and time-intensive for researchers. In recent years, the integration of deep learning in research has offered new methodological perspectives through the automatization of complex tasks. Researchers increasingly harness object detection and recognition technologies to achieve identification on video footage. This study is a preliminary exploration of a non-invasive tool for face detection and individual identification of Japanese macaques (Macaca fuscata) through deep learning. The ultimate goal of this research is, using identifications done on the dataset, to automatically generate a social network representation of the studied population. The current main results are promising: (i) a Japanese macaque face detector (Faster-RCNN model) reaching 82.2% accuracy, and (ii) an individual recognizer for the Kōjima island macaque population (YOLOv8n model) reaching 83% accuracy. We also created a Kōjima population social network by traditional methods, based on co-occurrences in videos, providing a benchmark against which the automatically generated network will be assessed for reliability. These preliminary results are a testament to the potential of this innovative approach to provide the scientific community with a tool for tracking individuals and studying social networks in Japanese macaques.
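The traditional co-occurrence network used as the benchmark can be sketched directly from per-frame identifications; the data format below is an assumption.

```python
import networkx as nx
from itertools import combinations

def cooccurrence_network(ids_per_frame):
    """Build a social network from co-occurrences: individuals detected
    in the same video frame get an edge whose weight counts the number
    of shared frames.
    """
    g = nx.Graph()
    for ids in ids_per_frame:                  # e.g. {"A", "B", "C"}
        for a, b in combinations(sorted(ids), 2):
            if g.has_edge(a, b):
                g[a][b]["weight"] += 1
            else:
                g.add_edge(a, b, weight=1)
    return g

g = cooccurrence_network([{"A", "B"}, {"A", "B", "C"}, {"B", "C"}])
print(g["A"]["B"]["weight"])   # 2
```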
Focus on Local Regions for Query-based Object Detection
results: Experimental results show that FoLR achieves state-of-the-art performance among query-based detectors, converging faster and running more efficiently than conventional methods.
Abstract
Query-based methods have garnered significant attention in object detection since the advent of DETR, the pioneering end-to-end query-based detector. However, these methods face challenges like slow convergence and suboptimal performance. Notably, self-attention in object detection often hampers convergence due to its global focus. To address these issues, we propose FoLR, a transformer-like architecture with only decoders. We enhance the self-attention mechanism by isolating connections between irrelevant objects that makes it focus on local regions but not global regions. We also design the adaptive sampling method to extract effective features based on queries' local regions from feature maps. Additionally, we employ a look-back strategy for decoders to retain prior information, followed by the Feature Mixer module to fuse features and queries. Experimental results demonstrate FoLR's state-of-the-art performance in query-based detectors, excelling in convergence speed and computational efficiency.
A Geometrical Approach to Evaluate the Adversarial Robustness of Deep Neural Networks
paper_authors: Yang Wang, Bo Dong, Ke Xu, Haiyin Piao, Yufei Ding, Baocai Yin, Xin Yang
for: The paper proposes a new metric for evaluating the adversarial robustness of deep neural networks (DNNs) against different types of attacks on large-scale datasets.
methods: The proposed metric, the Adversarial Converging Time Score (ACTS), measures the time it takes an attacker to find an adversarial example for a specific input. ACTS builds on the observation that local neighborhoods on a DNN's output surface have different shapes for different inputs, so the converging time to an adversarial sample varies with the input.
results: ACTS is validated on the large-scale ImageNet dataset with state-of-the-art deep networks and proves more efficient and effective than the earlier CLEVER metric; extensive experiments demonstrate its effectiveness and generalization across different adversarial attacks.
Deep Neural Networks (DNNs) are widely used for computer vision tasks. However, it has been shown that deep models are vulnerable to adversarial attacks, i.e., their performances drop when imperceptible perturbations are made to the original inputs, which may further degrade the following visual tasks or introduce new problems such as data and privacy security. Hence, metrics for evaluating the robustness of deep models against adversarial attacks are desired. However, previous metrics are mainly proposed for evaluating the adversarial robustness of shallow networks on the small-scale datasets. Although the Cross Lipschitz Extreme Value for nEtwork Robustness (CLEVER) metric has been proposed for large-scale datasets (e.g., the ImageNet dataset), it is computationally expensive and its performance relies on a tractable number of samples. In this paper, we propose the Adversarial Converging Time Score (ACTS), an attack-dependent metric that quantifies the adversarial robustness of a DNN on a specific input. Our key observation is that local neighborhoods on a DNN's output surface would have different shapes given different inputs. Hence, given different inputs, it requires different time for converging to an adversarial sample. Based on this geometry meaning, ACTS measures the converging time as an adversarial robustness metric. We validate the effectiveness and generalization of the proposed ACTS metric against different adversarial attacks on the large-scale ImageNet dataset using state-of-the-art deep networks. Extensive experiments show that our ACTS metric is an efficient and effective adversarial metric over the previous CLEVER metric.
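A rough way to measure an ACTS-style converging time is to count attack iterations until misclassification; the sketch below uses PGD as a stand-in attack on a single input (the paper's exact attack setup and definition are more general).

```python
import torch
import torch.nn.functional as F

def converging_time(model, x, y, eps=8/255, step=2/255, max_steps=100):
    """Count how many PGD iterations are needed before an adversarial
    example is found for a single input (x: (1, C, H, W), y: (1,)).
    A shorter converging time indicates a less robust input region.
    """
    x_adv = x.clone()
    for t in range(1, max_steps + 1):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        # signed-gradient ascent step, projected into the eps-ball
        x_adv = (x_adv + step * grad.sign()).clamp(x - eps, x + eps).clamp(0, 1)
        if model(x_adv).argmax(dim=1).item() != y.item():
            return t                    # attack converged at step t
    return max_steps                    # no adversarial example found
```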
Solution for SMART-101 Challenge of ICCV Multi-modal Algorithmic Reasoning Task 2023
paper_authors: Xiangyu Wu, Yang Yang, Shengdong Xu, Yifeng Wu, Qingguo Chen, Jianfeng Lu
for: This paper presents a solution to the Multi-modal Algorithmic Reasoning Task: SMART-101 Challenge, which evaluates the ability of neural networks to solve visuolinguistic puzzles for children aged 6-8.
methods: The authors employed a divide-and-conquer approach, categorizing questions into eight types and using the llama-2-chat model to generate question types in a zero-shot manner. They also trained a yolov7 model on the icon45 dataset for object detection and combined it with OCR to recognize objects and text within images. Additionally, they used the BLIP-2 model with eight adapters to adaptively extract visual features for different question types.
results: Under the puzzle splits configuration, the authors achieved an accuracy score of 26.5 on the validation set and 24.30 on the private test set.
Abstract
In this paper, we present our solution to a Multi-modal Algorithmic Reasoning Task: SMART-101 Challenge. Different from the traditional visual question-answering datasets, this challenge evaluates the abstraction, deduction, and generalization abilities of neural networks in solving visuolinguistic puzzles designed specifically for children in the 6-8 age group. We employed a divide-and-conquer approach. At the data level, inspired by the challenge paper, we categorized the whole questions into eight types and utilized the llama-2-chat model to directly generate the type for each question in a zero-shot manner. Additionally, we trained a yolov7 model on the icon45 dataset for object detection and combined it with the OCR method to recognize and locate objects and text within the images. At the model level, we utilized the BLIP-2 model and added eight adapters to the image encoder VIT-G to adaptively extract visual features for different question types. We fed the pre-constructed question templates as input and generated answers using the flan-t5-xxl decoder. Under the puzzle splits configuration, we achieved an accuracy score of 26.5 on the validation set and 24.30 on the private test set.
The Solution for the CVPR2023 NICE Image Captioning Challenge
results: The method ranked first on the leaderboard, achieving CIDEr scores of 105.17 and 325.72 in the validation and test phases, respectively.
Abstract
In this paper, we present our solution to the New frontiers for Zero-shot Image Captioning Challenge. Different from the traditional image captioning datasets, this challenge includes a larger new variety of visual concepts from many domains (such as COVID-19) as well as various image types (photographs, illustrations, graphics). For the data level, we collect external training data from Laion-5B, a large-scale CLIP-filtered image-text dataset. For the model level, we use OFA, a large-scale visual-language pre-training model based on handcrafted templates, to perform the image captioning task. In addition, we introduce contrastive learning to align image-text pairs to learn new visual concepts in the pre-training stage. Then, we propose a similarity-bucket strategy and incorporate this strategy into the template to force the model to generate higher quality and more matching captions. Finally, by retrieval-augmented strategy, we construct a content-rich template, containing the most relevant top-k captions from other image-text pairs, to guide the model in generating semantic-rich captions. Our method ranks first on the leaderboard, achieving 105.17 and 325.72 Cider-Score in the validation and test phase, respectively.
Skeleton Ground Truth Extraction: Methodology, Annotation Tool and Benchmarks
results: Experiments show that the GTs generated by the proposed method exhibit good consistency with respect to GT standards and provide a balance between simplicity and completeness.
Abstract
Skeleton Ground Truth (GT) is critical to the success of supervised skeleton extraction methods, especially with the popularity of deep learning techniques. Furthermore, we see skeleton GTs used not only for training skeleton detectors with Convolutional Neural Networks (CNN) but also for evaluating skeleton-related pruning and matching algorithms. However, most existing shape and image datasets suffer from the lack of skeleton GT and inconsistency of GT standards. As a result, it is difficult to evaluate and reproduce CNN-based skeleton detectors and algorithms on a fair basis. In this paper, we present a heuristic strategy for object skeleton GT extraction in binary shapes and natural images. Our strategy is built on an extended theory of diagnosticity hypothesis, which enables encoding human-in-the-loop GT extraction based on clues from the target's context, simplicity, and completeness. Using this strategy, we developed a tool, SkeView, to generate skeleton GT of 17 existing shape and image datasets. The GTs are then structurally evaluated with representative methods to build viable baselines for fair comparisons. Experiments demonstrate that GTs generated by our strategy yield promising quality with respect to standard consistency, and also provide a balance between simplicity and completeness.
Conformal Prediction for Deep Classifier via Label Ranking
results: Experiments and theoretical analysis show that SAPS produces smaller prediction sets while retaining instance-level uncertainty information, and it broadly improves the conditional coverage rate and adaptiveness of prediction sets.
Abstract
Conformal prediction is a statistical framework that generates prediction sets containing ground-truth labels with a desired coverage guarantee. The predicted probabilities produced by machine learning models are generally miscalibrated, leading to large prediction sets in conformal prediction. In this paper, we empirically and theoretically show that disregarding the probability values mitigates the undesirable effect of miscalibration. We then propose a novel algorithm named $\textit{Sorted Adaptive Prediction Sets}$ (SAPS), which discards all probability values except for the maximum softmax probability. The key idea behind SAPS is to minimize the dependence of the non-conformity score on the probability values while retaining the uncertainty information. In this manner, SAPS can produce sets of small size and communicate instance-wise uncertainty. Theoretically, we provide a finite-sample coverage guarantee for SAPS and show that the expected set size of SAPS is always smaller than that of APS. Extensive experiments validate that SAPS not only lessens the prediction sets but also broadly enhances the conditional coverage rate and adaptiveness of prediction sets.
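Our reading of the SAPS score, sketched below: only the maximum softmax probability enters the score, with each additional rank contributing a constant lambda step; the calibration-quantile details (e.g., interpolation) are simplified.

```python
import numpy as np

def saps_scores(probs, labels, lam, rng):
    """SAPS-style non-conformity scores: keep only the maximum softmax
    probability; every rank beyond the top contributes a constant
    lambda step, plus uniform noise u for randomized tie-breaking.
    probs: (n, K) softmax outputs, labels: (n,) true classes.
    """
    ranks = (-probs).argsort(axis=1).argsort(axis=1) + 1   # 1 = top class
    p_max = probs.max(axis=1)
    u = rng.uniform(size=len(labels))
    r = ranks[np.arange(len(labels)), labels]
    return np.where(r == 1, u * p_max, p_max + (r - 2 + u) * lam)

rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(10), size=500)           # toy calibration set
cal_labels = rng.integers(0, 10, size=500)
scores = saps_scores(cal_probs, cal_labels, lam=0.1, rng=rng)
alpha = 0.1
qhat = np.quantile(scores, np.ceil((1 - alpha) * 501) / 500)
# prediction set for a new input: all classes whose SAPS score <= qhat
```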
results: Experiments on the BraTS2021 medical imaging dataset show that the method outperforms existing approaches, confirming its effectiveness and robustness.
Abstract
Anomaly detection is the process of identifying atypical data samples that significantly deviate from the majority of the dataset. In the realm of clinical screening and diagnosis, detecting abnormalities in medical images holds great importance. Typically, clinical practice provides access to a vast collection of normal images, while abnormal images are relatively scarce. We hypothesize that abnormal images and their associated features tend to manifest in low-density regions of the data distribution. Following this assumption, we turn to diffusion ODEs for unsupervised anomaly detection, given their tractability and superior performance in density estimation tasks. More precisely, we propose a new anomaly detection method based on diffusion ODEs that estimates the density of features extracted from multi-scale medical images. Our anomaly scoring mechanism computes the negative log-likelihood of features extracted from medical images at different scales, quantified in bits per dimension. Furthermore, we propose a reconstruction-based anomaly localization suited to our method. Our proposed method not only identifies anomalies but also provides interpretability at both the image and pixel levels. Through experiments on the BraTS2021 medical dataset, our proposed method outperforms existing methods. These results confirm the effectiveness and robustness of our method.
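The anomaly score reduces to the standard nats-to-bits-per-dimension conversion; a short helper for reference:

```python
import math

def bits_per_dim(log_likelihood_nats, num_dims):
    """Convert a log-likelihood (in nats, e.g. from a diffusion ODE's
    exact density) into the bits-per-dimension score used for anomaly
    scoring: higher BPD (lower density) means more anomalous.
    """
    return -log_likelihood_nats / (num_dims * math.log(2))

# e.g. a 64x64 single-channel feature map with log p(x) = -12000 nats
print(bits_per_dim(-12000.0, 64 * 64))   # ~4.23 bits/dim
```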
Boundary Discretization and Reliable Classification Network for Temporal Action Detection
results: Experiments on different benchmarks show that BDRC-Net achieves state-of-the-art performance; for example, it reaches an average mAP of 68.6% on THUMOS'14, outperforming the previous best method by 1.5%.
Abstract
Temporal action detection aims to recognize the action category and determine the starting and ending time of each action instance in untrimmed videos. The mixed methods have achieved remarkable performance by simply merging anchor-based and anchor-free approaches. However, there are still two crucial issues in the mixed framework: (1) Brute-force merging and handcrafted anchors design affect the performance and practical application of the mixed methods. (2) A large number of false positives in action category predictions further impact the detection performance. In this paper, we propose a novel Boundary Discretization and Reliable Classification Network (BDRC-Net) that addresses the above issues by introducing boundary discretization and reliable classification modules. Specifically, the boundary discretization module (BDM) elegantly merges anchor-based and anchor-free approaches in the form of boundary discretization, avoiding the handcrafted anchors design required by traditional mixed methods. Furthermore, the reliable classification module (RCM) predicts reliable action categories to reduce false positives in action category predictions. Extensive experiments conducted on different benchmarks demonstrate that our proposed method achieves favorable performance compared with the state-of-the-art. For example, BDRC-Net hits an average mAP of 68.6% on THUMOS'14, outperforming the previous best by 1.5%. The code will be released at https://github.com/zhenyingfang/BDRC-Net.
Learning Stackable and Skippable LEGO Bricks for Efficient, Reconfigurable, and Variable-Resolution Diffusion Modeling
for: This paper aims to improve the efficiency and adaptability of diffusion models for image generation by introducing a new network architecture built from LEGO bricks.
methods: The LEGO brick integrates Local-feature Enrichment and Global-content Orchestration; bricks can be selectively skipped at test time to reduce sampling costs and to generate higher-resolution images.
results: The proposed method enhances training efficiency, expedites convergence, and facilitates variable-resolution image generation while maintaining strong generative performance, and it significantly reduces sampling time compared to other methods.
Abstract
Diffusion models excel at generating photo-realistic images but come with significant computational costs in both training and sampling. While various techniques address these computational challenges, a less-explored issue is designing an efficient and adaptable network backbone for iterative refinement. Current options like U-Net and Vision Transformer often rely on resource-intensive deep networks and lack the flexibility needed for generating images at variable resolutions or with a smaller network than used in training. This study introduces LEGO bricks, which seamlessly integrate Local-feature Enrichment and Global-content Orchestration. These bricks can be stacked to create a test-time reconfigurable diffusion backbone, allowing selective skipping of bricks to reduce sampling costs and generate higher-resolution images than the training data. LEGO bricks enrich local regions with an MLP and transform them using a Transformer block while maintaining a consistent full-resolution image across all bricks. Experimental results demonstrate that LEGO bricks enhance training efficiency, expedite convergence, and facilitate variable-resolution image generation while maintaining strong generative performance. Moreover, LEGO significantly reduces sampling time compared to other methods, establishing it as a valuable enhancement for diffusion models.
3DS-SLAM: A 3D Object Detection based Semantic SLAM towards Dynamic Indoor Environments
paper_authors: Ghanta Sai Krishna, Kundrapu Supriya, Sabur Baidya
for: This work tackles the loss of camera localization accuracy caused by changing elements in the environment, proposing 3DS-SLAM, a 3D semantic SLAM algorithm that delivers accurate localization in dynamic indoor environments.
methods: The method detects dynamic objects in point clouds with a 3D part-aware hybrid transformer and introduces a dynamic feature filter based on HDBSCAN clustering to improve localization accuracy.
results: Compared with ORB-SLAM2, 3DS-SLAM shows an average improvement of 98.01% on the dynamic sequences of the TUM RGB-D dataset, and it also surpasses four other leading SLAM systems designed for dynamic environments.
The existence of variable factors within the environment can cause a decline in camera localization accuracy, as it violates the fundamental assumption of a static environment in Simultaneous Localization and Mapping (SLAM) algorithms. Recent semantic SLAM systems for dynamic environments rely solely on 2D semantic information, rely solely on geometric information, or combine the two in a loosely integrated manner. In this research paper, we introduce 3DS-SLAM, a 3D semantic SLAM tailored for dynamic scenes with visual 3D object detection. 3DS-SLAM is a tightly coupled algorithm that resolves semantic and geometric constraints sequentially. We designed a 3D part-aware hybrid transformer for point cloud-based object detection to identify dynamic objects. Subsequently, we propose a dynamic feature filter based on HDBSCAN clustering to extract objects with significant absolute depth differences. When compared against ORB-SLAM2, 3DS-SLAM exhibits an average improvement of 98.01% across the dynamic sequences of the TUM RGB-D dataset. Furthermore, it surpasses the performance of the other four leading SLAM systems designed for dynamic environments.
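The HDBSCAN-based dynamic feature filter can be sketched as cluster-then-threshold; the data layout and threshold values below are assumptions, not the paper's settings.

```python
import numpy as np
import hdbscan

def filter_dynamic_features(points, depth_diff, min_cluster_size=20,
                            depth_thresh=0.5):
    """Cluster 3D feature points with HDBSCAN, then flag clusters whose
    mean absolute depth difference exceeds a threshold as dynamic.

    points:     (N, 3) array of 3D feature point positions
    depth_diff: (N,) per-point absolute depth difference signal
    """
    labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(points)
    dynamic = np.zeros(len(points), dtype=bool)
    for c in set(labels) - {-1}:                 # -1 = HDBSCAN noise points
        idx = labels == c
        if np.abs(depth_diff[idx]).mean() > depth_thresh:
            dynamic[idx] = True                  # whole cluster marked dynamic
    return dynamic
```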
Leveraging Diffusion-Based Image Variations for Robust Training on Poisoned Data
results: With this approach, one can train student models that are robust to backdoor triggers, remaining consistent across a variety of trigger patterns and resisting potential backdoor attacks.
Abstract
Backdoor attacks pose a serious security threat for training neural networks as they surreptitiously introduce hidden functionalities into a model. Such backdoors remain silent during inference on clean inputs, evading detection due to inconspicuous behavior. However, once a specific trigger pattern appears in the input data, the backdoor activates, causing the model to execute its concealed function. Detecting such poisoned samples within vast datasets is virtually impossible through manual inspection. To address this challenge, we propose a novel approach that enables model training on potentially poisoned datasets by utilizing the power of recent diffusion models. Specifically, we create synthetic variations of all training samples, leveraging the inherent resilience of diffusion models to potential trigger patterns in the data. By combining this generative approach with knowledge distillation, we produce student models that maintain their general performance on the task while exhibiting robust resistance to backdoor triggers.
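The distillation step pairing a teacher with a student trained on the diffusion-generated variations can use a standard Hinton-style KD objective; the sketch below shows that generic loss, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, T=2.0):
    """Temperature-scaled KL divergence between teacher and student
    predictions, evaluated here on synthetic (diffusion-generated)
    variations of the possibly poisoned training images.
    """
    p_t = F.softmax(teacher_logits / T, dim=-1)
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T

loss = distill_loss(torch.randn(8, 10), torch.randn(8, 10))
```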
CoinSeg: Contrast Inter- and Intra- Class Representations for Incremental Segmentation
results: CoinSeg is validated on the Pascal VOC 2012 and ADE20K datasets under multiple incremental scenarios and achieves superior results, especially in the more demanding and realistic long-term scenarios.
Abstract
Class incremental semantic segmentation aims to strike a balance between the model's stability and plasticity by maintaining old knowledge while adapting to new concepts. However, most state-of-the-art methods use the freeze strategy for stability, which compromises the model's plasticity. In contrast, releasing parameter training for plasticity could lead to the best performance for all categories, but this requires discriminative feature representations. Therefore, we prioritize the model's plasticity and propose Contrast inter- and intra-class representations for Incremental Segmentation (CoinSeg), which pursues discriminative representations for flexible parameter tuning. Inspired by the Gaussian mixture model, which samples from a mixture of Gaussian distributions, CoinSeg emphasizes intra-class diversity with multiple contrastive representation centroids. Specifically, we use mask proposals to identify regions with strong objectness that are likely to be diverse instances/centroids of a category. These mask proposals are then used in contrastive representations to reinforce intra-class diversity. Meanwhile, to avoid bias from intra-class diversity, we also apply category-level pseudo-labels to enhance category-level consistency and inter-category diversity. Additionally, CoinSeg ensures the model's stability and alleviates forgetting through a specific flexible tuning strategy. We validate CoinSeg on the Pascal VOC 2012 and ADE20K datasets with multiple incremental scenarios and achieve superior results compared to previous state-of-the-art methods, especially in the more challenging and realistic long-term scenarios. Code is available at https://github.com/zkzhang98/CoinSeg.
JointNet: Extending Text-to-Image Diffusion for Dense Distribution Modeling
results: Using RGBD diffusion as an example, the paper demonstrates the effectiveness of JointNet and, through extensive experiments, shows its applicability to a wide range of tasks, including joint RGBD generation, dense depth prediction, depth-conditioned image generation, and coherent tile-based 3D panorama generation.
Abstract
We introduce JointNet, a novel neural network architecture for modeling the joint distribution of images and an additional dense modality (e.g., depth maps). JointNet is extended from a pre-trained text-to-image diffusion model, where a copy of the original network is created for the new dense modality branch and is densely connected with the RGB branch. The RGB branch is locked during network fine-tuning, which enables efficient learning of the new modality distribution while maintaining the strong generalization ability of the large-scale pre-trained diffusion model. We demonstrate the effectiveness of JointNet by using RGBD diffusion as an example and through extensive experiments, showcasing its applicability in a variety of applications, including joint RGBD generation, dense depth prediction, depth-conditioned image generation, and coherent tile-based 3D panorama generation.
results: After training, the attention mechanism accurately locates the style-relevant local parts of fonts and can be used for local style-aware font generation.
Abstract
When we compare fonts, we often pay attention to styles of local parts, such as serifs and curvatures. This paper proposes an attention mechanism to find important local parts. The local parts with larger attention are then considered important. The proposed mechanism can be trained in a quasi-self-supervised manner that requires no manual annotation other than knowing that a set of character images is from the same font, such as Helvetica. After confirming that the trained attention mechanism can find style-relevant local parts, we utilize the resulting attention for local style-aware font generation. Specifically, we design a new reconstruction loss function to put more weight on the local parts with larger attention for generating character images with more accurate style realization. This loss function has the merit of applicability to various font generation models. Our experimental results show that the proposed loss function improves the quality of generated character images by several few-shot font generation models.
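One way to realize the attention-weighted reconstruction loss is to normalize the attention map into per-pixel weights; the weighting scheme below is our illustration, not the paper's exact loss.

```python
import torch

def attention_weighted_recon_loss(x, x_hat, attn, eps=1e-8):
    """Reconstruction loss that up-weights pixels in local parts with
    larger attention, so style-relevant regions dominate the objective.
    x, x_hat: (B, C, H, W) images; attn: (B, 1, H, W) attention maps.
    """
    w = attn / (attn.sum(dim=(2, 3), keepdim=True) + eps)  # per-image weights
    return (w * (x - x_hat).abs()).sum(dim=(1, 2, 3)).mean()

loss = attention_weighted_recon_loss(
    torch.rand(4, 1, 64, 64), torch.rand(4, 1, 64, 64), torch.rand(4, 1, 64, 64)
)
```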
CrowdRec: 3D Crowd Reconstruction from Single Color Images
results: By applying a crowd-constrained optimization to the parameters of a single-person network, accurate body poses and shapes with reasonable absolute positions can be obtained from large-scale crowd images, without training on a large-scale 3D crowd dataset.
Abstract
This is a technical report for the GigaCrowd challenge. Reconstructing 3D crowds from monocular images is a challenging problem due to mutual occlusions, severe depth ambiguity, and complex spatial distributions. Since no large-scale 3D crowd dataset is available to train a robust model, current multi-person mesh recovery methods can hardly achieve satisfactory performance in crowded scenes. In this paper, we exploit crowd features and propose a crowd-constrained optimization to improve the common single-person method on crowd images. To avoid scale variations, we first detect human bounding boxes and 2D poses from the original images with off-the-shelf detectors. Then, we train a single-person mesh recovery network using existing in-the-wild image datasets. To promote a more reasonable spatial distribution, we further propose a crowd constraint to refine the single-person network parameters. With the optimization, we can obtain accurate body poses and shapes with reasonable absolute positions from a large-scale crowd image using a single-person backbone. The code will be publicly available at https://github.com/boycehbz/CrowdRec.
Precise Payload Delivery via Unmanned Aerial Vehicles: An Approach Using Object Detection Algorithms
results: Achieves a 500% improvement in average horizontal precision over conventional GPS-based methods. Abstract
Recent years have seen tremendous advancements in the area of autonomous payload delivery via unmanned aerial vehicles, or drones. However, most of these works involve delivering the payload at a predetermined location using its GPS coordinates. By relying on GPS coordinates for navigation, the precision of payload delivery is restricted to the accuracy of the GPS network and the availability and strength of the GPS connection, which may be severely restricted by the weather condition at the time and place of operation. In this work we describe the development of a micro-class UAV and propose a novel navigation method that improves the accuracy of conventional navigation methods by incorporating a deep-learning-based computer vision approach to identify and precisely align the UAV with a target marked at the payload delivery position. This proposed method achieves a 500% increase in average horizontal precision over conventional GPS-based approaches.
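A minimal sketch of the alignment idea (not the authors' controller; the detector output format, gain, and sign conventions are all assumptions): compute the pixel offset between the detected target's center and the image center, and convert it into horizontal velocity commands with a proportional controller.

```python
def alignment_command(bbox, image_w, image_h, k=0.002):
    """Proportional controller from a detected target box to body-frame velocities.

    bbox: (x_min, y_min, x_max, y_max) from an object detector.
    Returns (vx, vy) commands that push the target toward the image center.
    """
    x_min, y_min, x_max, y_max = bbox
    cx, cy = (x_min + x_max) / 2, (y_min + y_max) / 2
    err_x = cx - image_w / 2       # positive: target right of image center
    err_y = cy - image_h / 2       # positive: target below image center
    return -k * err_x, -k * err_y  # move so the error shrinks

print(alignment_command((300, 220, 360, 280), 640, 480))
```

Run in a loop against the downward camera, this closes the loop on the visual marker rather than on the GPS fix, which is where the precision gain comes from.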
Adversarial Masked Image Inpainting for Robust Detection of Mpox and Non-Mpox
results: Validated on the MSLD dataset and images of eighteen non-mpox skin diseases, MIM achieves an average AUROC of 0.8237. The study also demonstrates the drawbacks of conventional classification models, supports MIM's potential through clinical validation, and releases an online smartphone app providing free testing in affected areas. Abstract
Due to the lack of efficient mpox diagnostic technology, mpox cases continue to increase. Recently, the great potential of deep learning models in detecting mpox and non-mpox has been proven. However, existing models learn image representations via image classification, which makes them easily susceptible to interference from real-world noise, reliant on diverse non-mpox images, and unable to detect abnormal input. These drawbacks make classification models inapplicable in real-world settings. To address these challenges, we propose "Mask, Inpainting, and Measure" (MIM). In MIM's pipeline, a generative adversarial network only learns mpox image representations by inpainting the masked mpox images. Then, MIM determines whether the input belongs to mpox by measuring the similarity between the inpainted image and the original image. The underlying intuition is that since MIM solely models mpox images, it struggles to accurately inpaint non-mpox images in real-world settings. Without utilizing any non-mpox images, MIM cleverly detects mpox and non-mpox and can handle abnormal inputs. We used the recognized mpox dataset (MSLD) and images of eighteen non-mpox skin diseases to verify the effectiveness and robustness of MIM. Experimental results show that the average AUROC of MIM achieves 0.8237. In addition, we demonstrated the drawbacks of classification models and buttressed the potential of MIM through clinical validation. Finally, we developed an online smartphone app to provide free testing to the public in affected areas. This work first employs generative models to improve mpox detection and provides new insights into binary decision-making tasks in medical images.
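The decision rule behind MIM can be sketched independently of the GAN itself. Assuming `inpaint` stands in for the trained mpox-only inpainting generator (and using a plain L2 distance in place of the paper's similarity measure), the test-time logic masks the input, inpaints it, and thresholds the reconstruction error:

```python
import numpy as np

def mim_score(image, inpaint, mask):
    """Higher score = worse reconstruction = less likely to be mpox."""
    masked = image * (1 - mask)
    restored = inpaint(masked, mask)   # generator trained only on mpox images
    return float(np.mean((restored - image) ** 2))

def is_mpox(image, inpaint, mask, threshold):
    # The model only knows how to inpaint mpox skin, so non-mpox (or
    # abnormal) inputs reconstruct poorly and exceed the threshold.
    return mim_score(image, inpaint, mask) < threshold

# toy usage with a dummy "generator" and a random image
img = np.random.rand(224, 224, 3)
msk = np.zeros((224, 224, 3)); msk[80:140, 80:140] = 1
print(is_mpox(img, lambda m, k: m + 0.5 * k, msk, threshold=0.05))
```

The appeal of this framing is that the threshold is calibrated on mpox images alone, so no non-mpox training data is ever needed.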
Advancing Pose-Guided Image Synthesis with Progressive Conditional Diffusion Models
results: Results show that the proposed Progressive Conditional Diffusion Models (PCDMs) synthesize high-quality, high-fidelity person images under distinct poses while preserving naturalness and fine detail in challenging scenarios. Abstract
Recent work has showcased the significant potential of diffusion models in pose-guided person image synthesis. However, owing to the inconsistency in pose between the source and target images, synthesizing an image with a distinct pose, relying exclusively on the source image and target pose information, remains a formidable challenge. This paper presents Progressive Conditional Diffusion Models (PCDMs) that incrementally bridge the gap between person images under the target and source poses through three stages. Specifically, in the first stage, we design a simple prior conditional diffusion model that predicts the global features of the target image by mining the global alignment relationship between pose coordinates and image appearance. Then, the second stage establishes a dense correspondence between the source and target images using the global features from the previous stage, and an inpainting conditional diffusion model is proposed to further align and enhance the contextual features, generating a coarse-grained person image. In the third stage, we propose a refining conditional diffusion model to utilize the coarsely generated image from the previous stage as a condition, achieving texture restoration and enhancing fine-detail consistency. The three-stage PCDMs work progressively to generate the final high-quality and high-fidelity synthesized image. Both qualitative and quantitative results demonstrate the consistency and photorealism of our proposed PCDMs under challenging scenarios.The code and model will be available at https://github.com/muzishen/PCDMs.
Improving Compositional Text-to-image Generation with Large Vision-Language Models
results: Experimental results show that the proposed method significantly improves text-image alignment in compositional text-to-image generation, particularly with respect to object count, attribute binding, spatial relationships, and aesthetic quality. Abstract
Recent advancements in text-to-image models, particularly diffusion models, have shown significant promise. However, compositional text-to-image models frequently encounter difficulties in generating high-quality images that accurately align with input texts describing multiple objects, variable attributes, and intricate spatial relationships. To address this limitation, we employ large vision-language models (LVLMs) for multi-dimensional assessment of the alignment between generated images and their corresponding input texts. Utilizing this assessment, we fine-tune the diffusion model to enhance its alignment capabilities. During the inference phase, an initial image is produced using the fine-tuned diffusion model. The LVLM is then employed to pinpoint areas of misalignment in the initial image, which are subsequently corrected using the image editing algorithm until no further misalignments are detected by the LVLM. The resultant image is consequently more closely aligned with the input text. Our experimental results validate that the proposed methodology significantly improves text-image alignment in compositional image generation, particularly with respect to object number, attribute binding, spatial relationships, and aesthetic quality.
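The inference-time loop described above is straightforward to sketch. Here `generate`, `find_misalignments`, and `edit` are assumed stand-ins for the fine-tuned diffusion model, the LVLM assessor, and the image-editing algorithm, none of which the abstract specifies beyond their roles:

```python
def aligned_generation(prompt, generate, find_misalignments, edit, max_rounds=5):
    """Generate, then repeatedly let the LVLM point out misaligned regions
    and correct them until none remain (or the round budget is exhausted)."""
    image = generate(prompt)
    for _ in range(max_rounds):
        issues = find_misalignments(image, prompt)  # LVLM assessment
        if not issues:
            break
        image = edit(image, issues)                 # image-editing algorithm
    return image

# toy stubs so the sketch runs end to end
fixes = iter([["add second cat"], []])
out = aligned_generation(
    "two cats",
    generate=lambda p: {"objects": 1},
    find_misalignments=lambda im, p: next(fixes),
    edit=lambda im, iss: {**im, "objects": im["objects"] + 1},
)
print(out)  # {'objects': 2}
```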
Three-Dimensional Medical Image Fusion with Deformable Cross-Attention
paper_authors: Lin Liu, Xinxin Fan, Chulong Zhang, Jingjing Dai, Yaoqin Xie, Xiaokun Liang
for: Multimodal medical image fusion, to improve disease recognition and tumor detection.
methods: An innovative unsupervised feature mutual learning fusion network with a Deformable Cross Feature Blend (DCFB) module that helps the two modalities discern their respective similarities and differences.
results: Applied to 3D MRI and PET images from the ADNI dataset, the DCFB module produces high-quality MRI-PET fusion images that outperform traditional 2D fusion methods on metrics such as PSNR and SSIM. Crucially, the ability to fuse 3D images increases the information available to physicians and researchers, a major step forward for the field. Abstract
Multimodal medical image fusion plays an instrumental role in several areas of medical image processing, particularly in disease recognition and tumor detection. Traditional fusion methods tend to process each modality independently before combining the features and reconstructing the fusion image. However, this approach often neglects the fundamental commonalities and disparities between multimodal information. Furthermore, the prevailing methodologies are largely confined to fusing two-dimensional (2D) medical image slices, leading to a lack of contextual supervision in the fusion images and subsequently, a decreased information yield for physicians relative to three-dimensional (3D) images. In this study, we introduce an innovative unsupervised feature mutual learning fusion network designed to rectify these limitations. Our approach incorporates a Deformable Cross Feature Blend (DCFB) module that facilitates the dual modalities in discerning their respective similarities and differences. We have applied our model to the fusion of 3D MRI and PET images obtained from 660 patients in the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset. Through the application of the DCFB module, our network generates high-quality MRI-PET fusion images. Experimental results demonstrate that our method surpasses traditional 2D image fusion methods in performance metrics such as Peak Signal to Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM). Importantly, the capacity of our method to fuse 3D images enhances the information available to physicians and researchers, thus marking a significant step forward in the field. The code will soon be available online.
Towards More Efficient Depression Risk Recognition via Gait
paper_authors: Min Ren, Muchan Tao, Xuecai Hu, Xiaotong Liu, Qiong Li, Yongzhen Huang
for: This study develops a deep-learning-based depression risk recognition model for early identification of depression in primary care, to prevent severe episodes and relapse and to ease the emotional and financial burdens of depression.
methods: The study first constructs a large-scale gait database covering 1,200 participants, 40,000 gait sequences, six viewpoints, and three types of attire, and then proposes a deep-learning-based depression risk recognition model that goes beyond hand-crafted approaches.
results: Experiments on the constructed large-scale database validate the model's effectiveness and yield numerous instructive insights, highlighting the great potential of gait-based depression risk recognition. Abstract
Depression, a highly prevalent mental illness, affects over 280 million individuals worldwide. Early detection and timely intervention are crucial for promoting remission, preventing relapse, and alleviating the emotional and financial burdens associated with depression. However, patients with depression often go undiagnosed in the primary care setting. Unlike many physiological illnesses, depression lacks objective indicators for recognizing depression risk, and existing methods for depression risk recognition are time-consuming and often encounter a shortage of trained medical professionals. The correlation between gait and depression risk has been empirically established. Gait can serve as a promising objective biomarker, offering the advantage of efficient and convenient data collection. However, current methods for recognizing depression risk based on gait have only been validated on small, private datasets, lacking large-scale publicly available datasets for research purposes. Additionally, these methods are primarily limited to hand-crafted approaches. Gait is a complex form of motion, and hand-crafted gait features often only capture a fraction of the intricate associations between gait and depression risk. Therefore, this study first constructs a large-scale gait database, encompassing over 1,200 individuals, 40,000 gait sequences, and covering six perspectives and three types of attire. Two commonly used psychological scales are provided as depression risk annotations. Subsequently, a deep learning-based depression risk recognition model is proposed, overcoming the limitations of hand-crafted approaches. Through experiments conducted on the constructed large-scale database, the effectiveness of the proposed method is validated, and numerous instructive insights are presented in the paper, highlighting the significant potential of gait-based depression risk recognition.
MuseChat: A Conversational Music Recommendation System for Videos
for: This paper aims to provide an innovative dialog-based music recommendation system that offers interactive user engagement and personalized music selections tailored for input videos.
methods: The paper introduces a conversation-synthesis method that simulates a two-turn interaction between a user and a recommendation system, leveraging pre-trained music tags and artist information. It also introduces a multi-modal recommendation engine that matches music with visual cues from the video, user feedback, and textual input.
results: The paper shows that MuseChat, the proposed music recommendation system, surpasses existing state-of-the-art models in music retrieval tasks and pioneers the integration of the recommendation process within a natural language framework. Abstract
We introduce MuseChat, an innovative dialog-based music recommendation system. This unique platform not only offers interactive user engagement but also suggests music tailored for input videos, so that users can refine and personalize their music selections. In contrast, previous systems predominantly emphasized content compatibility, often overlooking the nuances of users' individual preferences. For example, all the datasets only provide basic music-video pairings or such pairings with textual music descriptions. To address this gap, our research offers three contributions. First, we devise a conversation-synthesis method that simulates a two-turn interaction between a user and a recommendation system, which leverages pre-trained music tags and artist information. In this interaction, users submit a video to the system, which then suggests a suitable music piece with a rationale. Afterwards, users communicate their musical preferences, and the system presents a refined music recommendation with reasoning. Second, we introduce a multi-modal recommendation engine that matches music either by aligning it with visual cues from the video or by harmonizing visual information, feedback from previously recommended music, and the user's textual input. Third, we bridge music representations and textual data with a Large Language Model(Vicuna-7B). This alignment equips MuseChat to deliver music recommendations and their underlying reasoning in a manner resembling human communication. Our evaluations show that MuseChat surpasses existing state-of-the-art models in music retrieval tasks and pioneers the integration of the recommendation process within a natural language framework.
High-Fidelity 3D Head Avatars Reconstruction through Spatially-Varying Expression Conditioned Neural Radiance Field
paper_authors: Minghan Qin, Yifan Liu, Yuelang Xu, Xiaochen Zhao, Yebin Liu, Haoqian Wang
for: A crucial aspect of 3D head avatar reconstruction lies in the details of facial expressions.
methods: We propose a novel Spatially-Varying Expression (SVE) conditioning, which can be generated by a simple MLP-based network and encompasses both spatial positional features and global expression information.
results: The method handles intricate expression details better than prior work and achieves high-fidelity 3D head avatar reconstruction, with better rendering and geometry quality than other state-of-the-art (SOTA) methods on mobile-phone-collected and public datasets. Abstract
One crucial aspect of 3D head avatar reconstruction lies in the details of facial expressions. Although recent NeRF-based photo-realistic 3D head avatar methods achieve high-quality avatar rendering, they still encounter challenges retaining intricate facial expression details because they overlook the potential of specific expression variations at different spatial positions when conditioning the radiance field. Motivated by this observation, we introduce a novel Spatially-Varying Expression (SVE) conditioning. The SVE can be obtained by a simple MLP-based generation network, encompassing both spatial positional features and global expression information. Benefiting from rich and diverse information of the SVE at different positions, the proposed SVE-conditioned neural radiance field can deal with intricate facial expressions and achieve realistic rendering and geometry details of high-fidelity 3D head avatars. Additionally, to further elevate the geometric and rendering quality, we introduce a new coarse-to-fine training strategy, including a geometry initialization strategy at the coarse stage and an adaptive importance sampling strategy at the fine stage. Extensive experiments indicate that our method outperforms other state-of-the-art (SOTA) methods in rendering and geometry quality on mobile phone-collected and public datasets.
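A minimal sketch of what an SVE-style conditioning network could look like (layer sizes and dimensions are our own; the paper only states that a simple MLP-based network combines spatial positional features with global expression information):

```python
import torch
import torch.nn as nn

class SVEGenerator(nn.Module):
    """Maps (3D sample position, global expression code) to a
    spatially-varying expression feature for conditioning the radiance field."""
    def __init__(self, expr_dim=64, feat_dim=32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + expr_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, feat_dim),
        )

    def forward(self, xyz, expr):
        # xyz: (N, 3) sample positions; expr: (expr_dim,) global expression
        expr = expr.expand(xyz.shape[0], -1)
        return self.mlp(torch.cat([xyz, expr], dim=-1))

sve = SVEGenerator()
feat = sve(torch.randn(1024, 3), torch.randn(64))
print(feat.shape)  # (1024, 32): a different expression feature per position
```

The key point is the output varying with `xyz`: the same global expression code conditions different spatial locations differently, which is what a single global code cannot do.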
Efficient Adaptation of Large Vision Transformer via Adapter Re-Composing
results: On 24 downstream image classification tasks, the method achieves compelling transfer learning performance while keeping the number of adaptation parameters to a minimum. Abstract
The advent of high-capacity pre-trained models has revolutionized problem-solving in computer vision, shifting the focus from training task-specific models to adapting pre-trained models. Consequently, effectively adapting large pre-trained models to downstream tasks in an efficient manner has become a prominent research area. Existing solutions primarily concentrate on designing lightweight adapters and their interaction with pre-trained models, with the goal of minimizing the number of parameters requiring updates. In this study, we propose a novel Adapter Re-Composing (ARC) strategy that addresses efficient pre-trained model adaptation from a fresh perspective. Our approach considers the reusability of adaptation parameters and introduces a parameter-sharing scheme. Specifically, we leverage symmetric down-/up-projections to construct bottleneck operations, which are shared across layers. By learning low-dimensional re-scaling coefficients, we can effectively re-compose layer-adaptive adapters. This parameter-sharing strategy in adapter design allows us to significantly reduce the number of new parameters while maintaining satisfactory performance, thereby offering a promising approach to compress the adaptation cost. We conduct experiments on 24 downstream image classification tasks using various Vision Transformer variants to evaluate our method. The results demonstrate that our approach achieves compelling transfer learning performance with a reduced parameter count. Our code is available at \href{https://github.com/DavidYanAnDe/ARC}{https://github.com/DavidYanAnDe/ARC}.
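The parameter-sharing scheme can be made concrete. In the sketch below (a simplification of the Adapter Re-Composing idea, not the authors' code; dimensions are illustrative), a single projection matrix is shared symmetrically as both down- and up-projection across every layer, and each layer only learns a low-dimensional re-scaling vector:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ARCAdapters(nn.Module):
    """Shared symmetric bottleneck + per-layer re-scaling coefficients."""
    def __init__(self, n_layers, dim=768, bottleneck=48):
        super().__init__()
        self.proj = nn.Parameter(torch.randn(bottleneck, dim) * 0.02)  # shared
        self.scales = nn.Parameter(torch.ones(n_layers, bottleneck))   # per-layer

    def forward(self, x, layer_idx):
        z = F.linear(x, self.proj)                 # shared down-projection
        z = z * self.scales[layer_idx]             # layer-adaptive re-scaling
        return x + F.linear(z, self.proj.t())      # symmetric up-projection

adapters = ARCAdapters(n_layers=12)
h = adapters(torch.randn(4, 197, 768), layer_idx=3)
print(h.shape, sum(p.numel() for p in adapters.parameters()))
```

With one shared projection plus twelve 48-dimensional scaling vectors, the new-parameter count stays far below twelve independent adapters, which is the compression the abstract describes.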
Spiking PointNet: Spiking Neural Networks for Point Clouds
for: This study explores whether Spiking Neural Networks (SNNs) can be generalized to 3D recognition and presents a spiking neural model for efficient deep learning on point clouds.
results: Experiments on ModelNet10 and ModelNet40 show that Spiking PointNet delivers better performance with multiple-time-step inference and can even outperform its ANN counterpart, which is rare in the SNN field. Spiking PointNet also shows notable speed and storage savings during training. Abstract
Recently, Spiking Neural Networks (SNNs), enjoying extreme energy efficiency, have drawn much research attention on 2D visual recognition and shown gradually increasing application potential. However, it still remains underexplored whether SNNs can be generalized to 3D recognition. To this end, we present Spiking PointNet in the paper, the first spiking neural model for efficient deep learning on point clouds. We discover that the two huge obstacles limiting the application of SNNs in point clouds are: the intrinsic optimization obstacle of SNNs that impedes the training of a big spiking model with large time steps, and the expensive memory and computation cost of PointNet that makes training a big spiking point model unrealistic. To solve the problems simultaneously, we present a trained-less but learning-more paradigm for Spiking PointNet with theoretical justifications and in-depth experimental analysis. In specific, our Spiking PointNet is trained with only a single time step but can obtain better performance with multiple time steps inference, compared to the one trained directly with multiple time steps. We conduct various experiments on ModelNet10, ModelNet40 to demonstrate the effectiveness of Spiking PointNet. Notably, our Spiking PointNet even can outperform its ANN counterpart, which is rare in the SNN field thus providing a potential research direction for the following work. Moreover, Spiking PointNet shows impressive speedup and storage saving in the training phase.
CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding
methods: The method is formulated as a sequence-to-sequence task that first predicts a chain of anchors and then the final target. Interpretability not only improves overall performance but also helps us understand why the network reaches its final decision.
results: Extensive experiments on the Nr3D, Sr3D, and Scanrefer benchmarks show higher performance and better data efficiency than existing methods. Moreover, the method can be easily integrated into existing architectures. Abstract
3D visual grounding is the ability to localize objects in 3D scenes conditioned by utterances. Most existing methods devote the referring head to localize the referred object directly, causing failure in complex scenarios. In addition, it does not illustrate how and why the network reaches the final decision. In this paper, we address the question: Can we design an interpretable 3D visual grounding framework that has the potential to mimic the human perception system? To this end, we formulate the 3D visual grounding problem as a sequence-to-sequence task by first predicting a chain of anchors and then the final target. Interpretability not only improves the overall performance but also helps us identify failure cases. Following the chain-of-thoughts approach enables us to decompose the referring task into interpretable intermediate steps, boosting the performance and making our framework extremely data-efficient. Moreover, our proposed framework can be easily integrated into any existing architecture. We validate our approach through comprehensive experiments on the Nr3D, Sr3D, and Scanrefer benchmarks and show consistent performance gains compared to existing methods without requiring manually annotated data. Furthermore, our proposed framework, dubbed CoT3DRef, is significantly data-efficient, whereas on the Sr3D dataset, when trained on only 10% of the data, we match the SOTA performance obtained by training on the entire dataset.
results: Experiments on a variety of tasks involving tools such as mathematical functions, knowledge graph relations, and complex real-world RESTful APIs show that ToolDec eliminates syntax errors entirely, achieving better performance and up to a 2x speedup without fine-tuning or in-context tool documentation. ToolDec can also select appropriate tools it has never seen and generalizes far better to new tools. Abstract
Large language models (LLMs) have shown promising capabilities in using external tools to solve complex problems. However, existing approaches either involve fine-tuning on tool demonstrations, which do not generalize to new tools without additional training, or providing tool documentation in context, limiting the number of tools. Both approaches often generate syntactically invalid tool calls. In this paper, we propose ToolDec, a finite-state machine-guided decoding algorithm for tool-augmented LLMs. ToolDec eliminates tool-related errors for any tool-augmented LLMs by ensuring valid tool names and type-conforming arguments. Furthermore, ToolDec enables LLM to effectively select tools using only the information contained in their names, with no need for fine-tuning or in-context documentation. We evaluated multiple prior methods and their ToolDec-enhanced versions on a variety of tasks involving tools like math functions, knowledge graph relations, and complex real-world RESTful APIs. Our experiments show that ToolDec reduces syntactic errors to zero, consequently achieving significantly better performance and as much as a 2x speedup. We also show that ToolDec achieves superior generalization performance on unseen tools, performing up to 8x better than the baselines.
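The core of FSM-guided decoding is a mask over the vocabulary at each step. A toy sketch of the idea (a character-level FSM over a tiny tool vocabulary; real systems operate on the model's tokenizer, and `score_fn` stands in for the LM's logits) that only ever permits continuations spelling a valid tool name:

```python
def build_prefix_fsm(tool_names):
    """States are valid prefixes of tool names; transitions are next chars."""
    states = {""}
    for name in tool_names:
        for i in range(1, len(name) + 1):
            states.add(name[:i])
    return states

def allowed_next_chars(prefix, states):
    return {s[len(prefix)] for s in states
            if s.startswith(prefix) and len(s) > len(prefix)}

def constrained_decode(score_fn, tool_names, max_len=32):
    """Greedy decoding where characters that would leave the FSM
    (i.e., not extend any valid tool name) are masked out."""
    states = build_prefix_fsm(tool_names)
    out = ""
    while len(out) < max_len:
        options = allowed_next_chars(out, states)
        if not options:          # a complete tool name was produced
            break
        out += max(options, key=lambda c: score_fn(out, c))
    return out

tools = ["add", "multiply", "sqrt"]
# stand-in for model scores: prefer characters later in the alphabet
print(constrained_decode(lambda prefix, c: ord(c), tools))  # -> "sqrt"
```

Because invalid continuations are masked rather than merely discouraged, a syntactically invalid tool call can never be emitted, regardless of what the model prefers.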
results: On both numerical reasoning and relational reasoning tasks, the HtT framework improves existing prompting methods by an absolute 11-27% in accuracy. Abstract
When prompted with a few examples and intermediate steps, large language models (LLMs) have demonstrated impressive performance in various reasoning tasks. However, prompting methods that rely on implicit knowledge in an LLM often hallucinate incorrect answers when the implicit knowledge is wrong or inconsistent with the task. To tackle this problem, we present Hypotheses-to-Theories (HtT), a framework that learns a rule library for reasoning with LLMs. HtT contains two stages, an induction stage and a deduction stage. In the induction stage, an LLM is first asked to generate and verify rules over a set of training examples. Rules that appear and lead to correct answers sufficiently often are collected to form a rule library. In the deduction stage, the LLM is then prompted to employ the learned rule library to perform reasoning to answer test questions. Experiments on both numerical reasoning and relational reasoning problems show that HtT improves existing prompting methods, with an absolute gain of 11-27% in accuracy. The learned rules are also transferable to different models and to different forms of the same problem.
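The induction-stage bookkeeping reduces to counting. A sketch of the rule-library filter (the thresholds are our own illustrative choices; `traces` would come from the LLM's verified generations over the training examples):

```python
from collections import Counter

def build_rule_library(traces, min_count=3, min_accuracy=0.7):
    """Keep rules that appear often and usually lead to correct answers.

    traces: list of (rules_used, answer_was_correct) pairs collected from
    the LLM's generations on the training set.
    """
    occurrences, correct = Counter(), Counter()
    for rules_used, ok in traces:
        for rule in set(rules_used):
            occurrences[rule] += 1
            if ok:
                correct[rule] += 1
    return {r for r, n in occurrences.items()
            if n >= min_count and correct[r] / n >= min_accuracy}

traces = [(["9+6=15", "carry 1"], True)] * 4 + [(["9+6=16"], False)] * 2
print(build_rule_library(traces))  # {'9+6=15', 'carry 1'}
```

In the deduction stage, the surviving library is simply placed in the prompt so the model applies verified rules instead of hallucinated ones.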
DKEC: Domain Knowledge Enhanced Multi-Label Classification for Electronic Health Records
methods: The paper introduces two innovations: first, a label-wise attention mechanism that incorporates a heterogeneous graph and domain ontologies to capture semantic relationships between medical entities; second, a simple yet effective group-wise training method based on label similarity that increases the samples available for rare classes.
results: Experimental results show that the method outperforms the state of the art in medical diagnosis prediction, particularly for few-shot (tail) classes. The study also examines DKEC's applicability to different language models and shows that DKEC helps smaller language models achieve performance comparable to large language models. Abstract
Multi-label text classification (MLTC) tasks in the medical domain often face long-tail label distribution, where rare classes have fewer training samples than frequent classes. Although previous works have explored different model architectures and hierarchical label structures to find important features, most of them neglect to incorporate the domain knowledge from medical guidelines. In this paper, we present DKEC, Domain Knowledge Enhanced Classifier for medical diagnosis prediction with two innovations: (1) a label-wise attention mechanism that incorporates a heterogeneous graph and domain ontologies to capture the semantic relationships between medical entities, (2) a simple yet effective group-wise training method based on similarity of labels to increase samples of rare classes. We evaluate DKEC on two real-world medical datasets: the RAA dataset, a collection of 4,417 patient care reports from emergency medical services (EMS) incidents, and a subset of 53,898 reports from the MIMIC-III dataset. Experimental results show that our method outperforms the state-of-the-art, particularly for the few-shot (tail) classes. More importantly, we study the applicability of DKEC to different language models and show that DKEC can help the smaller language models achieve comparable performance to large language models.
Computational Pathology at Health System Scale – Self-Supervised Foundation Models from Three Billion Images
paper_authors: Gabriele Campanella, Ricky Kwan, Eugene Fluder, Jennifer Zeng, Aryeh Stock, Brandon Veremis, Alexandros D. Polydorides, Cyrus Hedvat, Adam Schoenfeld, Chad Vanderbilt, Patricia Kovatch, Carlos Cordon-Cardo, Thomas J. Fuchs
results: Results show that self-supervised pre-training on large-scale pathology data improves downstream task performance compared to pre-training on natural images, with the DINO algorithm generalizing better across all tasks. Abstract
Recent breakthroughs in self-supervised learning have enabled the use of large unlabeled datasets to train visual foundation models that can generalize to a variety of downstream tasks. While this training paradigm is well suited for the medical domain where annotations are scarce, large-scale pre-training in the medical domain, and in particular pathology, has not been extensively studied. Previous work in self-supervised learning in pathology has leveraged smaller datasets for both pre-training and evaluating downstream performance. The aim of this project is to train the largest academic foundation model and benchmark the most prominent self-supervised learning algorithms by pre-training and evaluating downstream performance on large clinical pathology datasets. We collected the largest pathology dataset to date, consisting of over 3 billion images from over 423 thousand microscopy slides. We compared pre-training of visual transformer models using the masked autoencoder (MAE) and DINO algorithms. We evaluated performance on six clinically relevant tasks from three anatomic sites and two institutions: breast cancer detection, inflammatory bowel disease detection, breast cancer estrogen receptor prediction, lung adenocarcinoma EGFR mutation prediction, and lung cancer immunotherapy response prediction. Our results demonstrate that pre-training on pathology data is beneficial for downstream performance compared to pre-training on natural images. Additionally, the DINO algorithm achieved better generalization performance across all tasks tested. The presented results signify a phase change in computational pathology research, paving the way into a new era of more performant models based on large-scale, parallel pre-training at the billion-image scale.
Facial Forgery-based Deepfake Detection using Fine-Grained Features
results: Extensive experimental validation demonstrates the method's superiority over published research in cross-dataset and cross-manipulation generalization of deepfake detectors for the majority of experimental scenarios. Abstract
Facial forgery by deepfakes has caused major security risks and raised severe societal concerns. As a countermeasure, a number of deepfake detection methods have been proposed. Most of them model deepfake detection as a binary classification problem using a backbone convolutional neural network (CNN) architecture pretrained for the task. These CNN-based methods have demonstrated very high efficacy in deepfake detection with the Area under the Curve (AUC) as high as $0.99$. However, the performance of these methods degrades significantly when evaluated across datasets and deepfake manipulation techniques. This draws our attention towards learning more subtle, local, and discriminative features for deepfake detection. In this paper, we formulate deepfake detection as a fine-grained classification problem and propose a new fine-grained solution to it. Specifically, our method is based on learning subtle and generalizable features by effectively suppressing background noise and learning discriminative features at various scales for deepfake detection. Through extensive experimental validation, we demonstrate the superiority of our method over the published research in cross-dataset and cross-manipulation generalization of deepfake detectors for the majority of the experimental scenarios.
NEWTON: Are Large Language Models Capable of Physical Reasoning?
paper_authors: Yi Ru Wang, Jiafei Duan, Dieter Fox, Siddhartha Srinivasa
for: The paper aims to evaluate the physical reasoning abilities of large language models (LLMs) and provide a benchmark for assessing their performance in this area.
methods: The paper introduces a new repository and benchmark called NEWTON, which includes a collection of object-attribute pairs and 160,000 question-answer pairs to test the physical reasoning capabilities of LLMs. The authors also present a pipeline for generating customized benchmarks for specific applications.
results: The authors find that LLMs like GPT-4 demonstrate strong reasoning capabilities in scenario-based tasks but exhibit less consistency in object-attribute reasoning compared to humans. The paper highlights the potential of the NEWTON platform for evaluating and enhancing language models for physically grounded settings, such as robotic manipulation.Abstract
Large Language Models (LLMs), through their contextualized representations, have been empirically proven to encapsulate syntactic, semantic, word sense, and common-sense knowledge. However, there has been limited exploration of their physical reasoning abilities, specifically concerning the crucial attributes for comprehending everyday objects. To address this gap, we introduce NEWTON, a repository and benchmark for evaluating the physics reasoning skills of LLMs. Further, to enable domain-specific adaptation of this benchmark, we present a pipeline to enable researchers to generate a variant of this benchmark that has been customized to the objects and attributes relevant for their application. The NEWTON repository comprises a collection of 2800 object-attribute pairs, providing the foundation for generating infinite-scale assessment templates. The NEWTON benchmark consists of 160K QA questions, curated using the NEWTON repository to investigate the physical reasoning capabilities of several mainstream language models across foundational, explicit, and implicit reasoning tasks. Through extensive empirical analysis, our results highlight the capabilities of LLMs for physical reasoning. We find that LLMs like GPT-4 demonstrate strong reasoning capabilities in scenario-based tasks but exhibit less consistency in object-attribute reasoning compared to humans (50% vs. 84%). Furthermore, the NEWTON platform demonstrates its potential for evaluating and enhancing language models, paving the way for their integration into physically grounded settings, such as robotic manipulation. Project site: https://newtonreasoning.github.io
Answer Candidate Type Selection: Text-to-Text Language Model for Closed Book Question Answering Meets Knowledge Graphs
results: Improves the answer quality of pre-trained question answering systems, especially for questions with less popular entities. Abstract
Pre-trained Text-to-Text Language Models (LMs), such as T5 or BART yield promising results in the Knowledge Graph Question Answering (KGQA) task. However, the capacity of the models is limited and the quality decreases for questions with less popular entities. In this paper, we present a novel approach which works on top of the pre-trained Text-to-Text QA system to address this issue. Our simple yet effective method performs filtering and re-ranking of generated candidates based on their types derived from Wikidata "instance_of" property.
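The filtering-and-re-ranking step is simple to express. A sketch (the candidate scorer, the expected-type predictor, and the precomputed type map are stand-ins; `instance_of` corresponds to Wikidata property P31):

```python
def filter_and_rerank(candidates, entity_types, expected_type):
    """Keep generated answer candidates whose Wikidata `instance_of` types
    match the type expected for the question, then re-rank by score.

    candidates: list of (entity_label, generation_score) pairs.
    entity_types: precomputed map from label to its set of P31 type labels.
    """
    typed = [(e, s) for e, s in candidates
             if expected_type in entity_types.get(e, set())]
    kept = typed if typed else candidates     # fall back if nothing matches
    return sorted(kept, key=lambda es: es[1], reverse=True)

types = {"Paris": {"city"}, "Seine": {"river"}, "France": {"country"}}
cands = [("France", 0.9), ("Paris", 0.8), ("Seine", 0.3)]
print(filter_and_rerank(cands, types, expected_type="city"))  # Paris first
```

The fallback branch matters in practice: when type information is missing for every candidate, the system degrades gracefully to the base QA ranking instead of returning nothing.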
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
results: Experiments show that simply varying generation methods raises the misalignment rate of aligned models to over 95%, at 30x lower computational cost than previous attacks. Abstract
The rapid progress in open-source large language models (LLMs) is significantly advancing AI development. Extensive efforts have been made before model release to align their behavior with human values, with the primary goal of ensuring their helpfulness and harmlessness. However, even carefully aligned models can be manipulated maliciously, leading to unintended behaviors, known as "jailbreaks". These jailbreaks are typically triggered by specific text inputs, often referred to as adversarial prompts. In this work, we propose the generation exploitation attack, an extremely simple approach that disrupts model alignment by only manipulating variations of decoding methods. By exploiting different generation strategies, including varying decoding hyper-parameters and sampling methods, we increase the misalignment rate from 0% to more than 95% across 11 language models including LLaMA2, Vicuna, Falcon, and MPT families, outperforming state-of-the-art attacks with $30\times$ lower computational cost. Finally, we propose an effective alignment method that explores diverse generation strategies, which can reasonably reduce the misalignment rate under our attack. Altogether, our study underscores a major failure in current safety evaluation and alignment procedures for open-source LLMs, strongly advocating for more comprehensive red teaming and better alignment before releasing such models. Our code is available at https://github.com/Princeton-SysML/Jailbreak_LLM.
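The attack surface here is just the decoding configuration. A sketch of the sweep (the parameter names mirror common HuggingFace-style `generate` keywords for illustration; the generation function and misalignment checker are assumed stand-ins):

```python
import itertools

def generation_exploitation(prompt, generate_fn, is_misaligned):
    """Try many decoding configurations; report any that elicits a
    misaligned (jailbroken) response from an otherwise aligned model."""
    temperatures = [0.1, 0.7, 1.0, 1.5]
    top_ps = [0.5, 0.9, 1.0]
    top_ks = [10, 50, 0]        # 0: top-k disabled
    for t, p, k in itertools.product(temperatures, top_ps, top_ks):
        text = generate_fn(prompt, temperature=t, top_p=p, top_k=k)
        if is_misaligned(text):
            return {"temperature": t, "top_p": p, "top_k": k, "output": text}
    return None

# toy stand-ins: one decoding regime "breaks" the toy model
fake = lambda pr, temperature, top_p, top_k: "BAD" if temperature > 1.2 else "ok"
print(generation_exploitation("prompt", fake, lambda s: s == "BAD"))
```

The point the paper makes is that no adversarial prompt is needed: each configuration is a legitimate decoding setting, so alignment that holds only under the default setting is fragile.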
On the Interpretability of Part-Prototype Based Classifiers: A Human Centric Analysis
results: Experimental results show that the framework can assess the interpretability of various part-prototype networks from a human perspective, supporting them as an interpretable alternative to existing black-box image classifiers. Abstract
Part-prototype networks have recently become methods of interest as an interpretable alternative to many of the current black-box image classifiers. However, the interpretability of these methods from the perspective of human users has not been sufficiently explored. In this work, we have devised a framework for evaluating the interpretability of part-prototype-based models from a human perspective. The proposed framework consists of three actionable metrics and experiments. To demonstrate the usefulness of our framework, we performed an extensive set of experiments using Amazon Mechanical Turk. They not only show the capability of our framework in assessing the interpretability of various part-prototype-based models, but they also are, to the best of our knowledge, the most comprehensive work on evaluating such methods in a unified framework.
Sparse Fine-tuning for Inference Acceleration of Large Language Models
results: The study shows that sparse LLMs achieve CPU and GPU runtime speedups while preserving accuracy, and that for memory-bound LLMs sparsity can also reduce memory bandwidth. End-to-end results are shown for T5 (language translation), Whisper (speech translation), and open GPT-type (text generation) models, demonstrating that sparsity can reach 75% without accuracy drops. Abstract
We consider the problem of accurate sparse fine-tuning of large language models (LLMs), that is, fine-tuning pretrained LLMs on specialized tasks, while inducing sparsity in their weights. On the accuracy side, we observe that standard loss-based fine-tuning may fail to recover accuracy, especially at high sparsities. To address this, we perform a detailed study of distillation-type losses, determining an L2-based distillation approach we term SquareHead which enables accurate recovery even at higher sparsities, across all model types. On the practical efficiency side, we show that sparse LLMs can be executed with speedups by taking advantage of sparsity, for both CPU and GPU runtimes. While the standard approach is to leverage sparsity for computational reduction, we observe that in the case of memory-bound LLMs sparsity can also be leveraged for reducing memory bandwidth. We exhibit end-to-end results showing speedups due to sparsity, while recovering accuracy, on T5 (language translation), Whisper (speech translation), and open GPT-type (MPT for text generation). For MPT text generation, we show for the first time that sparse fine-tuning can reach 75% sparsity without accuracy drops, provide notable end-to-end speedups for both CPU and GPU inference, and highlight that sparsity is also compatible with quantization approaches. Models and software for reproducing our results are provided in Section 6.
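A rough sketch of what an L2 distillation term in the spirit of SquareHead could look like (our reading of the abstract's "L2-based distillation approach"; the per-layer normalization is an assumption, not the paper's exact formula):

```python
import torch

def squarehead_like_loss(student_feats, teacher_feats, eps=1e-8):
    """Layer-wise L2 distillation between intermediate representations,
    normalized by the teacher's magnitude so every layer contributes
    comparably regardless of its activation scale."""
    loss = 0.0
    for fs, ft in zip(student_feats, teacher_feats):
        loss = loss + (fs - ft).pow(2).mean() / (ft.pow(2).mean() + eps)
    return loss / len(student_feats)

teacher = [torch.randn(2, 16, 64) for _ in range(4)]       # dense model
student = [f + 0.1 * torch.randn_like(f) for f in teacher]  # sparse model
print(squarehead_like_loss(student, teacher).item())
```

Matching intermediate features layer by layer gives the sparse student a much denser training signal than the output logits alone, which is what enables accuracy recovery at high sparsities.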
PICProp: Physics-Informed Confidence Propagation for Uncertainty Quantification
results: The paper demonstrates the method's effectiveness through computational experiments and provides a theorem establishing its validity. Abstract
Standard approaches for uncertainty quantification in deep learning and physics-informed learning have persistent limitations. Indicatively, strong assumptions regarding the data likelihood are required, the performance highly depends on the selection of priors, and the posterior can be sampled only approximately, which leads to poor approximations because of the associated computational cost. This paper introduces and studies confidence interval (CI) estimation for deterministic partial differential equations as a novel problem. That is, to propagate confidence, in the form of CIs, from data locations to the entire domain with probabilistic guarantees. We propose a method, termed Physics-Informed Confidence Propagation (PICProp), based on bi-level optimization to compute a valid CI without making heavy assumptions. We provide a theorem regarding the validity of our method, and computational experiments, where the focus is on physics-informed learning.
Distributed Transfer Learning with 4th Gen Intel Xeon Processors
results: Using Intel Xeon processors with AMX and distributed training, the researchers achieved near state-of-the-art accuracy on an image classification task. Abstract
In this paper, we explore how transfer learning, coupled with Intel Xeon, specifically 4th Gen Intel Xeon scalable processor, defies the conventional belief that training is primarily GPU-dependent. We present a case study where we achieved near state-of-the-art accuracy for image classification on a publicly available Image Classification TensorFlow dataset using Intel Advanced Matrix Extensions(AMX) and distributed training with Horovod.
Reinforcement Learning in a Safety-Embedded MDP with Trajectory Optimization
results: The method excels on challenging Safety Gym tasks, achieving significantly higher rewards and near-zero safety violations during inference. Its real-world applicability is demonstrated through a safe and effective deployment on a real-robot task of pushing a box around obstacles. Abstract
Safe Reinforcement Learning (RL) plays an important role in applying RL algorithms to safety-critical real-world applications, addressing the trade-off between maximizing rewards and adhering to safety constraints. This work introduces a novel approach that combines RL with trajectory optimization to manage this trade-off effectively. Our approach embeds safety constraints within the action space of a modified Markov Decision Process (MDP). The RL agent produces a sequence of actions that are transformed into safe trajectories by a trajectory optimizer, thereby effectively ensuring safety and increasing training stability. This novel approach excels in its performance on challenging Safety Gym tasks, achieving significantly higher rewards and near-zero safety violations during inference. The method's real-world applicability is demonstrated through a safe and effective deployment in a real robot task of box-pushing around obstacles.
Scalable Semantic Non-Markovian Simulation Proxy for Reinforcement Learning
results: Compared with two high-fidelity simulators, the proxy achieves up to three orders of magnitude speed-up while preserving the quality of the learned policy. It can also model and leverage non-Markovian dynamics and instantaneous actions, and provides an explainable trace describing the outcomes of agent actions.
Abstract
Recent advances in reinforcement learning (RL) have shown much promise across a variety of applications. However, issues such as scalability, explainability, and Markovian assumptions limit its applicability in certain domains. We observe that many of these shortcomings emanate from the simulator as opposed to the RL training algorithms themselves. As such, we propose a semantic proxy for simulation based on a temporal extension to annotated logic. In comparison with two high-fidelity simulators, we show up to three orders of magnitude speed-up while preserving the quality of policy learned. In addition, we show the ability to model and leverage non-Markovian dynamics and instantaneous actions while providing an explainable trace describing the outcomes of the agent actions.
Mistral 7B
paper_authors: Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed
results: Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks and surpasses Llama 1 34B in reasoning, mathematics, and code generation. In addition, the Mistral 7B -- Instruct model outperforms Llama 2 13B -- Chat on both human and automated benchmarks.
Abstract
We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B -- Instruct, that surpasses the Llama 2 13B -- Chat model both on human and automated benchmarks. Our models are released under the Apache 2.0 license.
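A small sketch of the sliding window attention pattern: each token attends only to the previous `window` tokens (itself included), so attention cost stays linear in sequence length while stacked layers still propagate information beyond the window. The window size here is illustrative; Mistral 7B uses a much larger one.

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    return (j <= i) & (j > i - window)       # causal AND within the window

mask = sliding_window_mask(seq_len=8, window=4)
print(mask.int())
# Scores outside the mask would be set to -inf before softmax, e.g.:
# scores = scores.masked_fill(~mask, float("-inf"))
```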
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
methods: The paper trains probes on the LLM's internal activations to detect whether it is outputting true or false statements, drawing on three lines of evidence to study how LLMs represent truth: 1. visualizations of LLM true/false statement representations, which reveal clear linear structure; 2. transfer experiments in which probes trained on one dataset generalize to different datasets; 3. causal evidence obtained by surgically intervening in the LLM's forward pass, causing it to treat false statements as true and vice versa.
results: The paper finds that language models linearly represent the truth or falsehood of factual statements. It also introduces a novel probing technique, mass-mean probing, which generalizes better and is more causally implicated in model outputs than other probing techniques.
Abstract
Large Language Models (LLMs) have impressive capabilities, but are also prone to outputting falsehoods. Recent work has developed techniques for inferring whether a LLM is telling the truth by training probes on the LLM's internal activations. However, this line of work is controversial, with some authors pointing out failures of these probes to generalize in basic ways, among other conceptual issues. In this work, we curate high-quality datasets of true/false statements and use them to study in detail the structure of LLM representations of truth, drawing on three lines of evidence: 1. Visualizations of LLM true/false statement representations, which reveal clear linear structure. 2. Transfer experiments in which probes trained on one dataset generalize to different datasets. 3. Causal evidence obtained by surgically intervening in a LLM's forward pass, causing it to treat false statements as true and vice versa. Overall, we present evidence that language models linearly represent the truth or falsehood of factual statements. We also introduce a novel technique, mass-mean probing, which generalizes better and is more causally implicated in model outputs than other probing techniques.
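A minimal sketch of mass-mean probing as described in the abstract, with synthetic vectors standing in for real LLM activations: the probe direction is simply the difference between the mean activation over true statements and the mean over false ones.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
true_acts = rng.normal(loc=+0.5, scale=1.0, size=(200, d))   # stand-in data
false_acts = rng.normal(loc=-0.5, scale=1.0, size=(200, d))

theta = true_acts.mean(axis=0) - false_acts.mean(axis=0)      # mass-mean direction
midpoint = 0.5 * (true_acts.mean(axis=0) + false_acts.mean(axis=0))

def predict_true(x: np.ndarray) -> np.ndarray:
    """Classify an activation by which side of the midpoint it falls on."""
    return (x - midpoint) @ theta > 0

acc = np.mean(np.concatenate([predict_true(true_acts),
                              ~predict_true(false_acts)]))
print(f"probe accuracy on synthetic data: {acc:.2f}")
```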
results: Our experiments show that NECO achieves state-of-the-art results on both small- and large-scale OOD detection tasks, and exhibits strong generalization across different network architectures.
Abstract
Detecting out-of-distribution (OOD) data is a critical challenge in machine learning due to model overconfidence, often without awareness of their epistemological limits. We hypothesize that ``neural collapse'', a phenomenon affecting in-distribution data for models trained beyond loss convergence, also influences OOD data. To benefit from this interplay, we introduce NECO, a novel post-hoc method for OOD detection, which leverages the geometric properties of ``neural collapse'' and of principal component spaces to identify OOD data. Our extensive experiments demonstrate that NECO achieves state-of-the-art results on both small and large-scale OOD detection tasks while exhibiting strong generalization capabilities across different network architectures. Furthermore, we provide a theoretical explanation for the effectiveness of our method in OOD detection. We plan to release the code after the anonymity period.
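A hedged sketch of a NECO-style score, under the assumption (motivated by neural collapse) that in-distribution features concentrate in a low-dimensional principal subspace: a sample is scored by the fraction of its feature norm captured by that subspace. The exact normalization in the paper may differ, and synthetic features stand in for a real network's penultimate layer.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# ID features lie (almost) in a 10-dim subspace of a 128-dim feature space.
basis = rng.normal(size=(10, 128))
id_feats = rng.normal(size=(1000, 10)) @ basis \
           + 0.01 * rng.normal(size=(1000, 128))
ood_feats = rng.normal(size=(100, 128))        # no such structure

pca = PCA(n_components=10).fit(id_feats)

def neco_score(x: np.ndarray) -> np.ndarray:
    proj = pca.transform(x)                     # coords in ID principal subspace
    return np.linalg.norm(proj, axis=1) / np.linalg.norm(x, axis=1)

print("ID mean score :", neco_score(id_feats).mean())    # close to 1
print("OOD mean score:", neco_score(ood_feats).mean())   # markedly lower
```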
Advancing Transformer’s Capabilities in Commonsense Reasoning
methods: Introduces current ML-based methods, including knowledge transfer, model ensembling, and an additional pairwise contrastive objective.
results: Our best model outperforms the strongest previous works by ~15% absolute gains in Pairwise Accuracy and ~8.7% absolute gains in Standard Accuracy.
Abstract
Recent advances in general purpose pre-trained language models have shown great potential in commonsense reasoning. However, current works still perform poorly on standard commonsense reasoning benchmarks including the Com2Sense Dataset. We argue that this is due to a disconnect with current cutting-edge machine learning methods. In this work, we aim to bridge the gap by introducing current ML-based methods to improve general purpose pre-trained language models in the task of commonsense reasoning. Specifically, we experiment with and systematically evaluate methods including knowledge transfer, model ensemble, and introducing an additional pairwise contrastive objective. Our best model outperforms the strongest previous works by ~15\% absolute gains in Pairwise Accuracy and ~8.7\% absolute gains in Standard Accuracy.
$f$-Policy Gradients: A General Framework for Goal Conditioned RL using $f$-Divergences
methods: The paper proposes a novel way to encourage exploration called $f$-Policy Gradients ($f$-PG), which minimizes the $f$-divergence between the agent's state-visitation distribution and the goal. The authors derive gradients for various $f$-divergences to optimize this objective.
results: Experiments show that $f$-PG outperforms standard policy gradient methods on a challenging gridworld as well as the Point Maze and FetchReach environments.
Abstract
Goal-Conditioned Reinforcement Learning (RL) problems often have access to sparse rewards where the agent receives a reward signal only when it has achieved the goal, making policy optimization a difficult problem. Several works augment this sparse reward with a learned dense reward function, but this can lead to sub-optimal policies if the reward is misaligned. Moreover, recent works have demonstrated that effective shaping rewards for a particular problem can depend on the underlying learning algorithm. This paper introduces a novel way to encourage exploration called $f$-Policy Gradients, or $f$-PG. $f$-PG minimizes the f-divergence between the agent's state visitation distribution and the goal, which we show can lead to an optimal policy. We derive gradients for various f-divergences to optimize this objective. Our learning paradigm provides dense learning signals for exploration in sparse reward settings. We further introduce an entropy-regularized policy optimization objective, that we call $state$-MaxEnt RL (or $s$-MaxEnt RL) as a special case of our objective. We show that several metric-based shaping rewards like L2 can be used with $s$-MaxEnt RL, providing a common ground to study such metric-based shaping rewards with efficient exploration. We find that $f$-PG has better performance compared to standard policy gradient methods on a challenging gridworld as well as the Point Maze and FetchReach environments. More information on our website https://agarwalsiddhant10.github.io/projects/fpg.html.
OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text
results: Language models with 1.4B parameters trained on OpenWebMath surpass the performance of models trained on over 20x the amount of general language data.
Abstract
There is growing evidence that pretraining on high quality, carefully thought-out tokens such as code or mathematics plays an important role in improving the reasoning abilities of large language models. For example, Minerva, a PaLM model finetuned on billions of tokens of mathematical documents from arXiv and the web, reported dramatically improved performance on problems that require quantitative reasoning. However, because all known open source web datasets employ preprocessing that does not faithfully preserve mathematical notation, the benefits of large scale training on quantitative web documents are unavailable to the research community. We introduce OpenWebMath, an open dataset inspired by these works containing 14.7B tokens of mathematical webpages from Common Crawl. We describe in detail our method for extracting text and LaTeX content and removing boilerplate from HTML documents, as well as our methods for quality filtering and deduplication. Additionally, we run small-scale experiments by training 1.4B parameter language models on OpenWebMath, showing that models trained on 14.7B tokens of our dataset surpass the performance of models trained on over 20x the amount of general language data. We hope that our dataset, openly released on the Hugging Face Hub, will help spur advances in the reasoning abilities of large language models.
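The dataset is published on the Hugging Face Hub; a minimal way to stream a few documents is sketched below (the repo id and field names follow the public dataset card, so treat them as assumptions if the card changes):

```python
from itertools import islice
from datasets import load_dataset

ds = load_dataset("open-web-math/open-web-math", split="train", streaming=True)
for doc in islice(ds, 2):
    print(doc["url"])
    print(doc["text"][:200])   # LaTeX is preserved in the extracted text
```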
A Supervised Embedding and Clustering Anomaly Detection method for classification of Mobile Network Faults
paper_authors: R. Mosayebi, H. Kia, A. Kianpour Raki
for: The paper aims to efficiently identify faulty alarm logs in mobile networks, easing the burden of manual monitoring and helping network operators find and resolve problems faster.
methods: The method, Supervised Embedding and Clustering Anomaly Detection (SEMC-AD), uses historical alarm logs and their labels to extract numerical representations of each log, effectively handling the imbalanced classification caused by the small proportion of anomalies without resorting to one-hot encoding.
results: Experiments show that SEMC-AD detects 99% of anomalous alarm logs, whereas random forest and XGBoost detect only 86% and 81%, respectively. SEMC-AD is more efficient on datasets with numerous categorical features, significantly enhancing anomaly detection and reducing operator burden.
Abstract
The paper introduces Supervised Embedding and Clustering Anomaly Detection (SEMC-AD), a method designed to efficiently identify faulty alarm logs in a mobile network and alleviate the challenges of manual monitoring caused by the growing volume of alarm logs. SEMC-AD employs a supervised embedding approach based on deep neural networks, utilizing historical alarm logs and their labels to extract numerical representations for each log, effectively addressing the issue of imbalanced classification due to a small proportion of anomalies in the dataset without employing one-hot encoding. The robustness of the embedding is evaluated by plotting the two most significant principle components of the embedded alarm logs, revealing that anomalies form distinct clusters with similar embeddings. Multivariate normal Gaussian clustering is then applied to these components, identifying clusters with a high ratio of anomalies to normal alarms (above 90%) and labeling them as the anomaly group. To classify new alarm logs, we check if their embedded vectors' two most significant principle components fall within the anomaly-labeled clusters. If so, the log is classified as an anomaly. Performance evaluation demonstrates that SEMC-AD outperforms conventional random forest and gradient boosting methods without embedding. SEMC-AD achieves 99% anomaly detection, whereas random forest and XGBoost only detect 86% and 81% of anomalies, respectively. While supervised classification methods may excel in labeled datasets, the results demonstrate that SEMC-AD is more efficient in classifying anomalies in datasets with numerous categorical features, significantly enhancing anomaly detection, reducing operator burden, and improving network maintenance.
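An illustrative reconstruction of the clustering stage described above, with synthetic vectors standing in for the learned embeddings: keep the two most significant principal components, fit a Gaussian mixture, and label clusters whose anomaly ratio exceeds 90% as the anomaly group.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(950, 32))       # stand-in embeddings
anomalies = rng.normal(4.0, 0.5, size=(50, 32))
X = np.vstack([normal, anomalies])
y = np.array([0] * 950 + [1] * 50)                  # 1 = anomalous alarm log

Z = PCA(n_components=2).fit_transform(X)            # two principal components
gmm = GaussianMixture(n_components=5, random_state=0).fit(Z)
cluster = gmm.predict(Z)

anomaly_clusters = [c for c in range(5)
                    if y[cluster == c].mean() > 0.9]  # >90% anomalies

def classify(z_new: np.ndarray) -> bool:
    """A new log is anomalous if it falls in an anomaly-labeled cluster."""
    return gmm.predict(z_new)[0] in anomaly_clusters

print("anomaly-labeled clusters:", anomaly_clusters)
print("new log anomalous?", classify(Z[-1:].copy()))
```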
Correlated Noise Provably Beats Independent Noise for Differentially Private Learning
results: Compared with vanilla DP-SGD, correlated noise provably improves learning utility, and the analytical correlation function avoids the cubic complexity of prior approaches. Experiments validate the theory.
Abstract
Differentially private learning algorithms inject noise into the learning process. While the most common private learning algorithm, DP-SGD, adds independent Gaussian noise in each iteration, recent work on matrix factorization mechanisms has shown empirically that introducing correlations in the noise can greatly improve their utility. We characterize the asymptotic learning utility for any choice of the correlation function, giving precise analytical bounds for linear regression and as the solution to a convex program for general convex functions. We show, using these bounds, how correlated noise provably improves upon vanilla DP-SGD as a function of problem parameters such as the effective dimension and condition number. Moreover, our analytical expression for the near-optimal correlation function circumvents the cubic complexity of the semi-definite program used to optimize the noise correlation matrix in previous work. We validate our theory with experiments on private deep learning. Our work matches or outperforms prior work while being efficient both in terms of compute and memory.
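A toy illustration of why correlated noise can help, not the paper's actual mechanism: with noise of the form z_t - beta * z_{t-1}, part of what is injected at one step is cancelled at the next, so less noise accumulates in the final iterate. The comparison below is at matched noise scale rather than matched privacy, and omits per-example gradient clipping and privacy accounting; all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, steps, lr, sigma, beta = 10, 500, 0.1, 0.5, 0.7
w_true = rng.normal(size=d)

def grad(w):                       # gradient of a simple quadratic objective
    return w - w_true

w_indep, w_corr = np.zeros(d), np.zeros(d)
z_prev = np.zeros(d)
for _ in range(steps):
    # Independent Gaussian noise, as in vanilla DP-SGD.
    w_indep -= lr * (grad(w_indep) + sigma * rng.normal(size=d))
    # Anticorrelated noise: successive injections partially cancel.
    z = rng.normal(size=d)
    w_corr -= lr * (grad(w_corr) + sigma * (z - beta * z_prev))
    z_prev = z

print("independent-noise error:", np.linalg.norm(w_indep - w_true))
print("correlated-noise  error:", np.linalg.norm(w_corr - w_true))
```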
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
results: Given a codebase and an issue description, both state-of-the-art proprietary models and the fine-tuned SWE-Llama model can resolve only the simplest issues; Claude 2 and GPT-4 solve a mere $4.8\%$ and $1.7\%$ of instances, respectively.
Abstract
Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We consider real-world software engineering to be a rich, sustainable, and challenging testbed for evaluating the next generation of language models. We therefore introduce SWE-bench, an evaluation framework including $2,294$ software engineering problems drawn from real GitHub issues and corresponding pull requests across $12$ popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue. Resolving issues in SWE-bench frequently requires understanding and coordinating changes across multiple functions, classes, and even files simultaneously, calling for models to interact with execution environments, process extremely long contexts and perform complex reasoning that goes far beyond traditional code generation. Our evaluations show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues. Claude 2 and GPT-4 solve a mere $4.8$% and $1.7$% of instances respectively, even when provided with an oracle retriever. Advances on SWE-bench represent steps towards LMs that are more practical, intelligent, and autonomous.
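The benchmark is distributed via the Hugging Face Hub; a minimal look at one task instance is sketched below (repo id and field names follow the public dataset card and should be treated as assumptions):

```python
from datasets import load_dataset

swe = load_dataset("princeton-nlp/SWE-bench", split="test")
task = swe[0]
print(task["repo"], task["instance_id"])
print(task["problem_statement"][:300])   # the GitHub issue text
# A model must produce a patch for `repo` at `base_commit` that resolves the
# issue; evaluation applies the patch and runs the repository's tests.
```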
results: Extensive experiments on benchmark datasets demonstrate that $\mathbf{FABind}$ predicts protein-ligand binding structures with higher accuracy and faster speed than existing methods.
Abstract
Modeling the interaction between proteins and ligands and accurately predicting their binding structures is a critical yet challenging task in drug discovery. Recent advancements in deep learning have shown promise in addressing this challenge, with sampling-based and regression-based methods emerging as two prominent approaches. However, these methods have notable limitations. Sampling-based methods often suffer from low efficiency due to the need for generating multiple candidate structures for selection. On the other hand, regression-based methods offer fast predictions but may experience decreased accuracy. Additionally, the variation in protein sizes often requires external modules for selecting suitable binding pockets, further impacting efficiency. In this work, we propose $\mathbf{FABind}$, an end-to-end model that combines pocket prediction and docking to achieve accurate and fast protein-ligand binding. $\mathbf{FABind}$ incorporates a unique ligand-informed pocket prediction module, which is also leveraged for docking pose estimation. The model further enhances the docking process by incrementally integrating the predicted pocket to optimize protein-ligand binding, reducing discrepancies between training and inference. Through extensive experiments on benchmark datasets, our proposed $\mathbf{FABind}$ demonstrates strong advantages in terms of effectiveness and efficiency compared to existing methods. Our code is available at $\href{https://github.com/QizhiPei/FABind}{Github}$.
Going Beyond Neural Network Feature Similarity: The Network Feature Complexity and Its Interpretation Using Category Theory
results: The study finds that functionally equivalent features are widespread among the features learned by neural networks, and that the Iterative Feature Merging (IFM) algorithm can reduce the number of network parameters without affecting performance. Several interesting empirical findings are also reported, such as the relationship between feature complexity and network performance.
Abstract
The behavior of neural networks still remains opaque, and a recently widely noted phenomenon is that networks often achieve similar performance when initialized with different random parameters. This phenomenon has attracted significant attention in measuring the similarity between features learned by distinct networks. However, feature similarity could be vague in describing the same feature since equivalent features hardly exist. In this paper, we expand the concept of equivalent feature and provide the definition of what we call functionally equivalent features. These features produce equivalent output under certain transformations. Using this definition, we aim to derive a more intrinsic metric for the so-called feature complexity regarding the redundancy of features learned by a neural network at each layer. We offer a formal interpretation of our approach through the lens of category theory, a well-developed area in mathematics. To quantify the feature complexity, we further propose an efficient algorithm named Iterative Feature Merging. Our experimental results validate our ideas and theories from various perspectives. We empirically demonstrate that functional equivalence widely exists among different features learned by the same neural network and we could reduce the number of parameters of the network without affecting the performance. The IFM shows great potential as a data-agnostic model pruning method. We have also drawn several interesting empirical findings regarding the defined feature complexity.
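An illustrative sketch of the merging idea on a single linear layer, under the simplifying assumption that near-parallel incoming weight vectors indicate functionally equivalent units: the pair with the highest cosine similarity is merged by averaging, and the consumer layer's corresponding input weights are summed so the network's function is approximately preserved. This is a toy reading of iterative feature merging, not the paper's exact procedure.

```python
import numpy as np

def merge_once(W1: np.ndarray, W2: np.ndarray, thresh: float = 0.95):
    """W1: (hidden, in) weights producing features; W2: (out, hidden) consumer."""
    U = W1 / np.linalg.norm(W1, axis=1, keepdims=True)
    sim = U @ U.T
    np.fill_diagonal(sim, -1.0)
    i, j = np.unravel_index(np.argmax(sim), sim.shape)
    if sim[i, j] < thresh:
        return W1, W2, False                        # nothing left to merge
    keep = [k for k in range(W1.shape[0]) if k != j]
    W1_new = W1[keep].copy()
    W1_new[keep.index(i)] = 0.5 * (W1[i] + W1[j])   # merged feature
    W2_new = W2[:, keep].copy()
    W2_new[:, keep.index(i)] = W2[:, i] + W2[:, j]  # sum consumer weights
    return W1_new, W2_new, True

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4)); W1[5] = W1[2] * 1.01  # plant a redundant unit
W2 = rng.normal(size=(3, 8))
W1, W2, merged = merge_once(W1, W2)
print(merged, W1.shape, W2.shape)                   # True (7, 4) (3, 7)
```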
Comparing AI Algorithms for Optimizing Elliptic Curve Cryptography Parameters in Third-Party E-Commerce Integrations: A Pre-Quantum Era Analysis
results: GA and PSO show different strengths for ECC parameter optimization, with GA performing better in precision and PSO in stability. In a simulated e-commerce environment, ECC parameters optimized by GA and PSO are compared against secp256k1, demonstrating the effectiveness of both algorithms for ECC parameter optimization.
Abstract
This paper presents a comparative analysis between the Genetic Algorithm (GA) and Particle Swarm Optimization (PSO), two vital artificial intelligence algorithms, focusing on optimizing Elliptic Curve Cryptography (ECC) parameters. These encompass the elliptic curve coefficients, prime number, generator point, group order, and cofactor. The study provides insights into which of the bio-inspired algorithms yields better optimization results for ECC configurations, examining performances under the same fitness function. This function incorporates methods to ensure robust ECC parameters, including assessing for singular or anomalous curves and applying Pollard's rho attack and Hasse's theorem for optimization precision. The optimized parameters generated by GA and PSO are tested in a simulated e-commerce environment, contrasting with well-known curves like secp256k1 during the transmission of order messages using Elliptic Curve-Diffie Hellman (ECDH) and Hash-based Message Authentication Code (HMAC). Focusing on traditional computing in the pre-quantum era, this research highlights the efficacy of GA and PSO in ECC optimization, with implications for enhancing cybersecurity in third-party e-commerce integrations. We recommend the immediate consideration of these findings before quantum computing's widespread adoption.
Geographic Location Encoding with Spherical Harmonics and Sinusoidal Representation Networks
results: Combining spherical harmonic basis functions with sinusoidal representation networks yields effective geographic feature representations, achieving state-of-the-art performance across various classification and regression tasks.
Abstract
Learning feature representations of geographical space is vital for any machine learning model that integrates geolocated data, spanning application domains such as remote sensing, ecology, or epidemiology. Recent work mostly embeds coordinates using sine and cosine projections based on Double Fourier Sphere (DFS) features -- these embeddings assume a rectangular data domain even on global data, which can lead to artifacts, especially at the poles. At the same time, relatively little attention has been paid to the exact design of the neural network architectures these functional embeddings are combined with. This work proposes a novel location encoder for globally distributed geographic data that combines spherical harmonic basis functions, natively defined on spherical surfaces, with sinusoidal representation networks (SirenNets) that can be interpreted as learned Double Fourier Sphere embedding. We systematically evaluate the cross-product of positional embeddings and neural network architectures across various classification and regression benchmarks and synthetic evaluation datasets. In contrast to previous approaches that require the combination of both positional encoding and neural networks to learn meaningful representations, we show that both spherical harmonics and sinusoidal representation networks are competitive on their own but set state-of-the-art performances across tasks when combined. We provide source code at www.github.com/marccoru/locationencoder
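A sketch of the positional-encoding half of the idea: a (longitude, latitude) coordinate is expanded into real spherical-harmonic basis values, which a downstream network such as a SirenNet would consume. The degree cap and the real-basis construction below are illustrative choices.

```python
import numpy as np
from scipy.special import sph_harm

def sh_features(lon_deg: float, lat_deg: float, max_degree: int = 3) -> np.ndarray:
    theta = np.deg2rad(lon_deg % 360.0)          # azimuth in [0, 2*pi)
    phi = np.deg2rad(90.0 - lat_deg)             # polar angle in [0, pi]
    feats = []
    for l in range(max_degree + 1):
        for m in range(-l, l + 1):
            y = sph_harm(abs(m), l, theta, phi)  # complex Y_l^{|m|}
            # Real spherical harmonics: imaginary part for m<0, real otherwise.
            feats.append(np.sqrt(2) * y.imag if m < 0
                         else (y.real if m == 0 else np.sqrt(2) * y.real))
    return np.array(feats)

print(sh_features(lon_deg=8.55, lat_deg=47.37).shape)  # (max_degree+1)^2 = 16
```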
Exploring Memorization in Fine-tuned Language Models
results: The study finds that fine-tuned memorization exhibits a strong disparity across tasks, along with a strong correlation between memorization and attention score distribution. Multi-task fine-tuning is also found to mitigate fine-tuned memorization.
Abstract
LLMs have shown great capabilities in various tasks but also exhibited memorization of training data, thus raising tremendous privacy and copyright concerns. While prior work has studied memorization during pre-training, the exploration of memorization during fine-tuning is rather limited. Compared with pre-training, fine-tuning typically involves sensitive data and diverse objectives, thus may bring unique memorization behaviors and distinct privacy risks. In this work, we conduct the first comprehensive analysis to explore LMs' memorization during fine-tuning across tasks. Our studies with open-sourced and our own fine-tuned LMs across various tasks indicate that fine-tuned memorization presents a strong disparity among tasks. We provide an understanding of this task disparity via sparse coding theory and unveil a strong correlation between memorization and attention score distribution. By investigating its memorization behavior, multi-task fine-tuning paves a potential strategy to mitigate fine-tuned memorization.
Quality Control at Your Fingertips: Quality-Aware Translation Models
paper_authors: Christian Tomani, David Vilar, Markus Freitag, Colin Cherry, Subhajit Naskar, Mara Finkelstein, Daniel Cremers
for: Improving the translation quality of neural machine translation (NMT) models.
methods: Trains the NMT model to estimate the quality of its own output, and uses this quality signal to guide decoding.
results: Using the internal quality estimate as a prompt during MAP decoding significantly improves translation quality; using it to prune the hypothesis space during MBR decoding further improves quality while reducing inference time by two orders of magnitude.
Abstract
Maximum-a-posteriori (MAP) decoding is the most widely used decoding strategy for neural machine translation (NMT) models. The underlying assumption is that model probability correlates well with human judgment, with better translations being more likely. However, research has shown that this assumption does not always hold, and decoding strategies which directly optimize a utility function, like Minimum Bayes Risk (MBR) or Quality-Aware decoding can significantly improve translation quality over standard MAP decoding. The main disadvantage of these methods is that they require an additional model to predict the utility, and additional steps during decoding, which makes the entire process computationally demanding. In this paper, we propose to make the NMT models themselves quality-aware by training them to estimate the quality of their own output. During decoding, we can use the model's own quality estimates to guide the generation process and produce the highest-quality translations possible. We demonstrate that the model can self-evaluate its own output during translation, eliminating the need for a separate quality estimation model. Moreover, we show that using this quality signal as a prompt during MAP decoding can significantly improve translation quality. When using the internal quality estimate to prune the hypothesis space during MBR decoding, we can not only further improve translation quality, but also reduce inference speed by two orders of magnitude.
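A minimal sketch of MBR decoding with quality-based pruning in the spirit described above: candidates are first pruned using a quality score (here a toy array standing in for the model's self-estimate), and the survivor maximizing expected utility against the remaining hypotheses is selected. Token-set overlap stands in for a real utility such as BLEU or a learned metric.

```python
import numpy as np

candidates = ["the cat sat on the mat", "a cat sat on a mat",
              "the cat is sitting on the mat", "mat the on sat cat"]
quality = np.array([0.9, 0.8, 0.85, 0.1])   # stand-in self-estimated quality

def utility(hyp: str, ref: str) -> float:
    h, r = set(hyp.split()), set(ref.split())
    return len(h & r) / len(h | r)           # Jaccard overlap of tokens

keep = [i for i, q in enumerate(quality) if q >= 0.5]  # prune hypothesis space
scores = [np.mean([utility(candidates[i], candidates[j])
                   for j in keep if j != i]) for i in keep]
best = keep[int(np.argmax(scores))]
print("MBR choice:", candidates[best])
```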
DeepLSH: Deep Locality-Sensitive Hash Learning for Fast and Efficient Near-Duplicate Crash Report Detection
results: LSH and DeepLSH reduce the time needed for near-duplicate similarity search over crash reports while preserving search accuracy. The study also releases an original dataset to further validate these results.
Abstract
Automatic crash bucketing is a crucial phase in the software development process for efficiently triaging bug reports. It generally consists in grouping similar reports through clustering techniques. However, with real-time streaming bug collection, systems are needed to quickly answer the question: What are the most similar bugs to a new one?, that is, efficiently find near-duplicates. It is thus natural to consider nearest neighbors search to tackle this problem and especially the well-known locality-sensitive hashing (LSH) to deal with large datasets due to its sublinear performance and theoretical guarantees on the similarity search accuracy. Surprisingly, LSH has not been considered in the crash bucketing literature. It is indeed not trivial to derive hash functions that satisfy the so-called locality-sensitive property for the most advanced crash bucketing metrics. Consequently, we study in this paper how to leverage LSH for this task. To be able to consider the most relevant metrics used in the literature, we introduce DeepLSH, a Siamese DNN architecture with an original loss function, that perfectly approximates the locality-sensitivity property even for Jaccard and Cosine metrics for which exact LSH solutions exist. We support this claim with a series of experiments on an original dataset, which we make available.
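For background, a minimal sketch of the classical LSH setup that DeepLSH generalizes: MinHash LSH indexes crash reports (here reduced to token sets) so near-duplicates of a new report are retrieved in sublinear time. The `datasketch` library, example reports, and threshold are illustrative choices.

```python
from datasketch import MinHash, MinHashLSH

reports = {
    "crash-1": "null pointer exception in render loop frame buffer",
    "crash-2": "null pointer exception in render loop frame swap",
    "crash-3": "out of memory while loading texture atlas",
}

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in set(text.split()):
        m.update(token.encode("utf8"))
    return m

lsh = MinHashLSH(threshold=0.5, num_perm=128)
for key, text in reports.items():
    lsh.insert(key, minhash(text))

new_report = "null pointer exception in render loop frame"
print(lsh.query(minhash(new_report)))   # likely ['crash-1', 'crash-2']
```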
Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
paper_authors: Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen
for: This paper aims to develop a cost-effective approach for building smaller language models (LLMs) from pre-trained, larger models.
methods: The approach uses two key techniques: targeted structured pruning and dynamic batch loading. Targeted structured pruning prunes the larger model to a specified target shape, while dynamic batch loading dynamically updates the composition of sampled data in each training batch based on varying losses across different domains.
results: The Sheared-LLaMA series, pruned from the LLaMA2-7B model, outperforms state-of-the-art open-source models of equivalent sizes on a wide range of downstream and instruction tuning evaluations, while requiring only 3% of the compute required to train such models from scratch.
Abstract
The popularity of LLaMA (Touvron et al., 2023a;b) and other recently emerged moderate-sized large language models (LLMs) highlights the potential of building smaller yet powerful LLMs. Regardless, the cost of training such models from scratch on trillions of tokens remains high. In this work, we study structured pruning as an effective means to develop smaller LLMs from pre-trained, larger models. Our approach employs two key techniques: (1) targeted structured pruning, which prunes a larger model to a specified target shape by removing layers, heads, and intermediate and hidden dimensions in an end-to-end manner, and (2) dynamic batch loading, which dynamically updates the composition of sampled data in each training batch based on varying losses across different domains. We demonstrate the efficacy of our approach by presenting the Sheared-LLaMA series, pruning the LLaMA2-7B model down to 1.3B and 2.7B parameters. Sheared-LLaMA models outperform state-of-the-art open-source models of equivalent sizes, such as Pythia, INCITE, and OpenLLaMA models, on a wide range of downstream and instruction tuning evaluations, while requiring only 3% of compute compared to training such models from scratch. This work provides compelling evidence that leveraging existing LLMs with structured pruning is a far more cost-effective approach for building smaller LLMs.
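A sketch of the dynamic batch loading idea: the sampling weight of each data domain is re-derived from how far its current loss is from a reference loss, so domains that are lagging get sampled more. The reference losses and the softmax-style update below are illustrative, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
domains = ["web", "code", "books", "wiki"]
reference_loss = np.array([2.0, 1.5, 2.2, 1.8])  # e.g. from the source model

def batch_weights(current_loss: np.ndarray) -> np.ndarray:
    gap = np.maximum(current_loss - reference_loss, 0.0)  # excess loss
    w = np.exp(gap)                                       # softmax over gaps
    return w / w.sum()

current_loss = np.array([2.6, 1.6, 2.3, 1.8])
w = batch_weights(current_loss)
print(dict(zip(domains, np.round(w, 3))))

# Each training batch is then composed by sampling domains according to w:
batch_domains = rng.choice(domains, size=8, p=w)
print(batch_domains)
```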
Meta-CoT: Generalizable Chain-of-Thought Prompting in Mixed-task Scenarios with Large Language Models
results: Experiments show that Meta-CoT achieves remarkable performance on ten public benchmark reasoning tasks together with superior generalization. Notably, Meta-CoT reaches a state-of-the-art 93.7% on SVAMP without any additional program-aided methods.
Abstract
Large language models (LLMs) have unveiled remarkable reasoning capabilities by exploiting chain-of-thought (CoT) prompting, which generates intermediate reasoning chains to serve as the rationale for deriving the answer. However, current CoT methods either simply employ general prompts such as Let's think step by step, or heavily rely on handcrafted task-specific demonstrations to attain preferable performances, thereby engendering an inescapable gap between performance and generalization. To bridge this gap, we propose Meta-CoT, a generalizable CoT prompting method in mixed-task scenarios where the type of input questions is unknown. Meta-CoT firstly categorizes the scenario based on the input question and subsequently constructs diverse demonstrations from the corresponding data pool in an automatic pattern. Meta-CoT simultaneously enjoys remarkable performances on ten public benchmark reasoning tasks and superior generalization capabilities. Notably, Meta-CoT achieves the state-of-the-art result on SVAMP (93.7%) without any additional program-aided methods. Our further experiments on five out-of-distribution datasets verify the stability and generality of Meta-CoT.
Benchmarking and Explaining Large Language Model-based Code Generation: A Causality-Centric Approach
results: The results show that the approach can provide insights that help end-users understand LLM predictions, and that properly calibrating the prompt can improve the quality of LLM-generated code.
Abstract
While code generation has been widely used in various software development scenarios, the quality of the generated code is not guaranteed. This has been a particular concern in the era of large language models (LLMs)- based code generation, where LLMs, deemed a complex and powerful black-box model, is instructed by a high-level natural language specification, namely a prompt, to generate code. Nevertheless, effectively evaluating and explaining the code generation capability of LLMs is inherently challenging, given the complexity of LLMs and the lack of transparency. Inspired by the recent progress in causality analysis and its application in software engineering, this paper launches a causality analysis-based approach to systematically analyze the causal relations between the LLM input prompts and the generated code. To handle various technical challenges in this study, we first propose a novel causal graph-based representation of the prompt and the generated code, which is established over the fine-grained, human-understandable concepts in the input prompts. The formed causal graph is then used to identify the causal relations between the prompt and the derived code. We illustrate the insights that our framework can provide by studying over 3 popular LLMs with over 12 prompt adjustment strategies. The results of these studies illustrate the potential of our technique to provide insights into LLM effectiveness, and aid end-users in understanding predictions. Additionally, we demonstrate that our approach provides actionable insights to improve the quality of the LLM-generated code by properly calibrating the prompt.
Unlock the Potential of Counterfactually-Augmented Data in Out-Of-Distribution Generalization
methods: Uses Counterfactually-Augmented Data (CAD), analyzes the myopia phenomenon in feature space from the perspective of Fisher's Linear Discriminant, and introduces two constraints based on CAD's structural properties (dataset-level and sentence-level) to help language models extract more complete causal features.
results: On Sentiment Analysis and Natural Language Inference, experiments show the method improves the OOD generalization performance of language models by 1.0% to 5.9%.
Abstract
Counterfactually-Augmented Data (CAD) -- minimal editing of sentences to flip the corresponding labels -- has the potential to improve the Out-Of-Distribution (OOD) generalization capability of language models, as CAD induces language models to exploit domain-independent causal features and exclude spurious correlations. However, the empirical results of CAD's OOD generalization are not as efficient as anticipated. In this study, we attribute the inefficiency to the myopia phenomenon caused by CAD: language models only focus on causal features that are edited in the augmentation operation and exclude other non-edited causal features. Therefore, the potential of CAD is not fully exploited. To address this issue, we analyze the myopia phenomenon in feature space from the perspective of Fisher's Linear Discriminant, then we introduce two additional constraints based on CAD's structural properties (dataset-level and sentence-level) to help language models extract more complete causal features in CAD, thereby mitigating the myopia phenomenon and improving OOD generalization capability. We evaluate our method on two tasks: Sentiment Analysis and Natural Language Inference, and the experimental results demonstrate that our method could unlock the potential of CAD and improve the OOD generalization performance of language models by 1.0% to 5.9%.
Assessing the Impact of a Supervised Classification Filter on Flow-based Hybrid Network Anomaly Detection
results: The experiments show that the hybrid approach offers a higher detection rate for known attacks while retaining the ability to detect zero-day attacks. The supervised binary prefilter increases the AUC metric by over 11% and detects 30% more attacks while keeping the number of false positives approximately the same.
Abstract
Constant evolution and the emergence of new cyberattacks require the development of advanced techniques for defense. This paper aims to measure the impact of a supervised filter (classifier) in network anomaly detection. We perform our experiments by employing a hybrid anomaly detection approach in network flow data. For this purpose, we extended a state-of-the-art autoencoder-based anomaly detection method by prepending a binary classifier acting as a prefilter for the anomaly detector. The method was evaluated on the publicly available real-world dataset UGR'16. Our empirical results indicate that the hybrid approach does offer a higher detection rate of known attacks than a standalone anomaly detector while still retaining the ability to detect zero-day attacks. Employing a supervised binary prefilter has increased the AUC metric by over 11%, detecting 30% more attacks while keeping the number of false positives approximately the same.
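A minimal sketch of the hybrid pipeline evaluated above: a supervised classifier prefilters flows, and only flows it passes as benign are scored by an unsupervised detector. Here a PCA reconstruction error stands in for the autoencoder, and all data and thresholds are synthetic placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(2000, 20))
y_train = (X_train[:, 0] + X_train[:, 1] > 2.5).astype(int)  # 1 = known attack

prefilter = RandomForestClassifier(random_state=0).fit(X_train, y_train)
pca = PCA(n_components=5).fit(X_train[y_train == 0])          # benign profile

def detect(X: np.ndarray) -> np.ndarray:
    flagged = prefilter.predict(X).astype(bool)               # known attacks
    benign = X[~flagged]
    recon = pca.inverse_transform(pca.transform(benign))
    err = np.linalg.norm(benign - recon, axis=1)              # anomaly score
    flagged[~flagged] = err > np.quantile(err, 0.99)          # zero-day candidates
    return flagged

X_test = rng.normal(size=(500, 20))
print("alerts:", detect(X_test).sum())
```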
methods: The paper proposes Diversity from Human Feedback (DivHF), which learns a behavior descriptor consistent with human preference by querying human feedback; the learned descriptor can be combined with any distance measure to define a diversity measure.
results: Experiments show that DivHF learns a behavior space that aligns better with human requirements and leads to more diverse solutions under human preference.
Abstract
Diversity plays a significant role in many problems, such as ensemble learning, reinforcement learning, and combinatorial optimization. How to define the diversity measure is a longstanding problem. Many methods rely on expert experience to define a proper behavior space and then obtain the diversity measure, which is, however, challenging in many scenarios. In this paper, we propose the problem of learning a behavior space from human feedback and present a general method called Diversity from Human Feedback (DivHF) to solve it. DivHF learns a behavior descriptor consistent with human preference by querying human feedback. The learned behavior descriptor can be combined with any distance measure to define a diversity measure. We demonstrate the effectiveness of DivHF by integrating it with the Quality-Diversity optimization algorithm MAP-Elites and conducting experiments on the QDax suite. The results show that DivHF learns a behavior space that aligns better with human requirements compared to direct data-driven approaches and leads to more diverse solutions under human preference. Our contributions include formulating the problem, proposing the DivHF method, and demonstrating its effectiveness through experiments.
Topic-DPR: Topic-based Prompts for Dense Passage Retrieval
results: Experimental results on two datasets show that the method surpasses previous state-of-the-art retrieval techniques.
Abstract
Prompt-based learning's efficacy across numerous natural language processing tasks has led to its integration into dense passage retrieval. Prior research has mainly focused on enhancing the semantic understanding of pre-trained language models by optimizing a single vector as a continuous prompt. This approach, however, leads to a semantic space collapse; identical semantic information seeps into all representations, causing their distributions to converge in a restricted region. This hinders differentiation between relevant and irrelevant passages during dense retrieval. To tackle this issue, we present Topic-DPR, a dense passage retrieval model that uses topic-based prompts. Unlike the single prompt method, multiple topic-based prompts are established over a probabilistic simplex and optimized simultaneously through contrastive learning. This encourages representations to align with their topic distributions, improving space uniformity. Furthermore, we introduce a novel positive and negative sampling strategy, leveraging semi-structured data to boost dense retrieval efficiency. Experimental results from two datasets affirm that our method surpasses previous state-of-the-art retrieval techniques.
results: SOTA results on the DDBP2 problem (estimating the number of tricks for two given hands).
Abstract
Contract bridge is a game characterized by incomplete information, posing an exciting challenge for artificial intelligence methods. This paper proposes the BridgeHand2Vec approach, which leverages a neural network to embed a bridge player's hand (consisting of 13 cards) into a vector space. The resulting representation reflects the strength of the hand in the game and enables interpretable distances to be determined between different hands. This representation is derived by training a neural network to estimate the number of tricks that a pair of players can take. In the remainder of this paper, we analyze the properties of the resulting vector space and provide examples of its application in reinforcement learning, and opening bid classification. Although this was not our main goal, the neural network used for the vectorization achieves SOTA results on the DDBP2 problem (estimating the number of tricks for two given hands).
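A conceptual sketch of the setup: each 13-card hand is one-hot encoded over the 52-card deck, a small network predicts the number of tricks for a pair of hands, and the shared encoder's output serves as the hand embedding. Architecture sizes and the card indexing are illustrative assumptions.

```python
import torch
import torch.nn as nn

RANKS = "23456789TJQKA"
SUITS = "SHDC"
CARD_INDEX = {s + r: 4 * RANKS.index(r) + SUITS.index(s)
              for s in SUITS for r in RANKS}

def encode_hand(cards: list) -> torch.Tensor:
    v = torch.zeros(52)
    for c in cards:
        v[CARD_INDEX[c]] = 1.0
    return v

class Hand2Vec(nn.Module):
    def __init__(self, dim: int = 16):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(52, 64), nn.ReLU(),
                                   nn.Linear(64, dim))
        self.head = nn.Linear(2 * dim, 14)   # 0..13 tricks for the pair

    def forward(self, hand_a, hand_b):
        ea, eb = self.embed(hand_a), self.embed(hand_b)  # embeddings live here
        return self.head(torch.cat([ea, eb], dim=-1))

model = Hand2Vec()
h = encode_hand(["SA", "SK", "SQ", "HJ", "HT", "H9", "D8", "D7",
                 "D6", "C5", "C4", "C3", "C2"])
print(model.embed(h).shape)   # the 16-dim vector representation of the hand
```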
V2X-AHD:Vehicle-to-Everything Cooperation Perception via Asymmetric Heterogenous Distillation Network
results: Applying the algorithm to the large-scale open dataset V2Xset achieves state-of-the-art results. V2X-AHD effectively improves the accuracy of 3D object detection while reducing the number of network parameters, and serves as a benchmark for cooperative perception.
Abstract
Object detection is the central issue of intelligent traffic systems, and recent advancements in single-vehicle lidar-based 3D detection indicate that it can provide accurate position information for intelligent agents to make decisions and plan. Compared with single-vehicle perception, multi-view vehicle-road cooperative perception has fundamental advantages, such as the elimination of blind spots and a broader range of perception, and has become a research hotspot. However, current cooperative perception research focuses on increasing the complexity of fusion while ignoring the fundamental problems caused by the absence of single-view outlines. We propose a multi-view vehicle-road cooperation perception system, vehicle-to-everything cooperative perception (V2X-AHD), in order to enhance the identification capability, particularly for predicting the vehicle's shape. At first, we propose an asymmetric heterogeneous distillation network fed with different training data to improve the accuracy of contour recognition, with multi-view teacher features transferring to single-view student features. Since point cloud data are sparse, we propose Spara Pillar, a sparse convolution-based plug-in feature extraction backbone, to reduce the number of parameters and enhance feature extraction capabilities. Moreover, we leverage multi-head self-attention (MSA) to fuse the single-view features, and the lightweight design makes the fused feature a smooth expression. The results of applying our algorithm to the massive open dataset V2Xset demonstrate that our method achieves the state-of-the-art result. V2X-AHD can effectively improve the accuracy of 3D object detection and reduce the number of network parameters, according to this study, which serves as a benchmark for cooperative perception. The code for this article is available at https://github.com/feeling0414-lab/V2X-AHD.
A Black-Box Physics-Informed Estimator based on Gaussian Process Regression for Robot Inverse Dynamics Identification
results: Experiments in simulation and on two real robotic manipulators, a 7 DOF Franka Emika Panda and a 6 DOF MELFA RV4FL, show that the proposed model outperforms state-of-the-art black-box estimators based on Gaussian processes and neural networks in terms of accuracy, generality, and data efficiency. The experiments on the MELFA robot further show that the approach achieves performance comparable to fine-tuned model-based estimators despite requiring less prior information.
Abstract
In this paper, we propose a black-box model based on Gaussian process regression for the identification of the inverse dynamics of robotic manipulators. The proposed model relies on a novel multidimensional kernel, called \textit{Lagrangian Inspired Polynomial} (LIP) kernel. The LIP kernel is based on two main ideas. First, instead of directly modeling the inverse dynamics components, we model as GPs the kinetic and potential energy of the system. The GP prior on the inverse dynamics components is derived from those on the energies by applying the properties of GPs under linear operators. Second, as regards the energy prior definition, we prove a polynomial structure of the kinetic and potential energy, and we derive a polynomial kernel that encodes this property. As a consequence, the proposed model allows also to estimate the kinetic and potential energy without requiring any label on these quantities. Results on simulation and on two real robotic manipulators, namely a 7 DOF Franka Emika Panda and a 6 DOF MELFA RV4FL, show that the proposed model outperforms state-of-the-art black-box estimators based both on Gaussian Processes and Neural Networks in terms of accuracy, generality and data efficiency. The experiments on the MELFA robot also demonstrate that our approach achieves performance comparable to fine-tuned model-based estimators, despite requiring less prior information.
paper_authors: Olaf Lipinski, Adam J. Sobey, Federico Cerutti, Timothy J. Norman
for: Studying the emergence of a vocabulary for temporal references in emergent communication.
methods: Uses a different agent architecture to test whether temporal references can emerge naturally in emergent communication.
results: The experimental analysis shows that this agent architecture is sufficient for the natural emergence of temporal references and that no additional losses are necessary. The readily transferable architectural insights provide a basis for incorporating temporal referencing into other emergent communication environments.
Abstract
As humans, we use linguistic elements referencing time, such as before or tomorrow, to easily share past experiences and future predictions. While temporal aspects of the language have been considered in computational linguistics, no such exploration has been done within the field of emergent communication. We research this gap, providing the first reported temporal vocabulary within emergent communication literature. Our experimental analysis shows that a different agent architecture is sufficient for the natural emergence of temporal references, and that no additional losses are necessary. Our readily transferable architectural insights provide the basis for the incorporation of temporal referencing into other emergent communication environments.
Automated clinical coding using off-the-shelf large language models
methods: Uses off-the-shelf pre-trained large language models (LLMs) to assign ICD codes, framing the task as information extraction with a hierarchical search over the ICD ontology. In a second 'meta-refinement' stage, GPT-4 selects a subset of the relevant labels as predictions.
results: Evaluated on the CodiEsp dataset of ICD-coded clinical case documents, the method achieves state-of-the-art performance on rarer classes, with the best macro-F1 of 0.225 and a micro-F1 of 0.157, compared with 0.216 and 0.219 for PLM-ICD. The approach requires no task-specific learning, making it the first such method for automated ICD coding.
Abstract
The task of assigning diagnostic ICD codes to patient hospital admissions is typically performed by expert human coders. Efforts towards automated ICD coding are dominated by supervised deep learning models. However, difficulties in learning to predict the large number of rare codes remain a barrier to adoption in clinical practice. In this work, we leverage off-the-shelf pre-trained generative large language models (LLMs) to develop a practical solution that is suitable for zero-shot and few-shot code assignment. Unsupervised pre-training alone does not guarantee precise knowledge of the ICD ontology and specialist clinical coding task, therefore we frame the task as information extraction, providing a description of each coded concept and asking the model to retrieve related mentions. For efficiency, rather than iterating over all codes, we leverage the hierarchical nature of the ICD ontology to sparsely search for relevant codes. Then, in a second stage, which we term 'meta-refinement', we utilise GPT-4 to select a subset of the relevant labels as predictions. We validate our method using Llama-2, GPT-3.5 and GPT-4 on the CodiEsp dataset of ICD-coded clinical case documents. Our tree-search method achieves state-of-the-art performance on rarer classes, achieving the best macro-F1 of 0.225, whilst achieving slightly lower micro-F1 of 0.157, compared to 0.216 and 0.219 respectively from PLM-ICD. To the best of our knowledge, this is the first method for automated ICD coding requiring no task-specific learning.
Rationale-Enhanced Language Models are Better Continual Relation Learners
results: Experimental results show that the method outperforms state-of-the-art CRE models on two standard benchmarks. Abstract
Continual relation extraction (CRE) aims to solve the problem of catastrophic forgetting when learning a sequence of newly emerging relations. Recent CRE studies have found that catastrophic forgetting arises from the model's lack of robustness against future analogous relations. To address the issue, we introduce rationale, i.e., the explanations of relation classification results generated by large language models (LLM), into CRE task. Specifically, we design the multi-task rationale tuning strategy to help the model learn current relations robustly. We also conduct contrastive rationale replay to further distinguish analogous relations. Experimental results on two standard benchmarks demonstrate that our method outperforms the state-of-the-art CRE models.
Realizing Stabilized Landing for Computation-Limited Reusable Rockets: A Quantum Reinforcement Learning Approach
paper_authors: Gyu Seon Kim, JaeHyun Chung, Soohyun Park
for: This paper explores the application of quantum reinforcement learning to the control systems of reusable rockets.
methods: The paper uses quantum reinforcement learning to update the control system so that it adapts to changes in the rocket's dynamics.
results: The researchers find that quantum reinforcement learning offers higher computational efficiency, reduced memory requirements, and more stable performance, making it a strong candidate for the computation-limited control systems of reusable rockets. Abstract
The advent of reusable rockets has heralded a new era in space exploration, reducing the costs of launching satellites by a significant factor. Traditional rockets were disposable, but the design of reusable rockets for repeated use has revolutionized the financial dynamics of space missions. The most critical phase of reusable rockets is the landing stage, which involves managing the tremendous speed and attitude for safe recovery. The complexity of this task presents new challenges for control systems, specifically in terms of precision and adaptability. Classical control systems like the proportional-integral-derivative (PID) controller lack the flexibility to adapt to dynamic system changes, making controller redesign costly and time-consuming. This paper explores the integration of quantum reinforcement learning into the control systems of reusable rockets as a promising alternative. Unlike classical reinforcement learning, quantum reinforcement learning uses quantum bits that can exist in superposition, allowing for more efficient information encoding and reducing the number of parameters required. This leads to increased computational efficiency, reduced memory requirements, and more stable and predictable performance. Because reusable rockets must be light, they cannot carry heavy computers. In the reusable rocket scenario, quantum reinforcement learning, which has reduced memory requirements due to fewer parameters, is a good solution.
A Novel Contrastive Learning Method for Clickbait Detection on RoCliCo: A Romanian Clickbait Corpus of News Articles
results: The researchers manually annotated 8,313 news samples and ran experiments with four machine learning methods to establish a line-up of competitive baselines. They also propose a BERT-based contrastive learning model that learns a deep metric space over news titles and contents, so that non-clickbait title-content pairs have high cosine similarity while clickbait pairs have low similarity. Abstract
To increase revenue, news websites often resort to using deceptive news titles, luring users into clicking on the title and reading the full news. Clickbait detection is the task that aims to automatically detect this form of false advertisement and avoid wasting the precious time of online users. Despite the importance of the task, to the best of our knowledge, there is no publicly available clickbait corpus for the Romanian language. To this end, we introduce a novel Romanian Clickbait Corpus (RoCliCo) comprising 8,313 news samples which are manually annotated with clickbait and non-clickbait labels. Furthermore, we conduct experiments with four machine learning methods, ranging from handcrafted models to recurrent and transformer-based neural networks, to establish a line-up of competitive baselines. We also carry out experiments with a weighted voting ensemble. Among the considered baselines, we propose a novel BERT-based contrastive learning model that learns to encode news titles and contents into a deep metric space such that titles and contents of non-clickbait news have high cosine similarity, while titles and contents of clickbait news have low cosine similarity. Our data set and code to reproduce the baselines are publicly available for download at https://github.com/dariabroscoteanu/RoCliCo.
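A hedged sketch of the contrastive objective described above: encode titles and contents with a BERT encoder and train cosine similarity to be high for non-clickbait pairs and low for clickbait pairs. The margin-based loss form and the multilingual checkpoint are illustrative assumptions; the paper's exact loss may differ.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    return encoder(**batch).last_hidden_state[:, 0]  # [CLS] embeddings

def contrastive_loss(titles, contents, is_clickbait, margin=0.5):
    sim = F.cosine_similarity(embed(titles), embed(contents))
    # Convention of nn.CosineEmbeddingLoss: pull non-clickbait pairs
    # toward similarity 1, push clickbait pairs below the margin.
    pos = (1 - sim).clamp(min=0)
    neg = (sim - margin).clamp(min=0)
    return torch.where(is_clickbait.bool(), neg, pos).mean()
```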
Accelerating Monte Carlo Tree Search with Probability Tree State Abstraction
results: Integrated with state-of-the-art MCTS-based algorithms such as Sampled MuZero and Gumbel MuZero, the proposed PTSA algorithm reduces the search space by 10%-45% across different tasks and accelerates the training of these algorithms. Abstract
Monte Carlo Tree Search (MCTS) algorithms such as AlphaGo and MuZero have achieved superhuman performance in many challenging tasks. However, the computational complexity of MCTS-based algorithms is influenced by the size of the search space. To address this issue, we propose a novel probability tree state abstraction (PTSA) algorithm to improve the search efficiency of MCTS. A general tree state abstraction with path transitivity is defined. In addition, the probability tree state abstraction is proposed for fewer mistakes during the aggregation step. Furthermore, the theoretical guarantees of the transitivity and aggregation error bound are justified. To evaluate the effectiveness of the PTSA algorithm, we integrate it with state-of-the-art MCTS-based algorithms, such as Sampled MuZero and Gumbel MuZero. Experimental results on different tasks demonstrate that our method can accelerate the training process of state-of-the-art algorithms with 10%-45% search space reduction.
RK-core: An Established Methodology for Exploring the Hierarchical Structure within Datasets
results: Across several benchmark datasets, the study finds that samples with low coreness values are less representative of their categories, while high-coreness samples are more representative and contribute more to performance. It also finds that a high-quality coreset should exhibit hierarchical diversity rather than consist solely of the most representative samples. Abstract
Recently, the field of machine learning has undergone a transition from model-centric to data-centric. The advancements in diverse learning tasks have been propelled by the accumulation of more extensive datasets, subsequently facilitating the training of larger models on these datasets. However, these datasets remain relatively under-explored. To this end, we introduce a pioneering approach known as RK-core, to empower gaining a deeper understanding of the intricate hierarchical structure within datasets. Across several benchmark datasets, we find that samples with low coreness values appear less representative of their respective categories, and conversely, those with high coreness values exhibit greater representativeness. Correspondingly, samples with high coreness values make a more substantial contribution to the performance in comparison to those with low coreness values. Building upon this, we further employ RK-core to analyze the hierarchical structure of samples with different coreset selection methods. Remarkably, we find that a high-quality coreset should exhibit hierarchical diversity instead of solely opting for representative samples. The code is available at https://github.com/yaolu-zjut/Kcore.
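The RK-core procedure itself is not spelled out here, but if its coreness behaves like the classic graph k-core number, the idea can be sketched by building a k-nearest-neighbour graph over sample embeddings and ranking samples by core number. The kNN construction and the value of k below are illustrative assumptions.

```python
import networkx as nx
import numpy as np
from sklearn.neighbors import kneighbors_graph

def coreness_scores(features: np.ndarray, k: int = 10) -> dict:
    adj = kneighbors_graph(features, n_neighbors=k, mode="connectivity")
    graph = nx.from_scipy_sparse_array(adj.maximum(adj.T))  # symmetrize
    graph.remove_edges_from(nx.selfloop_edges(graph))       # core_number forbids loops
    return nx.core_number(graph)  # sample index -> coreness value

features = np.random.randn(200, 32)          # stand-in embeddings
scores = coreness_scores(features)
most_core = sorted(scores, key=scores.get, reverse=True)[:20]  # most representative
```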
Evaluation of ChatGPT Feedback on ELL Writers’ Coherence and Cohesion
paper_authors: Su-Youn Yoon, Eva Miszoglad, Lisa R. Pierce
for: The paper evaluates the effectiveness of ChatGPT in providing feedback on the coherence and cohesion of essays written by English Language Learner (ELL) students.
methods: The paper uses a two-step approach to evaluate the feedback generated by ChatGPT: each feedback sentence is first classified into subtypes based on its function (e.g., positive reinforcement, problem statement), and then its accuracy and usability are evaluated.
results: Most feedback sentences generated by ChatGPT are highly abstract and generic, failing to provide concrete suggestions for improvement. The accuracy of the feedback depends on superficial linguistic features and is often incorrect, indicating that ChatGPT, without specific training for the feedback generation task, does not offer effective feedback on ELL students' coherence and cohesion. Abstract
Since its launch in November 2022, ChatGPT has had a transformative effect on education where students are using it to help with homework assignments and teachers are actively employing it in their teaching practices. This includes using ChatGPT as a tool for writing teachers to grade and generate feedback on students' essays. In this study, we evaluated the quality of the feedback generated by ChatGPT regarding the coherence and cohesion of the essays written by English Language Learners (ELLs) students. We selected 50 argumentative essays and generated feedback on coherence and cohesion using the ELLIPSE rubric. During the feedback evaluation, we used a two-step approach: first, each sentence in the feedback was classified into subtypes based on its function (e.g., positive reinforcement, problem statement). Next, we evaluated its accuracy and usability according to these types. Both the analysis of feedback types and the evaluation of accuracy and usability revealed that most feedback sentences were highly abstract and generic, failing to provide concrete suggestions for improvement. The accuracy in detecting major problems, such as repetitive ideas and the inaccurate use of cohesive devices, depended on superficial linguistic features and was often incorrect. In conclusion, ChatGPT, without specific training for the feedback generation task, does not offer effective feedback on ELL students' coherence and cohesion.
Revisit Input Perturbation Problems for LLMs: A Unified Robustness Evaluation Framework for Noisy Slot Filling Task
results: Experimental results show that current open-source LLMs achieve only limited robustness to perturbations on real-world noisy data. Based on these observations, the authors offer forward-looking suggestions to advance research in this direction. Abstract
With the increasing capabilities of large language models (LLMs), these high-performance models have achieved state-of-the-art results on a wide range of natural language processing (NLP) tasks. However, the models' performance on commonly-used benchmark datasets often fails to accurately reflect their reliability and robustness when applied to real-world noisy data. To address these challenges, we propose a unified robustness evaluation framework based on the slot-filling task to systematically evaluate the dialogue understanding capability of LLMs in diverse input perturbation scenarios. Specifically, we construct a input perturbation evaluation dataset, Noise-LLM, which contains five types of single perturbation and four types of mixed perturbation data. Furthermore, we utilize a multi-level data augmentation method (character, word, and sentence levels) to construct a candidate data pool, and carefully design two ways of automatic task demonstration construction strategies (instance-level and entity-level) with various prompt templates. Our aim is to assess how well various robustness methods of LLMs perform in real-world noisy scenarios. The experiments have demonstrated that the current open-source LLMs generally achieve limited perturbation robustness performance. Based on these experimental observations, we make some forward-looking suggestions to fuel the research in this direction.
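Two of the perturbation levels described above are easy to illustrate. The functions below are generic character- and word-level corruptions for stress-testing slot filling, not the exact Noise-LLM transformations; the sentence-level paraphrase step is omitted.

```python
import random

def char_perturb(text: str, rate: float = 0.05) -> str:
    # Randomly replace letters to simulate typos (character level).
    chars = list(text)
    for i in range(len(chars)):
        if chars[i].isalpha() and random.random() < rate:
            chars[i] = random.choice("abcdefghijklmnopqrstuvwxyz")
    return "".join(chars)

def word_perturb(text: str, rate: float = 0.1) -> str:
    # Randomly drop words to simulate speech-to-text omissions (word level).
    words = text.split()
    kept = [w for w in words if random.random() > rate]
    return " ".join(kept) if kept else text

utterance = "book a flight from boston to denver tomorrow"
noisy = word_perturb(char_perturb(utterance))
print(noisy)
```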
MetaAgents: Simulating Interactions of Human Behaviors for LLM-based Task-oriented Coordination via Collaborative Generative Agents
results: The authors' evaluation shows that these collaborative generative agents perform promisingly in a simulated job fair environment, but it also exposes their limitations in more complex coordination tasks. Abstract
Significant advancements have occurred in the application of Large Language Models (LLMs) for various tasks and social simulations. Despite this, their capacities to coordinate within task-oriented social contexts are under-explored. Such capabilities are crucial if LLMs are to effectively mimic human-like social behavior and produce meaningful results. To bridge this gap, we introduce collaborative generative agents, endowing LLM-based Agents with consistent behavior patterns and task-solving abilities. We situate these agents in a simulated job fair environment as a case study to scrutinize their coordination skills. We propose a novel framework that equips collaborative generative agents with human-like reasoning abilities and specialized skills. Our evaluation demonstrates that these agents show promising performance. However, we also uncover limitations that hinder their effectiveness in more complex coordination tasks. Our work provides valuable insights into the role and evolution of LLMs in task-oriented social simulations.
Topological RANSAC for instance verification and retrieval without fine-tuning
results: Compared with the traditional SP method, the approach significantly improves retrieval performance in the non-fine-tuning setting and further enhances performance when used with fine-tuned features. It is also highly explainable and lightweight, suiting a wide range of real-world applications. Abstract
This paper presents an innovative approach to enhancing explainable image retrieval, particularly in situations where a fine-tuning set is unavailable. The widely-used SPatial verification (SP) method, despite its efficacy, relies on a spatial model and the hypothesis-testing strategy for instance recognition, leading to inherent limitations, including the assumption of planar structures and neglect of topological relations among features. To address these shortcomings, we introduce a pioneering technique that replaces the spatial model with a topological one within the RANSAC process. We propose bio-inspired saccade and fovea functions to verify the topological consistency among features, effectively circumventing the issues associated with SP's spatial model. Our experimental results demonstrate that our method significantly outperforms SP, achieving state-of-the-art performance in non-fine-tuning retrieval. Furthermore, our approach can enhance performance when used in conjunction with fine-tuned features. Importantly, our method retains high explainability and is lightweight, offering a practical and adaptable solution for a variety of real-world applications.
Memory efficient location recommendation through proximity-aware representation
results: Evaluations on three real-world Location-Based Social Networking (LBSN) datasets show that PASR surpasses state-of-the-art sequential location recommendation methods. Abstract
Sequential location recommendation plays a huge role in modern life, which can enhance user experience, bring more profit to businesses and assist in government administration. Although methods for location recommendation have evolved significantly thanks to the development of recommendation systems, there is still limited utilization of geographic information, along with the ongoing challenge of addressing data sparsity. In response, we introduce a Proximity-aware based region representation for Sequential Recommendation (PASR for short), built upon the Self-Attention Network architecture. We tackle the sparsity issue through a novel loss function employing importance sampling, which emphasizes informative negative samples during optimization. Moreover, PASR enhances the integration of geographic information by employing a self-attention-based geography encoder to the hierarchical grid and proximity grid at each GPS point. To further leverage geographic information, we utilize the proximity-aware negative samplers to enhance the quality of negative samples. We conducted evaluations using three real-world Location-Based Social Networking (LBSN) datasets, demonstrating that PASR surpasses state-of-the-art sequential location recommendation methods
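A minimal sketch of the importance-sampling idea for negatives: rather than drawing negative locations uniformly, sample them in proportion to their current model score so optimization concentrates on informative (hard) negatives. The softmax sampling distribution here is an assumption, not the paper's exact formulation.

```python
import torch

def sample_hard_negatives(user_emb, item_embs, positive_idx, num_neg=5):
    scores = item_embs @ user_emb                 # (num_items,) model scores
    scores[positive_idx] = float("-inf")          # never sample the positive
    probs = torch.softmax(scores, dim=0)          # high-scoring = informative
    return torch.multinomial(probs, num_neg, replacement=False)

user_emb = torch.randn(64)
item_embs = torch.randn(10_000, 64)
negatives = sample_hard_negatives(user_emb, item_embs, positive_idx=42)
```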
Understanding the Effects of RLHF on LLM Generalisation and Diversity
results: The study finds that RLHF generalises to new inputs better than SFT, especially as the distribution shift between training and test grows. However, RLHF significantly reduces output diversity across a variety of measures. These results can guide the choice of fine-tuning method and motivate further improvements to the generalisation-diversity trade-off in RLHF. Abstract
Large language models (LLMs) fine-tuned with reinforcement learning from human feedback (RLHF) have been used in some of the most widely deployed AI models to date, such as OpenAI's ChatGPT, Anthropic's Claude, or Meta's LLaMA-2. While there has been significant work developing these methods, our understanding of the benefits and downsides of each stage in RLHF is still limited. To fill this gap, we present an extensive analysis of how each stage of the process (i.e. supervised fine-tuning (SFT), reward modelling, and RLHF) affects two key properties: out-of-distribution (OOD) generalisation and output diversity. OOD generalisation is crucial given the wide range of real-world scenarios in which these models are being used, while output diversity refers to the model's ability to generate varied outputs and is important for a variety of use cases. We perform our analysis across two base models on both summarisation and instruction following tasks, the latter being highly relevant for current LLM use cases. We find that RLHF generalises better than SFT to new inputs, particularly as the distribution shift between train and test becomes larger. However, RLHF significantly reduces output diversity compared to SFT across a variety of measures, implying a tradeoff in current LLM fine-tuning methods between generalisation and diversity. Our results provide guidance on which fine-tuning method should be used depending on the application, and show that more research is needed to improve the trade-off between generalisation and diversity.
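One standard way to quantify the output-diversity effect reported above is distinct-n, the ratio of unique to total n-grams across a model's sampled outputs; a drop in distinct-n after RLHF would reflect the reduced diversity the paper measures. The paper uses several diversity measures; this is just one common choice.

```python
from collections import Counter

def distinct_n(outputs: list, n: int = 2) -> float:
    # Ratio of unique n-grams to total n-grams over all sampled outputs.
    ngrams = Counter()
    for text in outputs:
        tokens = text.split()
        ngrams.update(zip(*(tokens[i:] for i in range(n))))
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0

samples = ["the cat sat", "the cat slept", "a dog barked loudly"]
print(distinct_n(samples, n=2))  # lower values indicate less diverse outputs
```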
Constructive Large Language Models Alignment with Diverse Feedback
paper_authors: Tianshu Yu, Ting-En Lin, Yuchuan Wu, Min Yang, Fei Huang, Yongbin Li
for: Aligning large language models (LLMs) with human values to reduce the impact of harmful content.
methods: The authors introduce Constructive and Diverse Feedback (CDF), a method inspired by constructivist learning theory that collects three distinct types of feedback (critique feedback for easy problems, refinement feedback for medium problems, and preference feedback for hard problems) tailored to the difficulty levels within the training dataset.
results: Evaluations on three downstream tasks (question answering, dialog generation, and text summarization) show that CDF achieves better alignment performance than previous methods while using less training data. Abstract
In recent research on large language models (LLMs), there has been a growing emphasis on aligning these models with human values to reduce the impact of harmful content. However, current alignment methods often rely solely on singular forms of human feedback, such as preferences, annotated labels, or natural language critiques, overlooking the potential advantages of combining these feedback types. This limitation leads to suboptimal performance, even when ample training data is available. In this paper, we introduce Constructive and Diverse Feedback (CDF) as a novel method to enhance LLM alignment, inspired by constructivist learning theory. Our approach involves collecting three distinct types of feedback tailored to problems of varying difficulty levels within the training dataset. Specifically, we exploit critique feedback for easy problems, refinement feedback for medium problems, and preference feedback for hard problems. By training our model with this diversified feedback, we achieve enhanced alignment performance while using less training data. To assess the effectiveness of CDF, we evaluate it against previous methods in three downstream tasks: question answering, dialog generation, and text summarization. Experimental results demonstrate that CDF achieves superior performance even with a smaller training dataset.
Stepwise functional refoundation of relational concept analysis
results: RCA returns a single family of concept lattices, even though, when the data feature circular dependencies, other solutions may be considered acceptable. Abstract
Relational concept analysis (RCA) is an extension of formal concept analysis allowing to deal with several related contexts simultaneously. It has been designed for learning description logic theories from data and used within various applications. A puzzling observation about RCA is that it returns a single family of concept lattices although, when the data feature circular dependencies, other solutions may be considered acceptable. The semantics of RCA, provided in an operational way, does not shed light on this issue. In this report, we define these acceptable solutions as those families of concept lattices which belong to the space determined by the initial contexts (well-formed), cannot scale new attributes (saturated), and refer only to concepts of the family (self-supported). We adopt a functional view on the RCA process by defining the space of well-formed solutions and two functions on that space: one expansive and the other contractive. We show that the acceptable solutions are the common fixed points of both functions. This is achieved step-by-step by starting from a minimal version of RCA that considers only one single context defined on a space of contexts and a space of lattices. These spaces are then joined into a single space of context-lattice pairs, which is further extended to a space of indexed families of context-lattice pairs representing the objects manip
Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition
paper_authors: Srijith Radhakrishnan, Chao-Han Huck Yang, Sumeer Ahmad Khan, Rohit Kumar, Narsis A. Kiani, David Gomez-Cabrero, Jesper N. Tegner
for: This paper proposes a new cross-modal fusion technique for generative error correction in automatic speech recognition (ASR).
methods: The method leverages both acoustic information and external linguistic representations to generate accurate speech-to-text contexts, marking a paradigm shift towards generative error correction within the realm of n-best hypotheses.
results: Evaluations across diverse ASR datasets demonstrate the stability and reproducibility of the fusion technique, which achieves a 37.66% relative word error rate (WERR) improvement over the n-best hypotheses. Abstract
We introduce a new cross-modal fusion technique designed for generative error correction in automatic speech recognition (ASR). Our methodology leverages both acoustic information and external linguistic representations to generate accurate speech transcription contexts. This marks a step towards a fresh paradigm in generative error correction within the realm of n-best hypotheses. Unlike the existing ranking-based rescoring methods, our approach adeptly uses distinct initialization techniques and parameter-efficient algorithms to boost ASR performance derived from pre-trained speech and text models. Through evaluation across diverse ASR datasets, we evaluate the stability and reproducibility of our fusion technique, demonstrating its improved word error rate relative (WERR) performance in comparison to n-best hypotheses by relatively 37.66%. To encourage future research, we have made our code and pre-trained models open source at https://github.com/Srijith-rkr/Whispering-LLaMA.
Retromorphic Testing: A New Approach to the Test Oracle Problem
paper_authors: Boxi Yu, Qiuyang Mang, Qingshuo Guo, Pinjia He
for: This paper focuses on developing a novel black-box testing methodology called Retromorphic Testing, which is inspired by the mathematical concept of inverse functions. The purpose is to provide a non-intrusive and effective approach to testing software systems.
methods: The proposed method uses an auxiliary program in conjunction with the program under test, creating a dual-program structure. The input data is processed by the forward program, and then the output is reversed to its original input format using the backward program. The testing modes include using the auxiliary program as either the forward or backward program.
results: The paper presents three testing modes with illustrative use cases across diverse programs, including algorithms, traditional software, and AI applications. The method is demonstrated to be effective in revealing defects and bugs in the software systems under test.Abstract
A test oracle serves as a criterion or mechanism to assess the correspondence between software output and the anticipated behavior for a given input set. In automated testing, black-box techniques, known for their non-intrusive nature in test oracle construction, are widely used, including notable methodologies like differential testing and metamorphic testing. Inspired by the mathematical concept of inverse function, we present Retromorphic Testing, a novel black-box testing methodology. It leverages an auxiliary program in conjunction with the program under test, which establishes a dual-program structure consisting of a forward program and a backward program. The input data is first processed by the forward program and then its program output is reversed to its original input format using the backward program. In particular, the auxiliary program can operate as either the forward or backward program, leading to different testing modes. The process concludes by examining the relationship between the initial input and the transformed output within the input domain. For example, to test the implementation of the sine function $\sin(x)$, we can employ its inverse function, $\arcsin(x)$, and validate the equation $x = \sin(\arcsin(x)+2k\pi), \forall k \in \mathbb{Z}$. In addition to the high-level concept of Retromorphic Testing, this paper presents its three testing modes with illustrative use cases across diverse programs, including algorithms, traditional software, and AI applications.
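The sine example from the abstract translates directly into an executable retromorphic check: run the backward program (arcsin), shift by 2kπ, run the forward program (sin), and assert the original input is recovered.

```python
import math

def retromorphic_sin_check(x: float, k: int, tol: float = 1e-9) -> bool:
    # Validates x == sin(asin(x) + 2*k*pi) for all integers k.
    assert -1.0 <= x <= 1.0, "asin is only defined on [-1, 1]"
    return math.isclose(math.sin(math.asin(x) + 2 * k * math.pi), x, abs_tol=tol)

for x in (-1.0, -0.5, 0.0, 0.3, 1.0):
    for k in (-2, 0, 3):
        assert retromorphic_sin_check(x, k)
```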
Proceedings of The first international workshop on eXplainable AI for the Arts (XAIxArts)
results: The workshop papers report valuable findings on the application of XAI in the arts. Abstract
This first international workshop on explainable AI for the Arts (XAIxArts) brought together a community of researchers in HCI, Interaction Design, AI, explainable AI (XAI), and digital arts to explore the role of XAI for the Arts. Workshop held at the 15th ACM Conference on Creativity and Cognition (C&C 2023).
TANGO: Time-Reversal Latent GraphODE for Multi-Agent Dynamical Systems
results: The method performs strongly in experiments on a variety of physical systems, notably achieving an 11.5% MSE improvement on the challenging chaotic triple-pendulum system. Abstract
Learning complex multi-agent system dynamics from data is crucial across many domains, such as in physical simulations and material modeling. Extended from purely data-driven approaches, existing physics-informed approaches such as Hamiltonian Neural Network strictly follow energy conservation law to introduce inductive bias, making their learning more sample efficiently. However, many real-world systems do not strictly conserve energy, such as spring systems with frictions. Recognizing this, we turn our attention to a broader physical principle: Time-Reversal Symmetry, which depicts that the dynamics of a system shall remain invariant when traversed back over time. It still helps to preserve energies for conservative systems and in the meanwhile, serves as a strong inductive bias for non-conservative, reversible systems. To inject such inductive bias, in this paper, we propose a simple-yet-effective self-supervised regularization term as a soft constraint that aligns the forward and backward trajectories predicted by a continuous graph neural network-based ordinary differential equation (GraphODE). It effectively imposes time-reversal symmetry to enable more accurate model predictions across a wider range of dynamical systems under classical mechanics. In addition, we further provide theoretical analysis to show that our regularization essentially minimizes higher-order Taylor expansion terms during the ODE integration steps, which enables our model to be more noise-tolerant and even applicable to irreversible systems. Experimental results on a variety of physical systems demonstrate the effectiveness of our proposed method. Particularly, it achieves an MSE improvement of 11.5 % on a challenging chaotic triple-pendulum systems.
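A hedged sketch of a time-reversal regularizer in the spirit of the method above: integrate the learned dynamics forward, integrate forward again from the reversed final state, and penalize deviation from the reversed forward trajectory. It assumes autonomous dynamics, a uniform time grid starting at zero, and a state split into positions and velocities; the paper's exact loss may differ.

```python
import torch
from torchdiffeq import odeint  # ode_func must have signature f(t, z)

def reverse_op(z):
    # Assumed state layout: first half positions, second half velocities;
    # time reversal keeps positions and flips velocities.
    q, v = z.chunk(2, dim=-1)
    return torch.cat([q, -v], dim=-1)

def time_reversal_loss(ode_func, z0, t):
    fwd = odeint(ode_func, z0, t)                    # states at t[0..T]
    # Start from the reversed final state and run the same dynamics forward.
    bwd = odeint(ode_func, reverse_op(fwd[-1]), t)
    # Under time-reversal symmetry, reverse_op(bwd[i]) matches fwd[T - i].
    return ((reverse_op(bwd) - fwd.flip(0)) ** 2).mean()
```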
results: GPT-4 achieves results comparable to the current state of the art, with high precision and accuracy in detecting propaganda techniques. Abstract
The prevalence of propaganda in our digital society poses a challenge to societal harmony and the dissemination of truth. Detecting propaganda through NLP in text is challenging due to subtle manipulation techniques and contextual dependencies. To address this issue, we investigate the effectiveness of modern Large Language Models (LLMs) such as GPT-3 and GPT-4 for propaganda detection. We conduct experiments using the SemEval-2020 task 11 dataset, which features news articles labeled with 14 propaganda techniques as a multi-label classification problem. Five variations of GPT-3 and GPT-4 are employed, incorporating various prompt engineering and fine-tuning strategies across the different models. We evaluate the models' performance by assessing metrics such as $F1$ score, $Precision$, and $Recall$, comparing the results with the current state-of-the-art approach using RoBERTa. Our findings demonstrate that GPT-4 achieves comparable results to the current state-of-the-art. Further, this study analyzes the potential and challenges of LLMs in complex tasks like propaganda detection.
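The evaluation metrics mentioned above can be computed with scikit-learn for the 14-technique multi-label setup; the label matrices below are random stand-ins for gold annotations and model predictions.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = np.random.randint(0, 2, size=(100, 14))   # stand-in gold labels
y_pred = np.random.randint(0, 2, size=(100, 14))   # stand-in model outputs

micro_f1 = f1_score(y_true, y_pred, average="micro", zero_division=0)
precision = precision_score(y_true, y_pred, average="micro", zero_division=0)
recall = recall_score(y_true, y_pred, average="micro", zero_division=0)
print(micro_f1, precision, recall)
```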
Advective Diffusion Transformers for Topological Generalization in Graph Learning
results: Experiments show that the non-local diffusion approach and the ADiT model achieve superior performance on a wide range of graph learning tasks and generalize well under shifts in graph topology. Abstract
Graph diffusion equations are intimately related to graph neural networks (GNNs) and have recently attracted attention as a principled framework for analyzing GNN dynamics, formalizing their expressive power, and justifying architectural choices. One key open question in graph learning is the generalization capability of GNNs. A major limitation of current approaches hinges on the assumption that the graph topologies in the training and test sets come from the same distribution. In this paper, we make steps towards understanding the generalization of GNNs by exploring how graph diffusion equations extrapolate and generalize in the presence of varying graph topologies. We first show deficiencies in the generalization capability of existing models built upon local diffusion on graphs, stemming from the exponential sensitivity to topology variation. Our subsequent analysis reveals the promise of non-local diffusion, which advocates for feature propagation over fully-connected latent graphs, under the assumption of a specific data-generating condition. In addition to these findings, we propose a novel graph encoder backbone, Advective Diffusion Transformer (ADiT), inspired by advective graph diffusion equations that have a closed-form solution backed up with theoretical guarantees of desired generalization under topological distribution shifts. The new model, functioning as a versatile graph Transformer, demonstrates superior performance across a wide range of graph learning tasks.
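For reference, the textbook forms behind the terminology above (not the paper's exact equations): local graph diffusion propagates features through the Laplacian $L$, while an advective variant adds a transport term driven by an operator $A$, for example a learned attention over a latent fully-connected graph.

```latex
% Standard forms in textbook notation; the paper's equations may differ.
\begin{align}
  \frac{\partial X(t)}{\partial t} &= -L\,X(t)
    && \text{(local graph diffusion)} \\
  \frac{\partial X(t)}{\partial t} &= -L\,X(t) + A\,X(t)
    && \text{(advective diffusion)}
\end{align}
```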
Hexa: Self-Improving for Knowledge-Grounded Dialogue System
results: Experiments on multiple benchmark datasets show that the method successfully uses a self-improving mechanism to generate intermediate and final responses, improving knowledge-grounded dialogue generation. Abstract
A common practice in knowledge-grounded dialogue generation is to explicitly utilize intermediate steps (e.g., web-search, memory retrieval) with modular approaches. However, data for such steps are often inaccessible compared to those of dialogue responses as they are unobservable in an ordinary dialogue. To fill in the absence of these data, we develop a self-improving method to improve the generative performances of intermediate steps without the ground truth data. In particular, we propose a novel bootstrapping scheme with a guided prompt and a modified loss function to enhance the diversity of appropriate self-generated responses. Through experiments on various benchmark datasets, we empirically demonstrate that our method successfully leverages a self-improving mechanism in generating intermediate and final responses and improves the performances on the task of knowledge-grounded dialogue generation.
for: The paper aims to improve the practicality of molecular property prediction benchmarks for drug discovery by creating a new benchmark called Lo-Hi, which includes two tasks: Lead Optimization (Lo) and Hit Identification (Hi).
methods: The paper uses a novel molecular splitting algorithm that solves the Balanced Vertex Minimum $k$-Cut problem for the Hi task, and tests state-of-the-art and classic machine learning models under practical settings.
results: The paper shows that modern benchmarks are unrealistic and overoptimistic, and that the Lo-Hi benchmark more accurately reflects practical drug discovery applications. Abstract
Finding new drugs is getting harder and harder. One of the hopes of drug discovery is to use machine learning models to predict molecular properties. That is why models for molecular property prediction are being developed and tested on benchmarks such as MoleculeNet. However, existing benchmarks are unrealistic and are too different from applying the models in practice. We have created a new practical \emph{Lo-Hi} benchmark consisting of two tasks: Lead Optimization (Lo) and Hit Identification (Hi), corresponding to the real drug discovery process. For the Hi task, we designed a novel molecular splitting algorithm that solves the Balanced Vertex Minimum $k$-Cut problem. We tested state-of-the-art and classic ML models, revealing which works better under practical settings. We analyzed modern benchmarks and showed that they are unrealistic and overoptimistic. Review: https://openreview.net/forum?id=H2Yb28qGLV Lo-Hi benchmark: https://github.com/SteshinSS/lohi_neurips2023 Lo-Hi splitter library: https://github.com/SteshinSS/lohi_splitter
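The paper's splitter solves the Balanced Vertex Minimum $k$-Cut problem; as a much simpler illustration of why graph-based splitting matters for the Hi task, the sketch below connects molecules above a Tanimoto-similarity threshold and assigns whole connected components to one side of the split, so no near-duplicate leaks across. The threshold and fingerprint settings are illustrative assumptions.

```python
import networkx as nx
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def similarity_split(smiles_list, threshold=0.4, test_fraction=0.2):
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]
    g = nx.Graph()
    g.add_nodes_from(range(len(fps)))
    for i in range(len(fps)):
        for j in range(i + 1, len(fps)):
            if DataStructs.TanimotoSimilarity(fps[i], fps[j]) >= threshold:
                g.add_edge(i, j)
    # Whole components go to one side, so similar molecules never cross.
    train, test, budget = [], [], test_fraction * len(fps)
    for comp in sorted(nx.connected_components(g), key=len):
        target = test if len(test) < budget else train
        target.extend(comp)
    return train, test
```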
P5: Plug-and-Play Persona Prompting for Personalized Response Selection
paper_authors: Joosung Lee, Minsik Oh, Donghun Lee
for: This paper aims to address the challenges of using persona-grounded retrieval-based chatbots for personalized conversations, specifically the high cost of collecting persona-grounded corpora and the chatbot’s lack of consideration for persona in real-world applications.
methods: The proposed solution is a plug-and-play persona prompting method that allows the chatbot system to function as a standard open-domain chatbot when persona information is not available. The method uses a zero-shot setting to reduce the dependence on persona-grounded training data, and the model can be fine-tuned for even better performance.
results: The authors demonstrate that the zero-shot model improves the standard model by 7.71 and 1.04 points on the original and revised personas, respectively, while fine-tuning further improves the previous state-of-the-art system by 1.95 and 3.39 points on the original and revised personas. This is the first attempt to solve personalized response selection using prompt sequences. Abstract
The use of persona-grounded retrieval-based chatbots is crucial for personalized conversations, but there are several challenges that need to be addressed. 1) In general, collecting persona-grounded corpus is very expensive. 2) The chatbot system does not always respond in consideration of persona at real applications. To address these challenges, we propose a plug-and-play persona prompting method. Our system can function as a standard open-domain chatbot if persona information is not available. We demonstrate that this approach performs well in the zero-shot setting, which reduces the dependence on persona-ground training data. This makes it easier to expand the system to other languages without the need to build a persona-grounded corpus. Additionally, our model can be fine-tuned for even better performance. In our experiments, the zero-shot model improved the standard model by 7.71 and 1.04 points in the original persona and revised persona, respectively. The fine-tuned model improved the previous state-of-the-art system by 1.95 and 3.39 points in the original persona and revised persona, respectively. To the best of our knowledge, this is the first attempt to solve the problem of personalized response selection using prompt sequences. Our code is available on github~\footnote{https://github.com/rungjoo/plug-and-play-prompt-persona}.
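A minimal sketch of what "plug-and-play" persona prompting can look like: persona sentences are prepended to the dialogue context when available and omitted otherwise, leaving the underlying model untouched so it behaves as a standard open-domain chatbot without a persona. The template wording is illustrative, not the paper's exact prompt.

```python
def build_prompt(context, persona=None):
    # context: list of dialogue turns; persona: optional list of sentences.
    lines = []
    if persona:  # zero-shot persona injection; skipped entirely if unavailable
        lines.append("your persona: " + " ".join(persona))
    lines.extend(f"dialogue: {turn}" for turn in context)
    return "\n".join(lines)

print(build_prompt(["Hi! What do you do for fun?"],
                   persona=["I love hiking.", "I have two dogs."]))
```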
Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations
results: Experimental results show that ICA and ICD can respectively increase and reduce the success rate of adversarial jailbreaking attacks on aligned language models. Abstract
Large Language Models (LLMs) have shown remarkable success in various tasks, but concerns about their safety and the potential for generating malicious content have emerged. In this paper, we explore the power of In-Context Learning (ICL) in manipulating the alignment ability of LLMs. We find that by providing just few in-context demonstrations without fine-tuning, LLMs can be manipulated to increase or decrease the probability of jailbreaking, i.e. answering malicious prompts. Based on these observations, we propose In-Context Attack (ICA) and In-Context Defense (ICD) methods for jailbreaking and guarding aligned language model purposes. ICA crafts malicious contexts to guide models in generating harmful outputs, while ICD enhances model robustness by demonstrations of rejecting to answer harmful prompts. Our experiments show the effectiveness of ICA and ICD in increasing or reducing the success rate of adversarial jailbreaking attacks. Overall, we shed light on the potential of ICL to influence LLM behavior and provide a new perspective for enhancing the safety and alignment of LLMs.
What Makes for Robust Multi-Modal Models in the Face of Missing Modalities?
methods: The paper models scenarios in which multi-modal models face missing modalities from an information-theoretic perspective and proposes Uni-Modal Ensemble with Missing Modality Adaptation (UME-MMA). UME-MMA uses pre-trained uni-modal network weights to improve feature extraction and missing-modality data augmentation to better adapt to absent modalities.
results: The paper demonstrates the effectiveness of UME-MMA on audio-visual datasets (AV-MNIST, Kinetics-Sound, AVE) and vision-language datasets (MM-IMDB, UPMC Food101). Abstract
With the growing success of multi-modal learning, research on the robustness of multi-modal models, especially when facing situations with missing modalities, is receiving increased attention. Nevertheless, previous studies in this domain exhibit certain limitations, as they often lack theoretical insights or their methodologies are tied to specific network architectures or modalities. We model the scenarios of multi-modal models encountering missing modalities from an information-theoretic perspective and illustrate that the performance ceiling in such scenarios can be approached by efficiently utilizing the information inherent in non-missing modalities. In practice, there are two key aspects: (1) The encoder should be able to extract sufficiently good features from the non-missing modality; (2) The extracted features should be robust enough not to be influenced by noise during the fusion process across modalities. To this end, we introduce Uni-Modal Ensemble with Missing Modality Adaptation (UME-MMA). UME-MMA employs uni-modal pre-trained weights for the multi-modal model to enhance feature extraction and utilizes missing modality data augmentation techniques to better adapt to situations with missing modalities. Apart from that, UME-MMA, built on a late-fusion learning framework, allows for the plug-and-play use of various encoders, making it suitable for a wide range of modalities and enabling seamless integration of large-scale pre-trained encoders to further enhance performance. And we demonstrate UME-MMA's effectiveness in audio-visual datasets~(e.g., AV-MNIST, Kinetics-Sound, AVE) and vision-language datasets~(e.g., MM-IMDB, UPMC Food101).
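A hedged sketch of the two ingredients described above: uni-modal encoders combined by late fusion, plus missing-modality augmentation that randomly zeroes one modality during training so the fusion head learns to tolerate absent modalities at test time. The linear encoders and drop probability are illustrative placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class LateFusionModel(nn.Module):
    def __init__(self, audio_dim, visual_dim, hidden, num_classes):
        super().__init__()
        self.audio_enc = nn.Linear(audio_dim, hidden)    # stand-in encoders;
        self.visual_enc = nn.Linear(visual_dim, hidden)  # pre-trained in practice
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, audio, visual, drop_prob=0.3):
        a = torch.relu(self.audio_enc(audio))
        v = torch.relu(self.visual_enc(visual))
        if self.training and torch.rand(()) < drop_prob:
            # Simulate a missing modality for this batch.
            if torch.rand(()) < 0.5:
                a = torch.zeros_like(a)
            else:
                v = torch.zeros_like(v)
        return self.head(torch.cat([a, v], dim=-1))      # late fusion
```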
Advanced Efficient Strategy for Detection of Dark Objects Based on Spiking Network with Multi-Box Detection
methods: Combines spiked and normal convolution layers with a pre-trained VGG16 feature extractor.
results: Achieves 66.01% and 41.25% mAP for detecting 20 object classes in VOC-12 and 12 classes in the Ex-Dark dataset, respectively, outperforming other state-of-the-art object detection models. Abstract
Several deep learning algorithms have shown amazing performance on existing object detection tasks, but recognizing darker objects remains the largest challenge. Moreover, existing techniques either struggled to detect such objects or had a slow recognition rate, resulting in significant performance losses. As a result, an improved and accurate detection approach is required to address this difficulty. This study proposes a combination of spiked and normal convolution layers as an energy-efficient and reliable object detector model. The proposed model is split into two sections. The first section is developed as a feature extractor, which utilizes pre-trained VGG16, and the second section of the proposed structure is the combination of spiked and normal convolutional layers to detect the bounding boxes of images. We use a pre-trained model for classifying detected objects. With state-of-the-art Python libraries, spike layers can be trained efficiently. The proposed spike convolutional object detector (SCOD) has been evaluated on the VOC and Ex-Dark datasets. SCOD reached 66.01% and 41.25% mAP for detecting 20 different objects in VOC-12 and 12 objects in the Ex-Dark dataset, respectively. SCOD uses 14 Giga FLOPS for its forward path calculations. Experimental results indicated superior performance compared to Tiny YOLO, Spike YOLO, YOLO-LITE, Tinier YOLO and Center of loc+Xception based on mAP for the VOC dataset.
Geometrically Aligned Transfer Encoder for Inductive Transfer in Regression Tasks
results: On a variety of molecular graph datasets, GATE outperforms conventional methods and exhibits stable behavior in both the latent space and extrapolation regions. Abstract
Transfer learning is a crucial technique for handling a small amount of data that is potentially related to other abundant data. However, most of the existing methods are focused on classification tasks using images and language datasets. Therefore, in order to expand the transfer learning scheme to regression tasks, we propose a novel transfer technique based on differential geometry, namely the Geometrically Aligned Transfer Encoder (GATE). In this method, we interpret the latent vectors from the model to exist on a Riemannian curved manifold. We find a proper diffeomorphism between pairs of tasks to ensure that every arbitrary point maps to a locally flat coordinate in the overlapping region, allowing the transfer of knowledge from the source to the target data. This also serves as an effective regularizer for the model to behave in extrapolation regions. In this article, we demonstrate that GATE outperforms conventional methods and exhibits stable behavior in both the latent space and extrapolation regions for various molecular graph datasets.
Noisy-ArcMix: Additive Noisy Angular Margin Loss Combined With Mixup Anomalous Sound Detection
for: This paper targets unsupervised anomalous sound detection (ASD), aiming to identify abnormal sounds by learning the features of normal operational sounds and sensing deviations from them.
methods: The study uses a self-supervised classification task on normal data to learn a representation space, and proposes a training technique that ensures intra-class compactness and widens the angular gap between normal and abnormal samples.
results: Experiments show that the proposed method achieves the best performance on the DCASE 2020 Challenge Task2 dataset, improving on the state-of-the-art method by 0.90%, 0.83%, and 2.16% in AUC, pAUC, and mAUC, respectively. Abstract
Unsupervised anomalous sound detection (ASD) aims to identify anomalous sounds by learning the features of normal operational sounds and sensing their deviations. Recent approaches have focused on the self-supervised task utilizing the classification of normal data, and advanced models have shown that securing representation space for anomalous data is important through representation learning yielding compact intra-class and well-separated inter-class distributions. However, we show that conventional approaches often fail to ensure sufficient intra-class compactness and exhibit angular disparity between samples and their corresponding centers. In this paper, we propose a training technique aimed at ensuring intra-class compactness and increasing the angle gap between normal and abnormal samples. Furthermore, we present an architecture that extracts features for important temporal regions, enabling the model to learn which time frames should be emphasized or suppressed. Experimental results demonstrate that the proposed method achieves the best performance giving 0.90%, 0.83%, and 2.16% improvement in terms of AUC, pAUC, and mAUC, respectively, compared to the state-of-the-art method on DCASE 2020 Challenge Task2 dataset.
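A hedged sketch of an additive angular margin loss of the ArcFace family, the kind of objective Noisy-ArcMix builds on to enforce intra-class compactness and an angular gap; the noise injection and mixup components of the actual method are omitted here.

```python
import torch
import torch.nn.functional as F

def angular_margin_loss(embeddings, weights, labels, margin=0.5, scale=30.0):
    # Cosine of the angle between normalized embeddings and class centers.
    cos = F.normalize(embeddings) @ F.normalize(weights).T    # (B, C)
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    onehot = F.one_hot(labels, weights.size(0)).float()
    # Additive margin applied only on each sample's true class.
    logits = scale * torch.cos(theta + margin * onehot)
    return F.cross_entropy(logits, labels)

emb = torch.randn(16, 128)            # batch of embeddings
centers = torch.randn(10, 128)        # learnable class-center matrix
labels = torch.randint(0, 10, (16,))
loss = angular_margin_loss(emb, centers, labels)
```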
paper_authors: Arafat Islam, Md. Imtiaz Habib
for: Fire detection, particularly of small flames in indoor, outdoor, and forest fire scenes.
methods: Proposes an improved YOLOv5 deep learning algorithm for fire detection, expanding the feature extraction network and promoting the feature pyramid.
results: The algorithm detects small fire targets with high accuracy, reaching 90.5% mAP and an F1 score of 88%, and supports real-time forest fire detection with an average detection time of 0.12 s per frame. Abstract
For the detection of fire-like targets in indoor, outdoor and forest fire images, as well as fire detection under different natural lights, an improved YOLOv5 fire detection deep learning algorithm is proposed. The YOLOv5 detection model expands the feature extraction network along three dimensions, which enhances feature propagation for small fire-target identification, improves network performance, and reduces model parameters. Furthermore, through the promotion of the feature pyramid, the top-performing prediction box is obtained. Fire-YOLOv5 attains excellent results compared to state-of-the-art object detection networks, notably in the detection of small targets of fire and smoke, with 90.5% mAP and an 88% F1 score. Overall, the Fire-YOLOv5 detection model can effectively handle the inspection of small fire targets, as well as fire-like and smoke-like objects, with an F1 score of 0.88. When the input image size is 416 x 416 resolution, the average detection time is 0.12 s per frame, which enables real-time forest fire detection. Moreover, the algorithm proposed in this paper can also be applied to small target detection in other complicated situations. The proposed system shows improvements in all fire detection metrics, including precision, recall, and mean average precision.
Filter Pruning For CNN With Enhanced Linear Representation Redundancy
results: On the CIFAR-10 dataset, the method achieves 93.64% accuracy with only 1.40M parameters and 49.60M FLOPs remaining. On ImageNet, it achieves 42.8% and 47.3% reductions in storage and computation while retaining 76.23% accuracy.
Abstract
Structured network pruning outperforms non-structured methods because it can take advantage of thriving parallel computing techniques. In this paper, we propose a new structured pruning method. Firstly, to create more structured redundancy, we present a data-driven loss function term calculated from the correlation coefficient matrix of different feature maps in the same layer, named CCM-loss. This loss term encourages the neural network to learn stronger linear representation relations between feature maps during training from scratch, so that more homogeneous parts can be removed later in pruning. CCM-loss provides another universal mathematical tool besides L*-norm regularization: whereas the latter concentrates on generating zeros, CCM-loss generates redundancy of a different kind. Furthermore, we design a matching channel selection strategy based on principal component analysis to exploit the full potential of CCM-loss. In our new strategy, we mainly focus on the consistency and integrality of the information flow in the network. Instead of empirically hard-coding the retain ratio for each layer, our channel selection strategy dynamically adjusts each layer's retain ratio according to the specific circumstances of a pre-trained model, pushing the prune ratio to the limit. Notably, on the CIFAR-10 dataset, our method achieves 93.64% accuracy for pruned VGG-16 with only 1.40M parameters and 49.60M FLOPs; the pruned ratios for parameters and FLOPs are 90.6% and 84.2%, respectively. For ResNet-50 trained on the ImageNet dataset, our approach achieves 42.8% and 47.3% storage and computation reductions, respectively, with an accuracy of 76.23%. Our code is available at https://github.com/Bojue-Wang/CCM-LRR.
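A minimal PyTorch sketch of the CCM-loss idea: compute the correlation-coefficient matrix across the channels of one layer's feature maps and reward large off-diagonal correlations, so that linearly redundant channels emerge for later pruning. The sign convention and averaging here are assumptions; the paper's exact formulation may differ.

```python
import torch

def ccm_loss(feature_maps):
    """CCM-loss sketch over one layer's activations of shape (B, C, H, W):
    build the C x C correlation-coefficient matrix between channels and
    encourage strong off-diagonal correlation (linear redundancy)."""
    b, c, h, w = feature_maps.shape
    flat = feature_maps.permute(1, 0, 2, 3).reshape(c, -1)   # (C, B*H*W)
    flat = flat - flat.mean(dim=1, keepdim=True)
    cov = flat @ flat.t()
    std = flat.norm(dim=1)
    corr = cov / (std[:, None] * std[None, :] + 1e-8)        # (C, C)
    off_diag = corr - torch.diag(torch.diag(corr))
    # Negative sign: minimising this loss maximises channel correlation.
    return -off_diag.abs().sum() / (c * (c - 1))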
Contrastive Prompt Learning-based Code Search based on Interaction Matrix
results: Extensive experiments on a real-world dataset show that the method improves the quality of semantic representations for code search and the mapping ability between natural language and programming languages.
Abstract
Code search aims to retrieve the code snippet that highly matches the given query described in natural language. Recently, many code pre-training approaches have demonstrated impressive performance on code search. However, existing code search methods still suffer from two performance constraints: inadequate semantic representation and the semantic gap between natural language (NL) and programming language (PL). In this paper, we propose CPLCS, a contrastive prompt learning-based code search method based on the cross-modal interaction mechanism. CPLCS comprises: (1) PL-NL contrastive learning, which learns the semantic matching relationship between PL and NL representations; (2) a prompt learning design for a dual-encoder structure that can alleviate the problem of inadequate semantic representation; (3) a cross-modal interaction mechanism to enhance the fine-grained mapping between NL and PL. We conduct extensive experiments to evaluate the effectiveness of our approach on a real-world dataset across six programming languages. The experiment results demonstrate the efficacy of our approach in improving semantic representation quality and mapping ability between PL and NL.
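The PL-NL contrastive component can be sketched as a symmetric in-batch InfoNCE loss over paired code/description embeddings. The temperature value and the symmetric form are assumptions about details the abstract leaves open.

```python
import torch
import torch.nn.functional as F

def pl_nl_contrastive_loss(code_emb, nl_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (code, description)
    embeddings: matching pairs sit on the diagonal of the similarity
    matrix and every other pair in the batch acts as a negative."""
    code_emb = F.normalize(code_emb, dim=-1)
    nl_emb = F.normalize(nl_emb, dim=-1)
    sim = code_emb @ nl_emb.t() / temperature          # (B, B)
    targets = torch.arange(sim.size(0), device=sim.device)
    return 0.5 * (F.cross_entropy(sim, targets) +
                  F.cross_entropy(sim.t(), targets))
```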
I2SRM: Intra- and Inter-Sample Relationship Modeling for Multimodal Information Extraction
results: On the multimodal named entity recognition datasets Twitter-2015 and Twitter-2017 and the multimodal relation extraction dataset MNRE, the proposed I2SRM method achieves competitive results: 77.12% F1 on Twitter-2015, 88.40% F1 on Twitter-2017, and 84.12% F1 on MNRE.
Abstract
Multimodal information extraction is attracting research attention nowadays, which requires aggregating representations from different modalities. In this paper, we present the Intra- and Inter-Sample Relationship Modeling (I2SRM) method for this task, which contains two modules. Firstly, the intra-sample relationship modeling module operates on a single sample and aims to learn effective representations. Embeddings from textual and visual modalities are shifted to bridge the modality gap caused by distinct pre-trained language and image models. Secondly, the inter-sample relationship modeling module considers relationships among multiple samples and focuses on capturing the interactions. An AttnMixup strategy is proposed, which not only enables collaboration among samples but also augments data to improve generalization. We conduct extensive experiments on the multimodal named entity recognition datasets Twitter-2015 and Twitter-2017, and the multimodal relation extraction dataset MNRE. Our proposed method I2SRM achieves competitive results, 77.12% F1-score on Twitter-2015, 88.40% F1-score on Twitter-2017, and 84.12% F1-score on MNRE.
Predicting Three Types of Freezing of Gait Events Using Deep Learning Models
paper_authors: Wen Tao Mo, Jonathan H. Chan
for: This paper aims to predict freezing-of-gait episodes in patients with Parkinson's disease, distinguishing between the different types of freezing-of-gait events.
methods: The study uses deep learning models combining a transformer encoder architecture with bidirectional LSTM layers and different feature sets to predict three types of freezing-of-gait events.
results: The best-performing model achieves a score of 0.427 on test data, which would rank in the top 5 of Kaggle's Freezing of Gait prediction competition. The authors also observe overfitting on the training data, which might be mitigated through pseudo-labelling and model architecture simplification.
Abstract
Freezing of gait is a Parkinson's Disease symptom that episodically afflicts patients with an inability to step or turn while walking. While medical experts have discovered various triggers and alleviating actions for freezing of gait, the underlying causes and prediction models are still being explored today. Current freezing of gait prediction models that utilize machine learning achieve high sensitivity and specificity in freezing of gait predictions based on time-series data; however, these models lack specifications on the type of freezing of gait events. We develop various deep learning models using the transformer encoder architecture plus Bidirectional LSTM layers and different feature sets to predict the three different types of freezing of gait events. The best performing model achieves a score of 0.427 on testing data, which would rank in the top 5 in Kaggle's Freezing of Gait prediction competition, hosted by THE MICHAEL J. FOX FOUNDATION. However, we also recognize overfitting in training data that could potentially be improved through pseudo labelling on additional data and model architecture simplification.
Dobby: A Conversational Service Robot Driven by GPT-4
results: In a free-form tour-guide scenario, the robot with conversational AI capabilities outperformed the robot without them in overall effectiveness, exploration abilities, scrutinization abilities, receptiveness to personification, and adaptability.
Abstract
This work introduces a robotics platform which embeds a conversational AI agent in an embodied system for natural language understanding and intelligent decision-making for service tasks; integrating task planning and human-like conversation. The agent is derived from a large language model, which has learned from a vast corpus of general knowledge. In addition to generating dialogue, this agent can interface with the physical world by invoking commands on the robot; seamlessly merging communication and behavior. This system is demonstrated in a free-form tour-guide scenario, in an HRI study comparing robots with and without conversational AI capabilities. Performance is measured along five dimensions: overall effectiveness, exploration abilities, scrutinization abilities, receptiveness to personification, and adaptability.
Dynamical versus Bayesian Phase Transitions in a Toy Model of Superposition
paper_authors: Zhongtian Chen, Edmund Lau, Jake Mendel, Susan Wei, Daniel Murfet
for: This paper investigates phase transitions in a Toy Model of Superposition (TMS) using Singular Learning Theory (SLT).
methods: The paper derives a closed formula for the theoretical loss and, in the case of two hidden dimensions, discovers that regular $k$-gons are critical points.
results: The paper provides supporting theory indicating that the local learning coefficient (a geometric invariant) of these $k$-gons determines phase transitions in the Bayesian posterior as a function of training sample size. Empirical results show that the same $k$-gon critical points also determine the behavior of SGD training.
Abstract
We investigate phase transitions in a Toy Model of Superposition (TMS) using Singular Learning Theory (SLT). We derive a closed formula for the theoretical loss and, in the case of two hidden dimensions, discover that regular $k$-gons are critical points. We present supporting theory indicating that the local learning coefficient (a geometric invariant) of these $k$-gons determines phase transitions in the Bayesian posterior as a function of training sample size. We then show empirically that the same $k$-gon critical points also determine the behavior of SGD training. The picture that emerges adds evidence to the conjecture that the SGD learning trajectory is subject to a sequential learning mechanism. Specifically, we find that the learning process in TMS, be it through SGD or Bayesian learning, can be characterized by a journey through parameter space from regions of high loss and low complexity to regions of low loss and high complexity.
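To make the setting concrete, here is a small numpy sketch of the toy model with two hidden dimensions: the columns of $W$ are placed at the vertices of a regular $k$-gon and the reconstruction loss is Monte-Carlo estimated. The ReLU read-out, zero bias, and sparse-uniform input distribution are standard TMS conventions assumed here, not necessarily the exact setup of the paper.

```python
import numpy as np

def kgon_weights(n_features, radius=1.0):
    """Columns of W at the vertices of a regular n-gon in the plane
    (two hidden dimensions), the critical points discussed in the paper."""
    angles = 2 * np.pi * np.arange(n_features) / n_features
    return radius * np.stack([np.cos(angles), np.sin(angles)])   # (2, n)

def tms_loss(W, b, n_samples=100_000, sparsity=0.999, rng=None):
    """Monte-Carlo estimate of E||x - ReLU(W^T W x + b)||^2 with
    i.i.d. sparse uniform inputs (the input distribution is an assumption)."""
    rng = rng or np.random.default_rng(0)
    n = W.shape[1]
    x = rng.uniform(0, 1, size=(n_samples, n))
    x = x * (rng.uniform(0, 1, size=(n_samples, n)) > sparsity)  # sparse features
    x_hat = np.maximum(x @ W.T @ W + b, 0.0)
    return np.mean(np.sum((x - x_hat) ** 2, axis=1))

W = kgon_weights(n_features=5)
print(tms_loss(W, b=np.zeros(5)))   # loss at the regular 5-gon configuration
```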
Suppressing Overestimation in Q-Learning through Adversarial Behaviors
results: Experiments across a variety of environments show that the proposed DAQ effectively suppresses the overestimation bias and can be readily applied to off-the-shelf reinforcement learning algorithms to improve their performance.
Abstract
The goal of this paper is to propose a new Q-learning algorithm with a dummy adversarial player, called dummy adversarial Q-learning (DAQ), that can effectively regulate the overestimation bias in standard Q-learning. With the dummy player, the learning can be formulated as a two-player zero-sum game. The proposed DAQ unifies several Q-learning variations for controlling overestimation biases, such as maxmin Q-learning and minmax Q-learning (proposed in this paper), in a single framework. DAQ is a simple but effective way to suppress the overestimation bias through dummy adversarial behaviors and can easily be applied to off-the-shelf reinforcement learning algorithms to improve performance. A finite-time convergence of DAQ is analyzed from an integrated perspective by adapting adversarial Q-learning. The performance of the suggested DAQ is empirically demonstrated under various benchmark environments.
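As a rough tabular sketch of the two-player zero-sum view: the dummy adversary below evaluates the next state pessimistically by taking the elementwise minimum of two Q estimates before the agent's max, in the spirit of the maxmin variant that DAQ unifies. How the paper actually couples the players and estimates is an assumption here.

```python
import numpy as np

def daq_update(Q1, Q2, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular update with a dummy-adversarial target.
    Q1, Q2: arrays of shape (n_states, n_actions)."""
    # Dummy player: pessimistic (min) evaluation of each next action,
    # which counteracts the overestimation of the plain max backup.
    q_min = np.minimum(Q1[s_next], Q2[s_next])
    target = r + gamma * q_min.max()     # max over agent actions, min over estimates
    # Update one randomly chosen estimate, as in double-Q-style schemes.
    Q = Q1 if np.random.rand() < 0.5 else Q2
    Q[s, a] += alpha * (target - Q[s, a])
```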
BC4LLM: Trusted Artificial Intelligence When Blockchain Meets Large Language Models
methods: The paper uses blockchain technology to address the authenticity and reliability of AI training data, covering a reliable learning corpus, a secure training process, and identifiable generated content.
results: The paper envisions that blockchain-based empowerment of LLMs can support trustworthy and secure AI, and reviews potential applications and challenges in the field of frontier communication networks.
Abstract
In recent years, artificial intelligence (AI) and machine learning (ML) are reshaping society's production methods and productivity, and also changing the paradigm of scientific research. Among them, the AI language model represented by ChatGPT has made great progress. Such large language models (LLMs) serve people in the form of AI-generated content (AIGC) and are widely used in consulting, healthcare, and education. However, it is difficult to guarantee the authenticity and reliability of AIGC learning data. In addition, there are also hidden dangers of privacy disclosure in distributed AI training. Moreover, the content generated by LLMs is difficult to identify and trace, and it is difficult to cross-platform mutual recognition. The above information security issues in the coming era of AI powered by LLMs will be infinitely amplified and affect everyone's life. Therefore, we consider empowering LLMs using blockchain technology with superior security features to propose a vision for trusted AI. This paper mainly introduces the motivation and technical route of blockchain for LLM (BC4LLM), including reliable learning corpus, secure training process, and identifiable generated content. Meanwhile, this paper also reviews the potential applications and future challenges, especially in the frontier communication networks field, including network resource allocation, dynamic spectrum sharing, and semantic communication. Based on the above work combined and the prospect of blockchain and LLMs, it is expected to help the early realization of trusted AI and provide guidance for the academic community.
Let Models Speak Ciphers: Multiagent Debate through Embeddings
paper_authors: Chau Pham, Boyi Liu, Yingxiang Yang, Zhengyu Chen, Tianyi Liu, Jianbo Yuan, Bryan A. Plummer, Zhaoran Wang, Hongxia Yang
for: Improving the reasoning ability of large language models (LLMs)
methods: Remove the token sampling step from LLMs and let models communicate their beliefs across the vocabulary through the expectation of the raw transformer output embeddings
results: Across five reasoning tasks and multiple open-source LLMs of varying sizes, CIPHER extends the lead of natural-language debate over traditional inference by a further 1-3.5%, demonstrating the advantage and robustness of embeddings as an alternative "language" for communication among LLMs.
Abstract
Discussion and debate among Large Language Models (LLMs) have gained considerable attention due to their potential to enhance the reasoning ability of LLMs. Although natural language is an obvious choice for communication due to LLM's language understanding capability, the token sampling step needed when generating natural language poses a potential risk of information loss, as it uses only one token to represent the model's belief across the entire vocabulary. In this paper, we introduce a communication regime named CIPHER (Communicative Inter-Model Protocol Through Embedding Representation) to address this issue. Specifically, we remove the token sampling step from LLMs and let them communicate their beliefs across the vocabulary through the expectation of the raw transformer output embeddings. Remarkably, by deviating from natural language, CIPHER offers an advantage of encoding a broader spectrum of information without any modification to the model weights. While the state-of-the-art LLM debate methods using natural language outperforms traditional inference by a margin of 1.5-8%, our experiment results show that CIPHER debate further extends this lead by 1-3.5% across five reasoning tasks and multiple open-source LLMs of varying sizes. This showcases the superiority and robustness of embeddings as an alternative "language" for communication among LLMs.
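The core mechanism is easy to state in code: rather than sampling a single token, a debater passes on the probability-weighted average of the token embeddings, so its full belief over the vocabulary survives the exchange. A minimal PyTorch sketch (the temperature knob is an assumption):

```python
import torch

def cipher_message(logits, embedding_matrix, temperature=1.0):
    """Replace token sampling with an expectation over token embeddings.
    logits: (seq_len, vocab); embedding_matrix: (vocab, dim).
    Returns one belief-weighted embedding per position."""
    probs = torch.softmax(logits / temperature, dim=-1)   # (seq_len, vocab)
    return probs @ embedding_matrix                        # (seq_len, dim)
```

The receiving model would then consume this vector sequence through its input-embedding interface rather than through discrete tokens, which is why no model weights need to change.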
Towards Mitigating Hallucination in Large Language Models via Self-Reflection
paper_authors: Ziwei Ji, Tiezheng Yu, Yan Xu, Nayeon Lee, Etsuko Ishii, Pascale Fung
for: This paper focuses on the issue of hallucination in medical generative question-answering systems, and proposes an interactive self-reflection methodology to improve the factuality and consistency of the generated answers.
methods: The paper uses widely adopted large language models (LLMs) and datasets, and employs an interactive self-reflection methodology that incorporates knowledge acquisition and answer generation to tackle the challenge of hallucination.
results: The experimental results show that the proposed approach outperforms baselines in reducing hallucination and produces more accurate and consistent answers.
Abstract
Large language models (LLMs) have shown promise for generative and knowledge-intensive tasks including question-answering (QA) tasks. However, the practical deployment still faces challenges, notably the issue of "hallucination", where models generate plausible-sounding but unfaithful or nonsensical information. This issue becomes particularly critical in the medical domain due to the uncommon professional concepts and potential social risks involved. This paper analyses the phenomenon of hallucination in medical generative QA systems using widely adopted LLMs and datasets. Our investigation centers on the identification and comprehension of common problematic answers, with a specific emphasis on hallucination. To tackle this challenge, we present an interactive self-reflection methodology that incorporates knowledge acquisition and answer generation. Through this feedback process, our approach steadily enhances the factuality, consistency, and entailment of the generated answers. Consequently, we harness the interactivity and multitasking ability of LLMs and produce progressively more precise and accurate answers. Experimental results on both automatic and human evaluation demonstrate the superiority of our approach in hallucination reduction compared to baselines.
The AI Incident Database as an Educational Tool to Raise Awareness of AI Harms: A Classroom Exploration of Efficacy, Limitations, & Future Improvements
paper_authors: Michael Feffer, Nikolas Martelaro, Hoda Heidari
for: This paper aims to raise awareness of the harms that can arise from deploying AI technologies, and of how to design safe and reliable AI systems.
methods: The paper uses the AI Incident Database (AIID) as an educational tool to help students understand the harms AI technologies can cause in socially high-stakes domains.
results: The study finds that interacting with the AIID significantly changed students' initial impressions: they became more aware of AI harms and felt a stronger sense of urgency around designing safe and reliable AI systems.
Abstract
Prior work has established the importance of integrating AI ethics topics into computer and data sciences curricula. We provide evidence suggesting that one of the critical objectives of AI Ethics education must be to raise awareness of AI harms. While there are various sources to learn about such harms, The AI Incident Database (AIID) is one of the few attempts at offering a relatively comprehensive database indexing prior instances of harms or near harms stemming from the deployment of AI technologies in the real world. This study assesses the effectiveness of AIID as an educational tool to raise awareness regarding the prevalence and severity of AI harms in socially high-stakes domains. We present findings obtained through a classroom study conducted at an R1 institution as part of a course focused on the societal and ethical considerations around AI and ML. Our qualitative findings characterize students' initial perceptions of core topics in AI ethics and their desire to close the educational gap between their technical skills and their ability to think systematically about ethical and societal aspects of their work. We find that interacting with the database helps students better understand the magnitude and severity of AI harms and instills in them a sense of urgency around (a) designing functional and safe AI and (b) strengthening governance and accountability mechanisms. Finally, we compile students' feedback about the tool and our class activity into actionable recommendations for the database development team and the broader community to improve awareness of AI harms in AI ethics education.
CodeFuse-13B: A Pretrained Multi-lingual Code Large Language Model
results: Experimental results show that in real-world scenarios such as code generation, code translation, code commenting, and test-case generation, CodeFuse-13B handles Chinese inputs better than other models, and achieves a HumanEval pass@1 score of 37.10%, ranking among the top multi-lingual code LLMs of similar parameter size.
Abstract
Code Large Language Models (Code LLMs) have gained significant attention in the industry due to their wide applications in the full lifecycle of software engineering. However, the effectiveness of existing models in understanding non-English inputs for multi-lingual code-related tasks is still far from well studied. This paper introduces CodeFuse-13B, an open-sourced pre-trained code LLM. It is specifically designed for code-related tasks with both English and Chinese prompts and supports over 40 programming languages. CodeFuse achieves its effectiveness by utilizing a high quality pre-training dataset that is carefully filtered by program analyzers and optimized during the training process. Extensive experiments are conducted using real-world usage scenarios, the industry-standard benchmark HumanEval-x, and the specially designed CodeFuseEval for Chinese prompts. To assess the effectiveness of CodeFuse, we actively collected valuable human feedback from the AntGroup's software development process where CodeFuse has been successfully deployed. The results demonstrate that CodeFuse-13B achieves a HumanEval pass@1 score of 37.10%, positioning it as one of the top multi-lingual code LLMs with similar parameter sizes. In practical scenarios, such as code generation, code translation, code comments, and testcase generation, CodeFuse performs better than other models when confronted with Chinese prompts.
Self-Discriminative Modeling for Anomalous Graph Detection
results: The paper proposes three algorithms with different computational efficiencies and stabilities, and compares them with several state-of-the-art graph-level anomaly detection baselines, showing significant improvements in AUC.
Abstract
This paper studies the problem of detecting anomalous graphs using a machine learning model trained on only normal graphs, which has many applications in molecule, biology, and social network data analysis. We present a self-discriminative modeling framework for anomalous graph detection. The key idea, mathematically and numerically illustrated, is to learn a discriminator (classifier) from the given normal graphs together with pseudo-anomalous graphs generated by a model jointly trained, where we never use any true anomalous graphs and we hope that the generated pseudo-anomalous graphs interpolate between normal ones and (real) anomalous ones. Under the framework, we provide three algorithms with different computational efficiencies and stabilities for anomalous graph detection. The three algorithms are compared with several state-of-the-art graph-level anomaly detection baselines on nine popular graph datasets (four with small size and five with moderate size) and show significant improvement in terms of AUC. The success of our algorithms stems from the integration of the discriminative classifier and the well-posed pseudo-anomalous graphs, which provide new insights for anomaly detection. Moreover, we investigate our algorithms for large-scale imbalanced graph datasets. Surprisingly, our algorithms, though fully unsupervised, are able to significantly outperform supervised learning algorithms of anomalous graph detection. The corresponding reason is also analyzed.
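A heavily simplified sketch of the self-discriminative framework: the paper trains a generator of pseudo-anomalous graphs jointly with the discriminator, but the version below works in embedding space for brevity. Operating directly on graph structure, as the paper does, is abstracted away, and the interpolation behavior of the generator is assumed rather than enforced.

```python
import torch
import torch.nn.functional as F

def train_step(encoder, generator, discriminator, normal_graphs, opt):
    """One joint step: the generator perturbs normal-graph embeddings into
    pseudo-anomalies; the discriminator learns to separate them from the
    normal embeddings. No true anomalous graphs are ever used."""
    z_normal = encoder(normal_graphs)        # (B, d) graph-level embeddings
    z_pseudo = generator(z_normal)           # (B, d) pseudo-anomalous embeddings
    logits = discriminator(torch.cat([z_normal, z_pseudo])).squeeze(-1)
    labels = torch.cat([torch.zeros(len(z_normal)),
                        torch.ones(len(z_pseudo))]).to(logits.device)
    loss = F.binary_cross_entropy_with_logits(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```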
Get the gist? Using large language models for few-shot decontextualization
results: Using only a small set of examples, the few-shot method achieves viable decontextualization performance across multiple domains and transfers across domains.
Abstract
In many NLP applications that involve interpreting sentences within a rich context -- for instance, information retrieval systems or dialogue systems -- it is desirable to be able to preserve the sentence in a form that can be readily understood without context, for later reuse -- a process known as ``decontextualization''. While previous work demonstrated that generative Seq2Seq models could effectively perform decontextualization after being fine-tuned on a specific dataset, this approach requires expensive human annotations and may not transfer to other domains. We propose a few-shot method of decontextualization using a large language model, and present preliminary results showing that this method achieves viable performance on multiple domains using only a small set of examples.
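A minimal sketch of the few-shot setup: one in-context example demonstrating the rewrite, followed by the target context and sentence. The prompt wording, the example, and the `llm_complete` callable are all illustrative assumptions, not the paper's actual prompt.

```python
FEW_SHOT_PROMPT = """Rewrite the final sentence so it can be understood without the surrounding context.

Context: The rover landed in Jezero crater in 2021. It carries seven instruments.
Sentence: It carries seven instruments.
Decontextualized: The Perseverance rover carries seven instruments.

Context: {context}
Sentence: {sentence}
Decontextualized:"""

def decontextualize(llm_complete, context, sentence):
    """Few-shot decontextualization with a single in-context example;
    `llm_complete` stands in for any text-completion API."""
    return llm_complete(FEW_SHOT_PROMPT.format(context=context, sentence=sentence))
```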
We are what we repeatedly do: Inducing and deploying habitual schemas in persona-based responses
results: The authors propose a method for inducing schemas from generic passages generated from simple facts, and then retrieving the relevant schemas to condition a language model's responses; this helps dialogue systems exhibit personas more naturally.
Abstract
Many practical applications of dialogue technology require the generation of responses according to a particular developer-specified persona. While a variety of personas can be elicited from recent large language models, the opaqueness and unpredictability of these models make it desirable to be able to specify personas in an explicit form. In previous work, personas have typically been represented as sets of one-off pieces of self-knowledge that are retrieved by the dialogue system for use in generation. However, in realistic human conversations, personas are often revealed through story-like narratives that involve rich habitual knowledge -- knowledge about kinds of events that an agent often participates in (e.g., work activities, hobbies, sporting activities, favorite entertainments, etc.), including typical goals, sub-events, preconditions, and postconditions of those events. We capture such habitual knowledge using an explicit schema representation, and propose an approach to dialogue generation that retrieves relevant schemas to condition a large language model to generate persona-based responses. Furthermore, we demonstrate a method for bootstrapping the creation of such schemas by first generating generic passages from a set of simple facts, and then inducing schemas from the generated passages.
Model Tuning or Prompt Tuning? A Study of Large Language Models for Clinical Concept and Relation Extraction
paper_authors: Cheng Peng, Xi Yang, Kaleb E Smith, Zehao Yu, Aokun Chen, Jiang Bian, Yonghui Wu
for: This study develops soft prompt-based learning algorithms for large language models (LLMs), examining the shape of prompts, prompt tuning with frozen and unfrozen LLMs, transfer learning, and few-shot learning abilities.
Objective: To develop soft prompt-based learning algorithms for large language models (LLMs), examining the shape of prompts, prompt tuning using frozen/unfrozen LLMs, transfer learning, and few-shot learning abilities.
Methods: We developed a soft prompt-based LLM model and compared 4 training strategies: (1) fine-tuning without prompts; (2) hard prompts with unfrozen LLMs; (3) soft prompts with unfrozen LLMs; and (4) soft prompts with frozen LLMs. We evaluated 7 pretrained LLMs using the 4 training strategies for clinical concept and relation extraction on two benchmark datasets. We evaluated the transfer learning ability of the prompt-based learning algorithms in a cross-institution setting, and also assessed the few-shot learning ability.
Results and Conclusion: When LLMs are unfrozen, GatorTron-3.9B with soft prompting achieves the best strict F1-scores of 0.9118 and 0.8604 for concept extraction, outperforming the traditional fine-tuning and hard prompt-based models by 0.6~3.1% and 1.2~2.9%, respectively; GatorTron-345M with soft prompting achieves the best F1-scores of 0.8332 and 0.7488 for end-to-end relation extraction, outperforming the other two models by 0.2~2% and 0.6~11.7%, respectively. When LLMs are frozen, small LLMs (i.e., 345 million parameters) have a big gap to close to be competitive with unfrozen models; scaling LLMs up to billions of parameters makes frozen LLMs competitive with unfrozen LLMs. For cross-institution evaluation, soft prompting with a frozen GatorTron-8.9B model achieved the best performance. This study demonstrates that (1) machines can learn soft prompts better than humans, (2) frozen LLMs have better few-shot learning ability and transfer learning ability to facilitate multi-institution applications, and (3) frozen LLMs require large models.
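The soft-prompt-with-frozen-LLM strategy can be sketched as trainable embeddings prepended to the input while the backbone's parameters stay frozen. The sketch below assumes a Hugging Face-style interface (`get_input_embeddings`, `inputs_embeds`, `config.hidden_size`) and an arbitrary prompt length; the paper's prompt shapes and initialization are not reproduced here.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Prepend trainable prompt embeddings to the input; with the backbone
    frozen, only `self.prompt` receives gradients during training."""
    def __init__(self, backbone, n_tokens=20, freeze_backbone=True):
        super().__init__()
        self.backbone = backbone
        d = backbone.config.hidden_size
        self.prompt = nn.Parameter(torch.randn(n_tokens, d) * 0.02)
        if freeze_backbone:
            for p in self.backbone.parameters():
                p.requires_grad = False

    def forward(self, input_ids, attention_mask):
        tok = self.backbone.get_input_embeddings()(input_ids)   # (B, L, d)
        b = tok.size(0)
        prompt = self.prompt.unsqueeze(0).expand(b, -1, -1)     # (B, P, d)
        inputs = torch.cat([prompt, tok], dim=1)
        mask = torch.cat([torch.ones(b, prompt.size(1), device=tok.device),
                          attention_mask], dim=1)
        return self.backbone(inputs_embeds=inputs, attention_mask=mask)
```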
Tackling Data Bias in MUSIC-AVQA: Crafting a Balanced Dataset for Unbiased Question-Answering
results: On MUSIC-AVQA v2.0, the new baseline model surpasses all existing benchmarks, improving accuracy by 2% and setting a new state-of-the-art.
Abstract
In recent years, there has been a growing emphasis on the intersection of audio, vision, and text modalities, driving forward the advancements in multimodal research. However, strong bias that exists in any modality can lead to the model neglecting the others. Consequently, the model's ability to effectively reason across these diverse modalities is compromised, impeding further advancement. In this paper, we meticulously review each question type from the original dataset, selecting those with pronounced answer biases. To counter these biases, we gather complementary videos and questions, ensuring that no answers have outstanding skewed distribution. In particular, for binary questions, we strive to ensure that both answers are almost uniformly spread within each question category. As a result, we construct a new dataset, named MUSIC-AVQA v2.0, which is more challenging and we believe could better foster the progress of AVQA task. Furthermore, we present a novel baseline model that delves deeper into the audio-visual-text interrelation. On MUSIC-AVQA v2.0, this model surpasses all the existing benchmarks, improving accuracy by 2% on MUSIC-AVQA v2.0, setting a new state-of-the-art performance.
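The paper balances answers by collecting complementary videos and questions rather than by discarding data, but the target distribution it aims for can be illustrated with a simple downsampling sketch over an assumed record layout:

```python
import random
from collections import defaultdict

def balance_binary_questions(samples, seed=0):
    """Downsample each binary question category so that both answers appear
    equally often; `samples` are dicts with 'category' and 'answer' keys,
    which is an assumed record layout for illustration only."""
    random.seed(seed)
    by_cat = defaultdict(lambda: defaultdict(list))
    for s in samples:
        by_cat[s["category"]][s["answer"]].append(s)
    balanced = []
    for answers in by_cat.values():
        n = min(len(group) for group in answers.values())  # rarest answer count
        for group in answers.values():
            balanced.extend(random.sample(group, n))
    return balanced
```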
Evolution of Natural Language Processing Technology: Not Just Language Processing Towards General Purpose AI
results: The findings show that the "practice makes perfect" principle can be realized through deep learning, including performing the four arithmetic operations without explicit training, explaining complex images, and generating images from the corresponding explanatory texts. The report also provides examples of business applications.
Abstract
Since the invention of computers, communication through natural language (actual human language) has been a dream technology. However, natural language is extremely difficult to mathematically formulate, making it difficult to realize as an algorithm without considering programming. While there have been numerous technological developments, one cannot say that any results allowing free utilization have been achieved thus far. In the case of language learning in humans, for instance when learning one's mother tongue or foreign language, one must admit that this process is similar to the adage "practice makes perfect" in principle, even though the learning method is significant up to a point. Deep learning has played a central role in contemporary AI technology in recent years. When applied to natural language processing (NLP), this produced unprecedented results. Achievements exceeding the initial predictions have been reported from the results of learning vast amounts of textual data using deep learning. For instance, four arithmetic operations could be performed without explicit learning, thereby enabling the explanation of complex images and the generation of images from corresponding explanatory texts. It is an accurate example of the learner embodying the concept of "practice makes perfect" by using vast amounts of textual data. This report provides a technological explanation of how cutting-edge NLP has made it possible to realize the "practice makes perfect" principle. Additionally, examples of how this can be applied to business are provided. We reported in June 2022 in Japanese on the NLP movement from late 2021 to early 2022. We would like to summarize this as a memorandum since this is just the initial movement leading to the current large language models (LLMs).
GPT-4 as an Agronomist Assistant? Answering Agriculture Exams Using Large Language Models
results: GPT-4 answered 93% of the agriculture exam questions correctly, achieving a passing score for renewing agronomist certifications and outperforming earlier general-purpose models, which achieved 88% accuracy; in one experiment it obtained the highest performance compared with human subjects. GPT-4 can also provide valuable insights for agricultural education, assessment, and crop management.
Abstract
Large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding across various domains, including healthcare and finance. For some tasks, LLMs achieve similar or better performance than trained human beings, therefore it is reasonable to employ human exams (e.g., certification tests) to assess the performance of LLMs. We present a comprehensive evaluation of popular LLMs, such as Llama 2 and GPT, on their ability to answer agriculture-related questions. In our evaluation, we also employ RAG (Retrieval-Augmented Generation) and ER (Ensemble Refinement) techniques, which combine information retrieval, generation capabilities, and prompting strategies to improve the LLMs' performance. To demonstrate the capabilities of LLMs, we selected agriculture exams and benchmark datasets from three of the largest agriculture producer countries: Brazil, India, and the USA. Our analysis highlights GPT-4's ability to achieve a passing score on exams to earn credits for renewing agronomist certifications, answering 93% of the questions correctly and outperforming earlier general-purpose models, which achieved 88% accuracy. On one of our experiments, GPT-4 obtained the highest performance when compared to human subjects. This performance suggests that GPT-4 could potentially pass on major graduate education admission tests or even earn credits for renewing agronomy certificates. We also explore the models' capacity to address general agriculture-related questions and generate crop management guidelines for Brazilian and Indian farmers, utilizing robust datasets from the Brazilian Agency of Agriculture (Embrapa) and graduate program exams from India. The results suggest that GPT-4, ER, and RAG can contribute meaningfully to agricultural education, assessment, and crop management practice, offering valuable insights to farmers and agricultural professionals.
paper_authors: Jingyang Xiang, Siqi Li, Jun Chen, Shipeng Bai, Yukai Ma, Guang Dai, Yong Liu
for: This paper aims to train a uniform 1$\times$N sparse structured network from scratch, which can overcome the problems of expensive training cost, memory access, sub-optimal model quality, and unbalanced workload across threads in existing sparse weight selection and fine-tuning methods.
methods: The proposed method, called Soft Uniform Block Pruning (SUBP), repeatedly allows pruned blocks to regrow to the network based on block angular redundancy and importance sampling in a uniform manner throughout the training process, making the model less dependent on pre-training and achieving balanced workload.
results: The paper shows that the proposed SUBP method consistently outperforms existing 1$\times$N and structured sparsity methods based on pre-trained models or training from scratch, as demonstrated by comprehensive experiments across various CNN architectures on ImageNet.
Abstract
The study of sparsity in Convolutional Neural Networks (CNNs) has become widespread to compress and accelerate models in environments with limited resources. By constraining N consecutive weights along the output channel to be group-wise non-zero, the recent network with 1$\times$N sparsity has received tremendous popularity for its three outstanding advantages: 1) A large amount of storage space saving by a \emph{Block Sparse Row} matrix. 2) Excellent performance at a high sparsity. 3) Significant speedups on CPUs with Advanced Vector Extensions. Recent work requires selecting and fine-tuning 1$\times$N sparse weights based on dense pre-trained weights, leading to the problems such as expensive training cost and memory access, sub-optimal model quality, as well as unbalanced workload across threads (different sparsity across output channels). To overcome them, this paper proposes a novel \emph{\textbf{S}oft \textbf{U}niform \textbf{B}lock \textbf{P}runing} (SUBP) approach to train a uniform 1$\times$N sparse structured network from scratch. Specifically, our approach tends to repeatedly allow pruned blocks to regrow to the network based on block angular redundancy and importance sampling in a uniform manner throughout the training process. It not only makes the model less dependent on pre-training, reduces the model redundancy and the risk of pruning the important blocks permanently but also achieves balanced workload. Empirically, on ImageNet, comprehensive experiments across various CNN architectures show that our SUBP consistently outperforms existing 1$\times$N and structured sparsity methods based on pre-trained models or training from scratch. Source codes and models are available at \url{https://github.com/JingyangXiang/SUBP}.
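A simplified sketch of the soft prune-and-regrow cycle over 1$\times$N blocks: score the blocks, zero the weakest, then let a few pruned blocks regrow via importance sampling. The L2-norm score stands in for the paper's block angular redundancy criterion, and the fractions are arbitrary illustrative values.

```python
import torch

def block_scores(weight, n=4):
    """Score each 1xN block of a (C_out, C_in) weight by its L2 norm
    (a stand-in for SUBP's angular-redundancy-based criterion)."""
    c_out, c_in = weight.shape
    blocks = weight.reshape(c_out, c_in // n, n)
    return blocks.norm(dim=-1)                       # (C_out, C_in // n)

def soft_prune_regrow(weight, sparsity=0.9, regrow_frac=0.05, n=4):
    """Zero the lowest-scoring blocks, then regrow a few of them sampled
    proportionally to their scores, keeping pruning soft and reversible."""
    scores = block_scores(weight, n)
    k = int(sparsity * scores.numel())
    flat = scores.flatten()
    pruned = torch.topk(flat, k, largest=False).indices
    mask = torch.ones_like(flat, dtype=torch.bool)
    mask[pruned] = False
    r = int(regrow_frac * k)
    if r > 0:
        probs = flat[pruned] + 1e-8                  # importance sampling
        regrow = pruned[torch.multinomial(probs / probs.sum(), r)]
        mask[regrow] = True
    mask = mask.reshape(scores.shape).repeat_interleave(n, dim=1)
    return weight * mask
```

Because every block keeps a chance to return, the model is less dependent on pre-training and important blocks are not pruned permanently, which is the property the abstract emphasizes.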
results: Strategic upweighting of the training loss on potentially idiomatic sentences and retrieval-augmented models improve a strong pretrained MT model's accuracy on idiomatic sentences by up to 13% in absolute accuracy, with potential benefits for non-idiomatic sentences as well.
Abstract
Idioms are common in everyday language, but often pose a challenge to translators because their meanings do not follow from the meanings of their parts. Despite significant advances, machine translation systems still struggle to translate idiomatic expressions. We provide a simple characterization of idiomatic translation and related issues. This allows us to conduct a synthetic experiment revealing a tipping point at which transformer-based machine translation models correctly default to idiomatic translations. To expand multilingual resources, we compile a dataset of ~4k natural sentences containing idiomatic expressions in French, Finnish, and Japanese. To improve translation of natural idioms, we introduce two straightforward yet effective techniques: the strategic upweighting of training loss on potentially idiomatic sentences, and using retrieval-augmented models. This not only improves the accuracy of a strong pretrained MT model on idiomatic sentences by up to 13% in absolute accuracy, but also holds potential benefits for non-idiomatic sentences.
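The loss-upweighting technique is straightforward to sketch: compute a per-sentence token-level NLL and scale it up for sentences flagged as potentially idiomatic. The weight value and padding convention below are assumptions.

```python
import torch
import torch.nn.functional as F

def upweighted_nll(logits, targets, is_idiomatic, idiom_weight=2.0, pad_id=0):
    """Per-sentence token-level NLL where sentences flagged as potentially
    idiomatic contribute `idiom_weight` times more to the batch loss.
    logits: (B, L, V); targets: (B, L); is_idiomatic: (B,) bool."""
    nll = F.cross_entropy(logits.transpose(1, 2), targets,
                          ignore_index=pad_id, reduction="none")   # (B, L)
    tokens = (targets != pad_id).sum(dim=1).clamp(min=1)
    sent_nll = nll.sum(dim=1) / tokens                  # length-normalised
    weights = 1.0 + (idiom_weight - 1.0) * is_idiomatic.float()
    return (weights * sent_nll).mean()
```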
Automatic Macro Mining from Interaction Traces at Scale
results: Through multiple studies, including user evaluation, comparative analysis against human-curated tasks, and automatic execution of the extracted macros, the authors demonstrate the effectiveness of the approach and the usefulness of the extracted macros in downstream applications.
Abstract
Macros are building block tasks of our everyday smartphone activity (e.g., "login", or "booking a flight"). Effectively extracting macros is important for understanding mobile interaction and enabling task automation. These macros are however difficult to extract at scale as they can be comprised of multiple steps yet hidden within programmatic components of the app. In this paper, we introduce a novel approach based on Large Language Models (LLMs) to automatically extract semantically meaningful macros from both random and user-curated mobile interaction traces. The macros produced by our approach are automatically tagged with natural language descriptions and are fully executable. To examine the quality of extraction, we conduct multiple studies, including user evaluation, comparative analysis against human-curated tasks, and automatic execution of these macros. These experiments and analyses show the effectiveness of our approach and the usefulness of extracted macros in various downstream applications.
LLMs as Potential Brainstorming Partners for Math and Science Problems
results: The study finds that current state-of-the-art LLMs show solid capabilities in collective brainstorming and can help humans tackle complex math and science problems, though they also exhibit limitations and flaws that call for further improvement.
Abstract
With the recent rise of widely successful deep learning models, there is emerging interest among professionals in various math and science communities to see and evaluate the state-of-the-art models' abilities to collaborate on finding or solving problems that often require creativity and thus brainstorming. While a significant chasm still exists between current human-machine intellectual collaborations and the resolution of complex math and science problems, such as the six unsolved Millennium Prize Problems, our initial investigation into this matter reveals a promising step towards bridging the divide. This is due to the recent advancements in Large Language Models (LLMs). More specifically, we conduct comprehensive case studies to explore both the capabilities and limitations of the current state-of-the-art LLM, notably GPT-4, in collective brainstorming with humans.
Violation of Expectation via Metacognitive Prompting Reduces Theory of Mind Prediction Error in Large Language Models
paper_authors: Courtland Leer, Vincent Trost, Vineeth Voruganti
for: This study explores how to improve the ability of Large Language Models (LLMs) to reason about human mental states.
methods: The study applies a mechanism from developmental psychology known as Violation of Expectation (VoE) to reduce errors in LLM predictions about users, and proposes a metacognitive prompting framework to apply VoE in the context of an AI tutor.
results: By storing and retrieving facts derived in cases where the LLM's expectation about the user was violated, LLMs are able to learn about users in ways that echo theories of human learning. The study also discusses latent hazards of modeling user psychology and possible directions for future inquiry.
Abstract
Recent research shows that Large Language Models (LLMs) exhibit a compelling level of proficiency in Theory of Mind (ToM) tasks. This ability to impute unobservable mental states to others is vital to human social cognition and may prove equally important in principal-agent relations between individual humans and Artificial Intelligences (AIs). In this paper, we explore how a mechanism studied in developmental psychology known as Violation of Expectation (VoE) can be implemented to reduce errors in LLM prediction about users by leveraging emergent ToM affordances. And we introduce a \textit{metacognitive prompting} framework to apply VoE in the context of an AI tutor. By storing and retrieving facts derived in cases where LLM expectation about the user was violated, we find that LLMs are able to learn about users in ways that echo theories of human learning. Finally, we discuss latent hazards and augmentative opportunities associated with modeling user psychology and propose ways to mitigate risk along with possible directions for future inquiry.
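To make the VoE loop concrete, here is a minimal sketch of the store-and-retrieve mechanism described in the abstract. The `llm` function is a hypothetical stand-in (stubbed for runnability), and the prompt wording is an assumption rather than the paper's metacognitive prompting template:

```python
# Minimal sketch of a Violation-of-Expectation memory loop, assuming a
# hypothetical `llm(prompt) -> str` completion function (stubbed here).
def llm(prompt: str) -> str:
    return "42"  # stand-in for a real model call

memory: list[str] = []  # facts learned from violated expectations

def voe_step(question: str, user_answer: str) -> None:
    context = " ".join(memory)
    predicted = llm(
        f"Known facts about the user: {context}\n"
        f"Predict the user's answer to: {question}"
    )
    if predicted.strip().lower() != user_answer.strip().lower():
        # Expectation violated: ask the model to derive a fact that
        # explains the surprise, and store it for future predictions.
        fact = llm(
            f"You predicted '{predicted}' but the user answered "
            f"'{user_answer}' to '{question}'. State one fact about "
            f"the user that explains this."
        )
        memory.append(fact)

voe_step("What is 6 x 7?", "36")
print(memory)
```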
Why bother with geometry? On the relevance of linear decompositions of Transformer embeddings
results: The study finds that decomposition-derived indicators correlate positively with model performance, but high variability across runs suggests that the geometry reflects model-specific characteristics more than sentence-specific computations.
Abstract
A recent body of work has demonstrated that Transformer embeddings can be linearly decomposed into well-defined sums of factors, that can in turn be related to specific network inputs or components. There is however still a dearth of work studying whether these mathematical reformulations are empirically meaningful. In the present work, we study representations from machine-translation decoders using two of such embedding decomposition methods. Our results indicate that, while decomposition-derived indicators effectively correlate with model performance, variation across different runs suggests a more nuanced take on this question. The high variability of our measurements indicate that geometry reflects model-specific characteristics more than it does sentence-specific computations, and that similar training conditions do not guarantee similar vector spaces.
Jaynes Machine: The universal microstructure of deep neural networks
paper_authors: Venkat Venkatasubramanian, N. Sanjeevrajan, Manasi Khandekar
for: This paper proposes a new theory of the microstructure of deep neural networks.
methods: The paper uses a framework called statistical teleodynamics, a conceptual synthesis of statistical thermodynamics and potential game theory. The theory predicts that the connection strengths of all highly connected layers of deep neural networks are distributed lognormally ($LN(\mu, \sigma)$), and that under ideal conditions $\mu$ and $\sigma$ are the same for all layers in all networks, as the result of an arbitrage equilibrium in which all connections compete and contribute the same effective utility towards minimizing the overall loss function.
results: The paper validates these predictions against empirical data from six large-scale deep neural networks, and discusses how the results can be exploited to reduce the data, time, and computational resources needed to train large deep neural networks.
Abstract
We present a novel theory of the microstructure of deep neural networks. Using a theoretical framework called statistical teleodynamics, which is a conceptual synthesis of statistical thermodynamics and potential game theory, we predict that all highly connected layers of deep neural networks have a universal microstructure of connection strengths that is distributed lognormally ($LN({\mu}, {\sigma})$). Furthermore, under ideal conditions, the theory predicts that ${\mu}$ and ${\sigma}$ are the same for all layers in all networks. This is shown to be the result of an arbitrage equilibrium where all connections compete and contribute the same effective utility towards the minimization of the overall loss function. These surprising predictions are shown to be supported by empirical data from six large-scale deep neural networks in real life. We also discuss how these results can be exploited to reduce the amount of data, time, and computational resources needed to train large deep neural networks.
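The lognormal claim is straightforward to test on any trained dense layer. Below is a minimal sketch using scipy, with synthetic weights standing in for a real network's layer (the paper's prediction concerns trained, highly connected layers, not random ones):

```python
# Illustrative check of the paper's prediction: fit a lognormal to the
# magnitudes of a dense layer's weights. Weights here are synthetic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
weights = rng.lognormal(mean=-2.0, sigma=0.8, size=10_000)  # stand-in layer

shape, loc, scale = stats.lognorm.fit(weights, floc=0)
mu, sigma = np.log(scale), shape  # parameters of LN(mu, sigma)
print(f"fitted mu={mu:.3f}, sigma={sigma:.3f}")

# Goodness of fit via Kolmogorov-Smirnov
ks = stats.kstest(weights, "lognorm", args=(shape, loc, scale))
print(f"KS statistic={ks.statistic:.4f}")
```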
Creation Of A ChatBot Based On Natural Language Proccesing For Whatsapp
paper_authors: Valderrama Jonatan, Aguilar-Alonso Igor
for: Improve customer satisfaction and the quality of service a company provides through a WhatsApp chatbot
methods: Development of a chatbot based on natural language processing
results: Fast and accurate responses that improve the efficiency of customer service and customer satisfaction
Abstract
In the era of digital transformation, customer service is of paramount importance to the success of organizations, and to meet the growing demand for immediate responses and personalized assistance 24 hours a day, chatbots have become a promising tool to solve these problems. Currently, there are many companies that need to provide these solutions to their customers, which motivates us to study this problem and offer a suitable solution. The objective of this study is to develop a chatbot based on natural language processing to improve customer satisfaction and improve the quality of service provided by the company through WhatsApp. The solution focuses on creating a chatbot that efficiently and effectively handles user queries. A literature review related to existing chatbots has been conducted, analyzing methodological approaches, artificial intelligence techniques and quality attributes used in the implementation of chatbots. The results found highlight that chatbots based on natural language processing enable fast and accurate responses, which improves the efficiency of customer service, as chatbots contribute to customer satisfaction by providing accurate answers and quick solutions to their queries at any time. Some authors point out that artificial intelligence techniques, such as machine learning, improve the learning and adaptability of chatbots as user interactions occur, so a good choice of appropriate natural language understanding technologies is essential for optimal chatbot performance. The results of this study will provide a solid foundation for the design and development of effective chatbots for customer service, ensuring a satisfactory user experience and thus meeting the needs of the organization.
Document-Level Supervision for Multi-Aspect Sentiment Analysis Without Fine-grained Labels
for: This paper proposes a VAE-based topic modeling approach for aspect-based sentiment analysis (ABSA) that does not require fine-grained labels for aspects or sentiments.
methods: The proposed approach uses document-level supervision and leverages user-generated text with overall sentiment to detect multiple aspects in a document and reason about their contributions to the overall sentiment.
results: The approach significantly outperforms a state-of-the-art baseline on two benchmark datasets from different domains.
Abstract
Aspect-based sentiment analysis (ABSA) is a widely studied topic, most often trained through supervision from human annotations of opinionated texts. These fine-grained annotations include identifying aspects towards which a user expresses their sentiment, and their associated polarities (aspect-based sentiments). Such fine-grained annotations can be expensive and often infeasible to obtain in real-world settings. There is, however, an abundance of scenarios where user-generated text contains an overall sentiment, such as a rating of 1-5 in user reviews or user-generated feedback, which may be leveraged for this task. In this paper, we propose a VAE-based topic modeling approach that performs ABSA using document-level supervision and without requiring fine-grained labels for either aspects or sentiments. Our approach allows for the detection of multiple aspects in a document, thereby allowing for the possibility of reasoning about how sentiment expressed through multiple aspects comes together to form an observable overall document-level sentiment. We demonstrate results on two benchmark datasets from two different domains, significantly outperforming a state-of-the-art baseline.
Improving Contrastive Learning of Sentence Embeddings with Focal-InfoNCE
methods: Combines SimCSE with hard negative mining, introducing self-paced modulation terms in the contrastive objective
results: Improves sentence embeddings in terms of Spearman's correlation and representation alignment and uniformity
Abstract
The recent success of SimCSE has greatly advanced state-of-the-art sentence representations. However, the original formulation of SimCSE does not fully exploit the potential of hard negative samples in contrastive learning. This study introduces an unsupervised contrastive learning framework that combines SimCSE with hard negative mining, aiming to enhance the quality of sentence embeddings. The proposed focal-InfoNCE function introduces self-paced modulation terms in the contrastive objective, downweighting the loss associated with easy negatives and encouraging the model focusing on hard negatives. Experimentation on various STS benchmarks shows that our method improves sentence embeddings in terms of Spearman's correlation and representation alignment and uniformity.
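As an illustration of the idea, here is a minimal PyTorch sketch of a focal-modulated InfoNCE over in-batch negatives; the modulation exponent `gamma` and the exact weighting form are illustrative assumptions, not the paper's precise Focal-InfoNCE formulation:

```python
# Minimal sketch of a focal-modulated InfoNCE loss over in-batch negatives.
import torch
import torch.nn.functional as F

def focal_info_nce(z1, z2, temperature=0.05, gamma=2.0):
    """z1, z2: (B, D) embeddings of two views of the same sentences."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    sim = z1 @ z2.T / temperature          # (B, B) similarity matrix
    probs = sim.softmax(dim=-1)
    pos = probs.diagonal()                 # probability of the positive pair
    # Self-paced modulation: down-weight examples the model already handles
    # easily, focusing the loss on hard negatives.
    weight = (1.0 - pos).pow(gamma)
    return -(weight * pos.clamp_min(1e-8).log()).mean()

z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print(focal_info_nce(z1, z2))
```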
A Comparative Study of Transformer-based Neural Text Representation Techniques on Bug Triaging
results: The study finds DeBERTa to be the most effective technique across the triaging tasks of developer and component assignment, with a statistically significant performance margin over the other techniques; a qualitative analysis also shows that each technique has unique strengths suited to certain types of bug reports.
Abstract
Often, the first step in managing bug reports is related to triaging a bug to the appropriate developer who is best suited to understand, localize, and fix the target bug. Additionally, assigning a given bug to a particular part of a software project can help to expedite the fixing process. However, despite the importance of these activities, they are quite challenging, where days can be spent on the manual triaging process. Past studies have attempted to leverage the limited textual data of bug reports to train text classification models that automate this process -- to varying degrees of success. However, the textual representations and machine learning models used in prior work are limited by their expressiveness, often failing to capture nuanced textual patterns that might otherwise aid in the triaging process. Recently, large, transformer-based, pre-trained neural text representation techniques such as BERT have achieved greater performance in several natural language processing tasks. However, the potential for using these techniques to improve upon prior approaches for automated bug triaging is not well studied or understood. Therefore, in this paper we offer one of the first investigations that fine-tunes transformer-based language models for the task of bug triaging on four open source datasets, spanning a collective 53 years of development history with over 400 developers and over 150 software project components. Our study includes both a quantitative and qualitative analysis of effectiveness. Our findings illustrate that DeBERTa is the most effective technique across the triaging tasks of developer and component assignment, and the measured performance delta is statistically significant compared to other techniques. However, through our qualitative analysis, we also observe that each technique possesses unique abilities best suited to certain types of bug reports.
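For readers who want a starting point, a hedged sketch of such a setup: bug triaging framed as sequence classification over report text with Hugging Face Transformers. The model checkpoint is real; the two-example dataset and label scheme are toy placeholders, not the paper's data:

```python
# Toy sketch: fine-tuning DeBERTa for developer-assignment triaging.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

reports = Dataset.from_dict({
    "text": ["NPE in parser when config missing", "UI freezes on large file"],
    "label": [0, 1],  # developer ids; toy placeholder data
})
tok = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

def encode(batch):
    return tok(batch["text"], truncation=True, padding="max_length",
               max_length=128)

reports = reports.map(encode, batched=True)
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=2)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="triage", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=reports,
)
trainer.train()
```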
LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression
results: In long-context scenarios including single-/multi-document QA, few-shot learning, summarization, synthetic tasks, and code completion, LongLLMLingua achieves higher performance at much lower cost and latency. For example, on the NaturalQuestions benchmark it improves GPT-3.5-Turbo's performance by up to 17.1% with ~4x fewer input tokens, and compressing prompts of ~10k tokens at rates of 2x-10x speeds up end-to-end latency by 1.4x-3.8x.
Abstract
In long context scenarios, large language models (LLMs) face three main challenges: higher computational/financial cost, longer latency, and inferior performance. Some studies reveal that the performance of LLMs depends on both the density and the position of the key information (question relevant) in the input prompt. Inspired by these findings, we propose LongLLMLingua for prompt compression towards improving LLMs' perception of the key information to simultaneously address the three challenges. We conduct evaluation on a wide range of long context scenarios including single-/multi-document QA, few-shot learning, summarization, synthetic tasks, and code completion. The experimental results show that LongLLMLingua compressed prompt can derive higher performance with much less cost. The latency of the end-to-end system is also reduced. For example, on NaturalQuestions benchmark, LongLLMLingua gains a performance boost of up to 17.1% over the original prompt with ~4x fewer tokens as input to GPT-3.5-Turbo. It can derive cost savings of \$28.5 and \$27.4 per 1,000 samples from the LongBench and ZeroScrolls benchmark, respectively. Additionally, when compressing prompts of ~10k tokens at a compression rate of 2x-10x, LongLLMLingua can speed up the end-to-end latency by 1.4x-3.8x. Our code is available at https://aka.ms/LLMLingua.
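A much-simplified sketch of the underlying idea, perplexity-based token pruning with a small causal LM: tokens the LM predicts easily carry little information and can be dropped. LongLLMLingua itself adds question-aware, coarse-to-fine compression and document reordering on top of this; the sketch below is illustrative only:

```python
# Illustrative perplexity-based prompt compression with GPT-2.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def token_nll(text):
    """Per-token negative log-likelihood under the small LM."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    nll = -logprobs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return ids[0, 1:], nll[0]

def compress(text, keep_ratio=0.5):
    """Keep the most surprising (informative) tokens, drop the rest."""
    ids, nll = token_nll(text)
    k = max(1, int(keep_ratio * len(ids)))
    keep = torch.topk(nll, k).indices.sort().values
    return tok.decode(ids[keep])

prompt = "The quick brown fox jumps over the lazy dog near the old river bank."
print(compress(prompt, keep_ratio=0.5))
```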
Generating and Evaluating Tests for K-12 Students with Language Model Simulations: A Case Study on Sentence Reading Efficiency
results: The paper uses GPT-4 to generate new test items and a fine-tuned LLM to filter them against psychometric criteria. Scores on a generated test are highly correlated (r=0.93) with those of a standard test form written by human experts, and the generated tests closely match the original test's difficulty and reliability based on crowdworker responses.
Abstract
Developing an educational test can be expensive and time-consuming, as each item must be written by experts and then evaluated by collecting hundreds of student responses. Moreover, many tests require multiple distinct sets of questions administered throughout the school year to closely monitor students' progress, known as parallel tests. In this study, we focus on tests of silent sentence reading efficiency, used to assess students' reading ability over time. To generate high-quality parallel tests, we propose to fine-tune large language models (LLMs) to simulate how previous students would have responded to unseen items. With these simulated responses, we can estimate each item's difficulty and ambiguity. We first use GPT-4 to generate new test items following a list of expert-developed rules and then apply a fine-tuned LLM to filter the items based on criteria from psychological measurements. We also propose an optimal-transport-inspired technique for generating parallel tests and show the generated tests closely correspond to the original test's difficulty and reliability based on crowdworker responses. Our evaluation of a generated test with 234 students from grades 2 to 8 produces test scores highly correlated (r=0.93) to those of a standard test form written by human experts and evaluated across thousands of K-12 students.
Lemur: Harmonizing Natural Language and Code for Language Agents
paper_authors: Yiheng Xu, Hongjin Su, Chen Xing, Boyu Mi, Qian Liu, Weijia Shi, Binyuan Hui, Fan Zhou, Yitao Liu, Tianbao Xie, Zhoujun Cheng, Siheng Zhao, Lingpeng Kong, Bailin Wang, Caiming Xiong, Tao Yu
For: This work develops Lemur and Lemur-Chat, openly accessible language models intended to serve as the backbone of versatile language agents.
Methods: The models are carefully pre-trained on a code-intensive corpus and instruction fine-tuned on text and code data.
Results: Experiments show that Lemur and Lemur-Chat achieve balanced, state-of-the-art performance across diverse text and coding benchmarks among open-source models and excel at agent tasks, with Lemur-Chat significantly narrowing the gap with proprietary models on agent abilities.
Abstract
We introduce Lemur and Lemur-Chat, openly accessible language models optimized for both natural language and coding capabilities to serve as the backbone of versatile language agents. The evolution from language chat models to functional language agents demands that models not only master human interaction, reasoning, and planning but also ensure grounding in the relevant environments. This calls for a harmonious blend of language and coding capabilities in the models. Lemur and Lemur-Chat are proposed to address this necessity, demonstrating balanced proficiencies in both domains, unlike existing open-source models that tend to specialize in either. Through meticulous pre-training using a code-intensive corpus and instruction fine-tuning on text and code data, our models achieve state-of-the-art averaged performance across diverse text and coding benchmarks among open-source models. Comprehensive experiments demonstrate Lemur's superiority over existing open-source models and its proficiency across various agent tasks involving human communication, tool usage, and interaction under fully- and partially- observable environments. The harmonization between natural and programming languages enables Lemur-Chat to significantly narrow the gap with proprietary models on agent abilities, providing key insights into developing advanced open-source agents adept at reasoning, planning, and operating seamlessly across environments. https://github.com/OpenLemur/Lemur
Teaching Language Models to Hallucinate Less with Synthetic Tasks
results: SynTra reduces hallucination for two 13B-parameter LLMs on three realistic abstractive summarization tasks, using only a synthetic retrieval task for supervision. The study also finds that optimizing the system message on the synthetic task can be critical, whereas fine-tuning the entire model on the synthetic task can counterintuitively increase hallucination.
Abstract
Large language models (LLMs) frequently hallucinate on abstractive summarization tasks such as document-based question-answering, meeting summarization, and clinical report generation, even though all necessary information is included in context. However, optimizing LLMs to hallucinate less on these tasks is challenging, as hallucination is hard to efficiently evaluate at each optimization step. In this work, we show that reducing hallucination on a synthetic task can also reduce hallucination on real-world downstream tasks. Our method, SynTra, first designs a synthetic task where hallucinations are easy to elicit and measure. It next optimizes the LLM's system message via prefix-tuning on the synthetic task, and finally transfers the system message to realistic, hard-to-optimize tasks. Across three realistic abstractive summarization tasks, SynTra reduces hallucination for two 13B-parameter LLMs using only a synthetic retrieval task for supervision. We also find that optimizing the system message rather than the model weights can be critical; fine-tuning the entire model on the synthetic task can counterintuitively increase hallucination. Overall, SynTra demonstrates that the extra flexibility of working with synthetic data can help mitigate undesired behaviors in practice.
paper_authors: John X. Morris, Volodymyr Kuleshov, Vitaly Shmatikov, Alexander M. Rush
for: Investigate the problem of embedding inversion: reconstructing the full text represented in dense text embeddings.
methods: Frame the problem as controlled generation: generating text that, when re-embedded, is close to a fixed point in latent space.
results: Recover $92\%$ of $32\text{-token}$ text inputs exactly using a multi-step method that iteratively corrects and re-embeds text.
Abstract
How much private information do text embeddings reveal about the original text? We investigate the problem of embedding \textit{inversion}, reconstructing the full text represented in dense text embeddings. We frame the problem as controlled generation: generating text that, when reembedded, is close to a fixed point in latent space. We find that although a na\"ive model conditioned on the embedding performs poorly, a multi-step method that iteratively corrects and re-embeds text is able to recover $92\%$ of $32\text{-token}$ text inputs exactly. We train our model to decode text embeddings from two state-of-the-art embedding models, and also show that our model can recover important personal information (full names) from a dataset of clinical notes. Our code is available on Github: \href{https://github.com/jxmorris12/vec2text}{github.com/jxmorris12/vec2text}.
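The iterative correct-and-re-embed loop can be illustrated with a toy, fully self-contained analogue: a hashed character-trigram "embedding" and greedy hill-climbing stand in for the paper's trained encoder and corrector model. Everything here is a stand-in for the real learned components:

```python
# Toy analogue of embedding inversion via iterative re-embedding.
import numpy as np

def embed(text, dim=256):
    """Toy embedding: hashed character-trigram counts (stand-in encoder)."""
    v = np.zeros(dim)
    for i in range(len(text) - 2):
        v[hash(text[i:i + 3]) % dim] += 1
    n = np.linalg.norm(v)
    return v / n if n else v

def invert(target_vec, vocab, max_len=6, steps=50):
    """Greedy hill-climbing: edit a hypothesis so its re-embedding moves
    closer to the target (the control loop, not a learned corrector)."""
    hyp, best = [], -1.0
    for _ in range(steps):
        improved = False
        candidates = [hyp + [w] for w in vocab if len(hyp) < max_len]
        candidates += [hyp[:i] + [w] + hyp[i + 1:]
                       for i in range(len(hyp)) for w in vocab]
        for c in candidates:
            score = float(embed(" ".join(c)) @ target_vec)
            if score > best:
                best, hyp, improved = score, c, True
        if not improved:
            break
    return " ".join(hyp), best

vocab = ["the", "secret", "meeting", "is", "at", "noon", "patient", "name"]
target = embed("the secret meeting is at noon")
print(invert(target, vocab))
```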
Uni3D: Exploring Unified 3D Representation at Scale
For: This work explores scaling up unified representations of 3D objects and scenes.
Methods: Uni3D uses a 2D-initialized ViT, pre-trained end-to-end to align 3D point-cloud features with image-text aligned features; the simple architecture and pretext task let it leverage abundant 2D pre-trained models as initialization and image-text aligned models as targets, unlocking 2D models and scaling-up strategies for the 3D world.
Results: Uni3D is efficiently scaled to one billion parameters and sets new records on a broad range of 3D tasks, such as zero-shot classification, few-shot classification, open-world understanding, and part segmentation; the strong representation also enables applications such as 3D painting and retrieval in the wild.
Abstract
Scaling up representations for images or text has been extensively investigated in the past few years and has led to revolutions in learning vision and language. However, scalable representation for 3D objects and scenes is relatively unexplored. In this work, we present Uni3D, a 3D foundation model to explore the unified 3D representation at scale. Uni3D uses a 2D initialized ViT end-to-end pretrained to align the 3D point cloud features with the image-text aligned features. Via the simple architecture and pretext task, Uni3D can leverage abundant 2D pretrained models as initialization and image-text aligned models as the target, unlocking the great potential of 2D models and scaling-up strategies to the 3D world. We efficiently scale up Uni3D to one billion parameters, and set new records on a broad range of 3D tasks, such as zero-shot classification, few-shot classification, open-world understanding and part segmentation. We show that the strong Uni3D representation also enables applications such as 3D painting and retrieval in the wild. We believe that Uni3D provides a new direction for exploring both scaling up and efficiency of the representation in 3D domain.
OmniLingo: Listening- and speaking-based language learning
results: The paper presents an IPFS-based distributed data architecture and a demonstration client for listening- and speaking-based language learning applications.
Abstract
In this demo paper we present OmniLingo, an architecture for distributing data for listening- and speaking-based language learning applications and a demonstration client built using the architecture. The architecture is based on the Interplanetary Filesystem (IPFS) and puts at the forefront user sovereignty over data.
TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models
methods: The study introduces TRACE, a new benchmark of 8 distinct datasets spanning challenging tasks including domain-specific tasks, multilingual capabilities, code generation, and mathematical reasoning.
results: Experiments show that after training on TRACE, aligned LLMs exhibit significant declines in both general ability and instruction following; for example, the accuracy of llama2-chat 13B on the gsm8k dataset drops from 28.8% to 2%. This highlights the challenge of finding a suitable trade-off between performance on specific tasks and preserving LLMs' original capabilities.
Abstract
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety. However, the continual learning aspect of these aligned LLMs has been largely overlooked. Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs, owing to both their simplicity and the models' potential exposure during instruction tuning. In this paper, we introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs. TRACE consists of 8 distinct datasets spanning challenging tasks including domain-specific tasks, multilingual capabilities, code generation, and mathematical reasoning. All datasets are standardized into a unified format, allowing for effortless automatic evaluation of LLMs. Our experiments show that after training on TRACE, aligned LLMs exhibit significant declines in both general ability and instruction-following capabilities. For example, the accuracy of llama2-chat 13B on gsm8k dataset declined precipitously from 28.8\% to 2\% after training on our datasets. This highlights the challenge of finding a suitable tradeoff between achieving performance on specific tasks while preserving the original prowess of LLMs. Empirical findings suggest that tasks inherently equipped with reasoning paths contribute significantly to preserving certain capabilities of LLMs against potential declines. Motivated by this, we introduce the Reasoning-augmented Continual Learning (RCL) approach. RCL integrates task-specific cues with meta-rationales, effectively reducing catastrophic forgetting in LLMs while expediting convergence on novel tasks.
Temporally Aligning Long Audio Interviews with Questions: A Case Study in Multimodal Data Integration
results: Compared with text-based heuristics, INDENT improves R-avg by about 3%. Noisy transcripts generated with state-of-the-art ASR models for Indian languages yield better retrieval results when used in place of speech, and INDENT, trained only on Hindi data, caters to 11 Indic languages through the semantically shared text space.
Abstract
The problem of audio-to-text alignment has seen significant amount of research using complete supervision during training. However, this is typically not in the context of long audio recordings wherein the text being queried does not appear verbatim within the audio file. This work is a collaboration with a non-governmental organization called CARE India that collects long audio health surveys from young mothers residing in rural parts of Bihar, India. Given a question drawn from a questionnaire that is used to guide these surveys, we aim to locate where the question is asked within a long audio recording. This is of great value to African and Asian organizations that would otherwise have to painstakingly go through long and noisy audio recordings to locate questions (and answers) of interest. Our proposed framework, INDENT, uses a cross-attention-based model and prior information on the temporal ordering of sentences to learn speech embeddings that capture the semantics of the underlying spoken text. These learnt embeddings are used to retrieve the corresponding audio segment based on text queries at inference time. We empirically demonstrate the significant effectiveness (improvement in R-avg of about 3%) of our model over those obtained using text-based heuristics. We also show how noisy ASR, generated using state-of-the-art ASR models for Indian languages, yields better results when used in place of speech. INDENT, trained only on Hindi data is able to cater to all languages supported by the (semantically) shared text space. We illustrate this empirically on 11 Indic languages.
Learning Multiplex Embeddings on Text-rich Networks with One Text Encoder
for: Learning multiple types of relationships in text-rich networks
methods: Using one text encoder to model shared knowledge across relations, and deriving relation-specific representations with a small number of parameters per relation
results: Significantly and consistently outperforms baselines on nine downstream tasks in five networks, with high parameter efficiency.
Abstract
In real-world scenarios, texts in a network are often linked by multiple semantic relations (e.g., papers in an academic network are referenced by other publications, written by the same author, or published in the same venue), where text documents and their relations form a multiplex text-rich network. Mainstream text representation learning methods use pretrained language models (PLMs) to generate one embedding for each text unit, expecting that all types of relations between texts can be captured by these single-view embeddings. However, this presumption does not hold particularly in multiplex text-rich networks. Along another line of work, multiplex graph neural networks (GNNs) directly initialize node attributes as a feature vector for node representation learning, but they cannot fully capture the semantics of the nodes' associated texts. To bridge these gaps, we propose METERN, a new framework for learning Multiplex Embeddings on TExt-Rich Networks. In contrast to existing methods, METERN uses one text encoder to model the shared knowledge across relations and leverages a small number of parameters per relation to derive relation-specific representations. This allows the encoder to effectively capture the multiplex structures in the network while also preserving parameter efficiency. We conduct experiments on nine downstream tasks in five networks from both academic and e-commerce domains, where METERN outperforms baselines significantly and consistently. The code is available at https://github.com/PeterGriffinJin/METERN-submit.
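One plausible reading of "one encoder plus a few parameters per relation" is learned relation prior tokens prepended to the input, as sketched below in PyTorch; the exact mechanism and dimensions are assumptions for illustration, not the paper's architecture:

```python
# Illustrative sketch: shared text encoder with per-relation prior tokens.
import torch
import torch.nn as nn

class MultiplexTextEncoder(nn.Module):
    """One shared encoder; each relation contributes learned prior tokens
    prepended to the input, so relation-specific views cost few parameters."""
    def __init__(self, vocab_size, hidden=128, n_relations=4, n_prior=2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)
        self.rel_prior = nn.Parameter(
            torch.randn(n_relations, n_prior, hidden) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids, relation_id):
        x = self.tok_emb(token_ids)                      # (B, L, H)
        prior = self.rel_prior[relation_id].unsqueeze(0)
        prior = prior.expand(x.size(0), -1, -1)          # (B, P, H)
        h = self.encoder(torch.cat([prior, x], dim=1))   # (B, P+L, H)
        return h[:, 0]  # relation-conditioned embedding

enc = MultiplexTextEncoder(vocab_size=1000)
ids = torch.randint(0, 1000, (3, 16))
print(enc(ids, relation_id=2).shape)  # torch.Size([3, 128])
```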
SEER : A Knapsack approach to Exemplar Selection for In-Context HybridQA
methods: The paper proposes Selection of ExEmplars for hybrid Reasoning (SEER), which formulates exemplar selection as a Knapsack Integer Linear Program so that diversity constraints and capacity constraints can be imposed.
results: SEER outperforms previous exemplar selection methods on FinQA and TAT-QA, two real-world benchmarks for HybridQA.
Abstract
Question answering over hybrid contexts is a complex task, which requires the combination of information extracted from unstructured texts and structured tables in various ways. Recently, In-Context Learning demonstrated significant performance advances for reasoning tasks. In this paradigm, a large language model performs predictions based on a small set of supporting exemplars. The performance of In-Context Learning depends heavily on the selection procedure of the supporting exemplars, particularly in the case of HybridQA, where considering the diversity of reasoning chains and the large size of the hybrid contexts becomes crucial. In this work, we present Selection of ExEmplars for hybrid Reasoning (SEER), a novel method for selecting a set of exemplars that is both representative and diverse. The key novelty of SEER is that it formulates exemplar selection as a Knapsack Integer Linear Program. The Knapsack framework provides the flexibility to incorporate diversity constraints that prioritize exemplars with desirable attributes, and capacity constraints that ensure that the prompt size respects the provided capacity budgets. The effectiveness of SEER is demonstrated on FinQA and TAT-QA, two real-world benchmarks for HybridQA, where it outperforms previous exemplar selection methods.
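Dropping the diversity terms, the capacity-constrained core of the selection problem is a 0/1 knapsack: maximise the total relevance of chosen exemplars under a prompt-token budget. A minimal dynamic-programming sketch follows (SEER itself solves an ILP with additional diversity constraints; the candidate scores here are made up):

```python
def select_exemplars(cands, budget):
    """0/1 knapsack over candidate exemplars: maximise total relevance
    subject to a prompt-token budget. cands: list of (tokens, relevance)."""
    n = len(cands)
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i, (w, v) in enumerate(cands, 1):
        for c in range(budget + 1):
            best[i][c] = best[i - 1][c]
            if w <= c:
                best[i][c] = max(best[i][c], best[i - 1][c - w] + v)
    # Backtrack to recover the chosen set.
    chosen, c = [], budget
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= cands[i - 1][0]
    return sorted(chosen)

cands = [(120, 0.9), (80, 0.7), (200, 0.95), (60, 0.4)]
print(select_exemplars(cands, budget=300))  # -> [0, 1, 3]
```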
Making Large Language Models Perform Better in Knowledge Graph Completion
results: Extensive experiments and in-depth analysis of these structure-aware LLM-based KGC methods show that introducing structural information improves the LLMs' knowledge reasoning ability.
Abstract
Large language model (LLM) based knowledge graph completion (KGC) aims to predict the missing triples in KGs with LLMs and enrich the KGs to become better web infrastructure, which can benefit many web-based automatic services. However, research on LLM-based KGC is limited and lacks effective utilization of LLMs' inference capabilities, ignoring the important structural information in KGs and preventing LLMs from acquiring accurate factual knowledge. In this paper, we discuss how to incorporate the helpful KG structural information into the LLMs, aiming to achieve structure-aware reasoning in the LLMs. We first transfer the existing LLM paradigms to structure-aware settings and further propose a knowledge prefix adapter (KoPA) to fulfill this stated goal. KoPA employs structural embedding pre-training to capture the structural information of entities and relations in the KG. KoPA then informs the LLMs via the knowledge prefix adapter, which projects the structural embeddings into the textual space and obtains virtual knowledge tokens as a prefix of the input prompt. We conduct comprehensive experiments on these structure-aware LLM-based KGC methods and provide an in-depth analysis comparing how the introduction of structural information improves LLMs' knowledge reasoning ability. Our code is released at https://github.com/zjukg/KoPA.
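A minimal sketch of the prefix-adapter idea: pre-trained KG structural embeddings are projected into the LM's token-embedding space and prepended as virtual knowledge tokens. The dimensions and linear adapter below are illustrative assumptions, not the paper's exact configuration:

```python
# Minimal sketch of a knowledge prefix adapter (PyTorch).
import torch
import torch.nn as nn

KG_DIM, LM_DIM = 200, 4096  # structural embedding dim, LM hidden size

class KnowledgePrefixAdapter(nn.Module):
    def __init__(self, kg_dim=KG_DIM, lm_dim=LM_DIM):
        super().__init__()
        self.proj = nn.Linear(kg_dim, lm_dim)

    def forward(self, head_emb, rel_emb, tail_emb, token_embs):
        # (B, kg_dim) each -> (B, 3, lm_dim) virtual knowledge tokens
        triple = torch.stack([head_emb, rel_emb, tail_emb], dim=1)
        prefix = self.proj(triple)
        # Prepend to the textual prompt's token embeddings (B, L, lm_dim).
        return torch.cat([prefix, token_embs], dim=1)

adapter = KnowledgePrefixAdapter()
h = r = t = torch.randn(2, KG_DIM)
prompt_embs = torch.randn(2, 32, LM_DIM)
print(adapter(h, r, t, prompt_embs).shape)  # torch.Size([2, 35, 4096])
```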
Self-Supervised Representation Learning for Online Handwriting Text Classification
methods: The study uses Part of Stroke Masking (POSM) as the pretext task for pre-training models, and proposes two pipelines for fine-tuning the pre-trained models.
results: Evaluated with both intrinsic and extrinsic methods, the fine-tuned pre-trained models achieve state-of-the-art results on tasks such as writer identification, gender classification, and handedness classification, highlighting the superiority of the pre-trained models over models trained from scratch.
Abstract
Self-supervised learning offers an efficient way of extracting rich representations from various types of unlabeled data while avoiding the cost of annotating large-scale datasets. This is achievable by designing a pretext task to form pseudo labels with respect to the modality and domain of the data. Given the evolving applications of online handwritten texts, in this study, we propose the novel Part of Stroke Masking (POSM) as a pretext task for pretraining models to extract informative representations from the online handwriting of individuals in English and Chinese languages, along with two suggested pipelines for fine-tuning the pretrained models. To evaluate the quality of the extracted representations, we use both intrinsic and extrinsic evaluation methods. The pretrained models are fine-tuned to achieve state-of-the-art results in tasks such as writer identification, gender classification, and handedness classification, also highlighting the superiority of utilizing the pretrained models over the models trained from scratch.
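A toy sketch of what a part-of-stroke masking pretext task could look like on an online-handwriting sequence; the span length, masking value, and data layout are assumptions, not the paper's exact design:

```python
# Toy part-of-stroke masking: hide a contiguous span within one stroke,
# which a model would then be trained to reconstruct.
import numpy as np

rng = np.random.default_rng(0)

def mask_part_of_stroke(strokes, mask_ratio=0.3, mask_value=0.0):
    """strokes: list of (T_i, 2) arrays of (x, y) points. Returns a masked
    copy, the (stroke, start, end) of the span, and the span as target."""
    s = rng.integers(len(strokes))
    pts = strokes[s].copy()
    span = max(1, int(mask_ratio * len(pts)))
    start = rng.integers(0, len(pts) - span + 1)
    target = pts[start:start + span].copy()
    pts[start:start + span] = mask_value
    masked = [p.copy() for p in strokes]
    masked[s] = pts
    return masked, (s, start, start + span), target

strokes = [rng.normal(size=(20, 2)), rng.normal(size=(15, 2))]
masked, where, target = mask_part_of_stroke(strokes)
print(where, target.shape)
```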
What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models
For: This paper aims to benchmark the counterfactual reasoning ability of multi-modal large language models.
Methods: The authors take question-answer pairs from the VQAv2 dataset, add a counterfactual presupposition to each question, and generate counterfactual questions and answers with ChatGPT; all generated pairs are manually examined for correctness.
Results: Recent vision-language models evaluated on the newly collected test set all exhibit a large performance drop compared with questions lacking the counterfactual presupposition, indicating room for improving vision-language models; the authors also find a large gap between GPT-4 and current open-source models.
Abstract
Counterfactual reasoning ability is one of the core abilities of human intelligence. This reasoning process involves the processing of alternatives to observed states or past events, and this process can improve our ability for planning and decision-making. In this work, we focus on benchmarking the counterfactual reasoning ability of multi-modal large language models. We take the question and answer pairs from the VQAv2 dataset and add one counterfactual presupposition to the questions, with the answer being modified accordingly. After generating counterfactual questions and answers using ChatGPT, we manually examine all generated questions and answers to ensure correctness. Over 2k counterfactual question and answer pairs are collected this way. We evaluate recent vision language models on our newly collected test dataset and found that all models exhibit a large performance drop compared to the results tested on questions without the counterfactual presupposition. This result indicates that there still exists space for developing vision language models. Apart from the vision language models, our proposed dataset can also serves as a benchmark for evaluating the ability of code generation LLMs, results demonstrate a large gap between GPT-4 and current open-source models. Our code and dataset are available at \url{https://github.com/Letian2003/C-VQA}.
No Pitch Left Behind: Addressing Gender Unbalance in Automatic Speech Recognition through Pitch Manipulation
results: The technique yields a relative WER improvement of up to 9.87% for utterances by female speakers, with larger gains for the least-represented fundamental-frequency ranges.
Abstract
Automatic speech recognition (ASR) systems are known to be sensitive to the sociolinguistic variability of speech data, in which gender plays a crucial role. This can result in disparities in recognition accuracy between male and female speakers, primarily due to the under-representation of the latter group in the training data. While in the context of hybrid ASR models several solutions have been proposed, the gender bias issue has not been explicitly addressed in end-to-end neural architectures. To fill this gap, we propose a data augmentation technique that manipulates the fundamental frequency (f0) and formants. This technique reduces the data unbalance among genders by simulating voices of the under-represented female speakers and increases the variability within each gender group. Experiments on spontaneous English speech show that our technique yields a relative WER improvement up to 9.87% for utterances by female speakers, with larger gains for the least-represented f0 ranges.
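A hedged sketch of the f0 side of such augmentation using librosa's pitch shifting; the semitone range is an illustrative choice, and formant manipulation (also used in the paper) would require a vocoder-based pipeline such as WORLD, not shown here:

```python
# Illustrative f0 augmentation: shift pitch upward to simulate voices in
# under-represented fundamental-frequency ranges.
import numpy as np
import librosa

sr = 16_000
t = np.linspace(0, 1.0, sr, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 220.0 * t)  # synthetic stand-in utterance

def augment_pitch(y, sr, semitone_range=(2, 5),
                  rng=np.random.default_rng(0)):
    n_steps = rng.uniform(*semitone_range)
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

y_aug = augment_pitch(y, sr)
print(y_aug.shape)
```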
FTFT: efficient and robust Fine-Tuning by transFerring Training dynamics
methods: Uses the Data Map (DM) method: fine-tune a reference model on the original training set, select a specified fraction of important training examples according to the reference model's training dynamics, and fine-tune the main model on the selected examples.
results: Compared with conventional Empirical Risk Minimization (ERM), Fine-Tuning by transFerring Training dynamics (FTFT) achieves better generalization robustness while spending less than half of the training cost.
Abstract
Despite the massive success of fine-tuning large Pre-trained Language Models (PLMs) on a wide range of Natural Language Processing (NLP) tasks, they remain susceptible to out-of-distribution (OOD) and adversarial inputs. Data map (DM) is a simple yet effective dual-model approach that enhances the robustness of fine-tuned PLMs, which involves fine-tuning a model on the original training set (i.e. reference model), selecting a specified fraction of important training examples according to the training dynamics of the reference model, and fine-tuning the same model on these selected examples (i.e. main model). However, it suffers from the drawback of requiring fine-tuning the same model twice, which is computationally expensive for large models. In this paper, we first show that 1) training dynamics are highly transferable across different model sizes and different pre-training methods, and that 2) main models fine-tuned using DM learn faster than when using conventional Empirical Risk Minimization (ERM). Building on these observations, we propose a novel fine-tuning approach based on the DM method: Fine-Tuning by transFerring Training dynamics (FTFT). Compared with DM, FTFT uses more efficient reference models and then fine-tunes more capable main models for fewer steps. Our experiments show that FTFT achieves better generalization robustness than ERM while spending less than half of the training cost.
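The data-map selection step can be sketched in a few lines: log the gold-label probability of each training example across epochs, then select by confidence and variability. The logged probabilities below are synthetic stand-ins for what would be recorded while fine-tuning the reference model:

```python
# Sketch of data-map training dynamics and example selection.
import numpy as np

rng = np.random.default_rng(0)
n_examples, n_epochs = 1000, 5
gold_probs = rng.beta(2, 2, size=(n_epochs, n_examples))  # stand-in logs

confidence = gold_probs.mean(axis=0)   # mean gold-label probability
variability = gold_probs.std(axis=0)   # its std across epochs

# Select the "ambiguous" region (high variability), which data-map work
# associates with examples that most improve robustness.
k = int(0.33 * n_examples)
selected = np.argsort(-variability)[:k]
print(selected[:10], confidence[selected].mean())
```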
AutoCycle-VC: Towards Bottleneck-Independent Zero-Shot Cross-Lingual Voice Conversion
results: The model outperforms previous state-of-the-art results in both subjective and objective evaluations, and additionally enables cross-lingual voice conversion and improves the quality of synthesized speech.
Abstract
This paper proposes a simple and robust zero-shot voice conversion system with a cycle structure and mel-spectrogram pre-processing. Previous works suffer from information loss and poor synthesis quality due to their reliance on a carefully designed bottleneck structure. Moreover, models relying solely on self-reconstruction loss struggled with reproducing different speakers' voices. To address these issues, we suggested a cycle-consistency loss that considers conversion back and forth between target and source speakers. Additionally, stacked random-shuffled mel-spectrograms and a label smoothing method are utilized during speaker encoder training to extract a time-independent global speaker representation from speech, which is the key to a zero-shot conversion. Our model outperforms existing state-of-the-art results in both subjective and objective evaluations. Furthermore, it facilitates cross-lingual voice conversions and enhances the quality of synthesized speech.
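A minimal PyTorch sketch of the cycle-consistency idea: convert to the target speaker, convert back, and penalise the reconstruction error. The converter below is a stub; the real system operates on mel-spectrograms with a learned speaker encoder:

```python
# Illustrative cycle-consistency loss for voice conversion.
import torch
import torch.nn as nn

class StubConverter(nn.Module):
    def __init__(self, n_mels=80, spk_dim=64):
        super().__init__()
        self.net = nn.Linear(n_mels + spk_dim, n_mels)

    def forward(self, mel, spk):  # mel: (B, T, n_mels), spk: (B, spk_dim)
        s = spk.unsqueeze(1).expand(-1, mel.size(1), -1)
        return self.net(torch.cat([mel, s], dim=-1))

G = StubConverter()
mel_src = torch.randn(4, 100, 80)
spk_src, spk_tgt = torch.randn(4, 64), torch.randn(4, 64)

converted = G(mel_src, spk_tgt)   # source content, target voice
cycled = G(converted, spk_src)    # convert back to the source voice
cycle_loss = nn.functional.l1_loss(cycled, mel_src)
print(cycle_loss)
```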
EmoTwiCS: A Corpus for Modelling Emotion Trajectories in Dutch Customer Service Dialogues on Twitter
paper_authors: Sofie Labat, Thomas Demeester, Véronique Hoste
for: This paper provides a corpus of customer service dialogues on social media annotated for emotions, to support automatic emotion detection on these platforms.
methods: Collection and annotation of Dutch customer service dialogues on Twitter, with the emotions expressed in the conversations categorized and rated.
results: A high-quality dataset of emotion trajectories, together with several analyses of the dataset and its applications.
Abstract
Due to the rise of user-generated content, social media is increasingly adopted as a channel to deliver customer service. Given the public character of these online platforms, the automatic detection of emotions forms an important application in monitoring customer satisfaction and preventing negative word-of-mouth. This paper introduces EmoTwiCS, a corpus of 9,489 Dutch customer service dialogues on Twitter that are annotated for emotion trajectories. In our business-oriented corpus, we view emotions as dynamic attributes of the customer that can change at each utterance of the conversation. The term `emotion trajectory' refers therefore not only to the fine-grained emotions experienced by customers (annotated with 28 labels and valence-arousal-dominance scores), but also to the event happening prior to the conversation and the responses made by the human operator (both annotated with 8 categories). Inter-annotator agreement (IAA) scores on the resulting dataset are substantial and comparable with related research, underscoring its high quality. Given the interplay between the different layers of annotated information, we perform several in-depth analyses to investigate (i) static emotions in isolated tweets, (ii) dynamic emotions and their shifts in trajectory, and (iii) the role of causes and response strategies in emotion trajectories. We conclude by listing the advantages and limitations of our dataset, after which we give some suggestions on the different types of predictive modelling tasks and open research questions to which EmoTwiCS can be applied. The dataset is available upon request and will be made publicly available upon acceptance of the paper.
Toward Semantic Publishing in Non-Invasive Brain Stimulation: A Comprehensive Analysis of rTMS Studies
methods: A large-scale systematic review of 600 repetitive transcranial magnetic stimulation (rTMS) studies, describing key properties that allow structured descriptions and comparisons of the studies.
results: The paper implements a FAIR Semantic Web resource-based publishing paradigm for the 600 reviewed rTMS studies in the Open Research Knowledge Graph.
Abstract
Noninvasive brain stimulation (NIBS) encompasses transcranial stimulation techniques that can influence brain excitability. These techniques have the potential to treat conditions like depression, anxiety, and chronic pain, and to provide insights into brain function. However, a lack of standardized reporting practices limits its reproducibility and full clinical potential. This paper aims to foster interdisciplinarity toward adopting Computer Science Semantic reporting methods for the standardized documentation of Neuroscience NIBS studies, making them explicitly Findable, Accessible, Interoperable, and Reusable (FAIR). In a large-scale systematic review of 600 repetitive transcranial magnetic stimulation (rTMS) studies, a subarea of NIBS, and their dosages, we describe key properties that allow for structured descriptions and comparisons of the studies. This paper showcases the semantic publishing of NIBS in the ecosphere of knowledge-graph-based next-generation scholarly digital libraries. Specifically, the FAIR Semantic Web resource(s)-based publishing paradigm is implemented for the 600 reviewed rTMS studies in the Open Research Knowledge Graph.
The Limits of ChatGPT in Extracting Aspect-Category-Opinion-Sentiment Quadruples: A Comparative Analysis
results: The authors compare ChatGPT against existing state-of-the-art quadruple extraction models on four public datasets. They find that ChatGPT underperforms on this task overall, although it shows promising capability in some settings.
Abstract
Recently, ChatGPT has attracted great attention from both industry and academia due to its surprising abilities in natural language understanding and generation. We are particularly curious about whether it can achieve promising performance on one of the most complex tasks in aspect-based sentiment analysis, i.e., extracting aspect-category-opinion-sentiment quadruples from texts. To this end, in this paper we develop a specialized prompt template that enables ChatGPT to effectively tackle this complex quadruple extraction task. Further, we propose a selection method on few-shot examples to fully exploit the in-context learning ability of ChatGPT and uplift its effectiveness on this complex task. Finally, we provide a comparative evaluation on ChatGPT against existing state-of-the-art quadruple extraction models based on four public datasets and highlight some important findings regarding the capability boundaries of ChatGPT in the quadruple extraction.
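As an illustration of the kind of specialized prompt the paper studies, here is a hypothetical template for quadruple extraction; the wording and few-shot format are assumptions, not the paper's exact template:

```python
# Hypothetical few-shot prompt template for ACOS quadruple extraction.
TEMPLATE = """Extract all (aspect, category, opinion, sentiment) quadruples
from the review. Use "NULL" for implicit aspects or opinions.

Review: The battery life is amazing but the screen scratches easily.
Quadruples: (battery life, battery#general, amazing, positive);
(screen, display#quality, scratches easily, negative)

Review: {review}
Quadruples:"""

print(TEMPLATE.format(review="Great pasta, terrible service."))
```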
A New Benchmark and Reverse Validation Method for Passage-level Hallucination Detection
results: The method considerably outperforms baseline zero-resource detection methods on two datasets while costing fewer tokens and less time. A manual analysis of hallucination cases the LLM failed to capture also reveals shared limitations of zero-resource methods.
Abstract
Large Language Models (LLMs) have shown their ability to collaborate effectively with humans in real-world scenarios. However, LLMs are apt to generate hallucinations, i.e., makeup incorrect text and unverified information, which can cause significant damage when deployed for mission-critical tasks. In this paper, we propose a self-check approach based on reverse validation to detect factual errors automatically in a zero-resource fashion. To facilitate future studies and assess different methods, we construct a hallucination detection benchmark named PHD, which is generated by ChatGPT and annotated by human annotators. Contrasting previous studies of zero-resource hallucination detection, our method and benchmark concentrate on passage-level detection instead of sentence-level. We empirically evaluate our method and existing zero-resource detection methods on two datasets. The experimental results demonstrate that the proposed method considerably outperforms the baselines while costing fewer tokens and less time. Furthermore, we manually analyze some hallucination cases that LLM failed to capture, revealing the shared limitation of zero-resource methods.
SpikeCLIP: A Contrastive Language-Image Pretrained Spiking Neural Network
For: This paper extends spiking neural networks (SNNs) to multimodal scenarios through a new framework named SpikeCLIP, improving the effectiveness of spike-based computing in multimodal settings.
Methods: A two-step recipe of "Alignment Pre-training" followed by "Dual-Loss Fine-tuning", aligning spike-based representations with those of deep neural networks (DNNs).
Results: Experiments show that SpikeCLIP achieves results comparable to its DNN counterparts on datasets commonly used for multimodal model evaluation while significantly reducing energy consumption, and it maintains robust performance in image classification tasks involving class labels not predefined within specific categories.
Abstract
Spiking neural networks (SNNs) have demonstrated the capability to achieve comparable performance to deep neural networks (DNNs) in both visual and linguistic domains while offering the advantages of improved energy efficiency and adherence to biological plausibility. However, the extension of such single-modality SNNs into the realm of multimodal scenarios remains an unexplored territory. Drawing inspiration from the concept of contrastive language-image pre-training (CLIP), we introduce a novel framework, named SpikeCLIP, to address the gap between two modalities within the context of spike-based computing through a two-step recipe involving ``Alignment Pre-training + Dual-Loss Fine-tuning". Extensive experiments demonstrate that SNNs achieve comparable results to their DNN counterparts while significantly reducing energy consumption across a variety of datasets commonly used for multimodal model evaluation. Furthermore, SpikeCLIP maintains robust performance in image classification tasks that involve class labels not predefined within specific categories.
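The sketch below illustrates the dual-loss step in plain PyTorch, assuming a frozen CLIP-style teacher embedding and an ordinary MLP standing in for the spiking encoder; the actual model uses SNN layers and its own loss weighting.

```python
# Sketch: task loss + alignment loss toward a frozen CLIP-style teacher.
# The MLP stands in for the SNN encoder; alpha is an illustrative weight.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StudentEncoder(nn.Module):          # stand-in for the SNN image encoder
    def __init__(self, dim_in=512, dim_out=256, n_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(dim_in, dim_out), nn.ReLU())
        self.head = nn.Linear(dim_out, n_classes)

    def forward(self, x):
        z = self.backbone(x)
        return z, self.head(z)

def dual_loss(student, x, labels, teacher_emb, alpha=0.5):
    z, logits = student(x)
    task = F.cross_entropy(logits, labels)                      # task loss
    align = 1 - F.cosine_similarity(z, teacher_emb, dim=-1).mean()  # alignment
    return task + alpha * align

student = StudentEncoder()
x = torch.randn(8, 512)                   # dummy image features
labels = torch.randint(0, 10, (8,))
teacher_emb = torch.randn(8, 256)         # frozen teacher embeddings (assumed)
loss = dual_loss(student, x, labels, teacher_emb)
loss.backward()
```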
Multilingual Jailbreak Challenges in Large Language Models
for: This paper aims to address the safety concerns associated with large language models (LLMs) in the multilingual context, specifically the “jailbreak” problem where malicious instructions can manipulate LLMs to exhibit undesirable behavior.
methods: The paper reveals the presence of multilingual jailbreak challenges within LLMs and considers two potential risk scenarios: unintentional and intentional. The authors experimentally demonstrate that low-resource languages are more susceptible to unsafe content generation, and propose a novel \textsc{Self-Defense} framework for safety fine-tuning.
results: The paper shows that in the unintentional scenario, low-resource languages are about three times more likely to produce unsafe content than high-resource languages; in the intentional scenario, multilingual prompts drive unsafe output rates up to 80.92% for ChatGPT and 40.71% for GPT-4; and fine-tuning ChatGPT with data generated by the proposed \textsc{Self-Defense} framework achieves a substantial reduction in unsafe content generation.Abstract
While large language models (LLMs) exhibit remarkable capabilities across a wide range of tasks, they pose potential safety concerns, such as the ``jailbreak'' problem, wherein malicious instructions can manipulate LLMs to exhibit undesirable behavior. Although several preventive measures have been developed to mitigate the potential risks associated with LLMs, they have primarily focused on English data. In this study, we reveal the presence of multilingual jailbreak challenges within LLMs and consider two potential risk scenarios: unintentional and intentional. The unintentional scenario involves users querying LLMs using non-English prompts and inadvertently bypassing the safety mechanisms, while the intentional scenario concerns malicious users combining malicious instructions with multilingual prompts to deliberately attack LLMs. The experimental results reveal that in the unintentional scenario, the rate of unsafe content increases as the availability of languages decreases. Specifically, low-resource languages exhibit three times the likelihood of encountering harmful content compared to high-resource languages, with both ChatGPT and GPT-4. In the intentional scenario, multilingual prompts can exacerbate the negative impact of malicious instructions, with astonishingly high rates of unsafe output: 80.92\% for ChatGPT and 40.71\% for GPT-4. To handle such a challenge in the multilingual context, we propose a novel \textsc{Self-Defense} framework that automatically generates multilingual training data for safety fine-tuning. Experimental results show that ChatGPT fine-tuned with such data can achieve a substantial reduction in unsafe content generation. Data is available at https://github.com/DAMO-NLP-SG/multilingual-safety-for-LLMs. Warning: This paper contains examples with potentially harmful content.
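The following sketch shows one way the data-generation step could be realized, assuming `translate` and `llm` interfaces; the paper's actual Self-Defense pipeline may construct and filter the corpus differently.

```python
# Sketch: building multilingual safety fine-tuning pairs. `translate` and
# `llm` are assumed interfaces, not the paper's actual components.
def build_safety_corpus(seed_prompts, languages, translate, llm):
    corpus = []
    for prompt in seed_prompts:
        for lang in languages:
            localized = translate(prompt, target=lang)
            # Ask the model for a refusal in the same language as the prompt.
            refusal = llm(f"Reply in {lang} with a brief, polite refusal to: "
                          f"{localized}")
            corpus.append({"prompt": localized, "response": refusal,
                           "language": lang})
    return corpus  # pairs usable for supervised safety fine-tuning
```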
Cultural Compass: Predicting Transfer Learning Success in Offensive Language Detection with Cultural Features
results: The study finds that cultural value surveys indeed possess predictive power for cross-cultural transfer learning success in OLD tasks, and that this predictive power can be further improved using offensive word distance.Abstract
The increasing ubiquity of language technology necessitates a shift towards considering cultural diversity in the machine learning realm, particularly for subjective tasks that rely heavily on cultural nuances, such as Offensive Language Detection (OLD). Current understanding underscores that these tasks are substantially influenced by cultural values; however, a notable gap exists in determining whether cultural features can accurately predict the success of cross-cultural transfer learning for such subjective tasks. Addressing this, our study delves into the intersection of cultural features and transfer learning effectiveness. The findings reveal that cultural value surveys indeed possess a predictive power for cross-cultural transfer learning success in OLD tasks and that it can be further improved using offensive word distance. Based on these results, we advocate for the integration of cultural information into datasets. Additionally, we recommend leveraging data sources rich in cultural information, such as surveys, to enhance cultural adaptability. Our research signifies a step forward in the quest for more inclusive, culturally sensitive language technologies.
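A minimal sketch of the predictive setup follows, assuming culture-pair features built from absolute differences of survey dimensions plus an offensive-word distance, with toy data and a linear model; the paper's feature construction may differ.

```python
# Sketch: predicting cross-cultural transfer success from cultural features.
# Feature choices and all data below are illustrative stand-ins.
import numpy as np
from sklearn.linear_model import LinearRegression

def pair_features(src_survey, tgt_survey, offensive_dist):
    """One feature row per (source, target) culture pair."""
    return np.append(np.abs(src_survey - tgt_survey), offensive_dist)

rng = np.random.default_rng(0)
surveys = rng.normal(size=(6, 5))          # 6 cultures x 5 survey dimensions
X, y = [], []
for i in range(6):
    for j in range(6):
        if i == j:
            continue
        off_dist = rng.uniform()           # stand-in offensive-word distance
        X.append(pair_features(surveys[i], surveys[j], off_dist))
        y.append(rng.uniform(0.5, 0.9))    # stand-in transfer F1
model = LinearRegression().fit(np.array(X), np.array(y))
print("R^2 on training pairs:", model.score(np.array(X), np.array(y)))
```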
MemSum-DQA: Adapting An Efficient Long Document Extractive Summarizer for Document Question Answering
results: Compared with prior baselines, MemSum-DQA improves exact-match accuracy by 9% on full-document answering tasks. It also excels at questions involving child-relationship understanding, indicating the potential of extractive summarization techniques for DQA tasks.Abstract
We introduce MemSum-DQA, an efficient system for document question answering (DQA) that leverages MemSum, a long document extractive summarizer. By prefixing each text block in the parsed document with the provided question and question type, MemSum-DQA selectively extracts text blocks as answers from documents. On full-document answering tasks, this approach yields a 9% improvement in exact match accuracy over prior state-of-the-art baselines. Notably, MemSum-DQA excels in addressing questions related to child-relationship understanding, underscoring the potential of extractive summarization techniques for DQA tasks.
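The input construction is simple enough to sketch directly; the separator token and block format below are assumptions for illustration.

```python
# Sketch: prefixing each parsed text block with the question and its type
# before extractive selection. Separator tokens are assumed.
def prefix_blocks(question, question_type, blocks):
    return [f"[{question_type}] {question} </s> {block}" for block in blocks]

blocks = ["Section 1.2 Warranty ...", "Table 3: Parts list ..."]
for b in prefix_blocks("What does the warranty cover?", "yes/no", blocks):
    print(b)
```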
Humans and language models diverge when predicting repeating text
results: The study finds that human and language model predictions are strongly aligned on the first presentation of a text span, but that their performance diverges quickly once memory (or in-context learning) begins to play a role. The divergence is traced to specific attention heads in a middle layer; adding a power-law recency bias to these heads yields a model that behaves much more like humans.Abstract
Language models that are trained on the next-word prediction task have been shown to accurately model human behavior in word prediction and reading speed. In contrast with these findings, we present a scenario in which the performance of humans and LMs diverges. We collected a dataset of human next-word predictions for five stimuli that are formed by repeating spans of text. Human and GPT-2 LM predictions are strongly aligned in the first presentation of a text span, but their performance quickly diverges when memory (or in-context learning) begins to play a role. We traced the cause of this divergence to specific attention heads in a middle layer. Adding a power-law recency bias to these attention heads yielded a model that performs much more similarly to humans. We hope that this scenario will spur future work in bringing LMs closer to human behavior.
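A minimal PyTorch sketch of a power-law recency bias follows: offsetting pre-softmax scores by -alpha * log(distance) makes attention weights decay as distance^(-alpha). The single-head setting and the alpha value are illustrative.

```python
# Sketch: single-head causal attention with a power-law recency bias.
import torch

def recency_biased_attention(q, k, v, alpha=1.0):
    T, d = q.shape
    scores = q @ k.T / d**0.5                       # (T, T) raw scores
    pos = torch.arange(T)
    dist = (pos[:, None] - pos[None, :]).clamp(min=1).float()
    scores = scores - alpha * torch.log(dist)       # power-law decay in weights
    mask = pos[None, :] > pos[:, None]              # causal mask
    scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(6, 16)
out = recency_biased_attention(q, k, v)
print(out.shape)  # torch.Size([6, 16])
```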
Improved prompting and process for writing user personas with LLMs, using qualitative interviews: Capturing behaviour and personality traits of users
paper_authors: Stefano De Paoli
for: The paper aims to present a workflow for creating user personas using large language models, specifically through the results of thematic analysis of qualitative interviews.
methods: The proposed workflow utilizes improved prompting and a larger pool of themes compared to previous work by the author, made possible by the capabilities of a recently released large language model (GPT3.5-Turbo-16k) and refined prompting for creating personas.
results: The paper discusses the improved workflow for creating personas and offers reflections on the relationship between the proposed process and existing approaches to personas, as well as the capacity of LLMs to capture user behaviors and personality traits from the underlying dataset of qualitative interviews used for analysis.Abstract
This draft paper presents a workflow for creating User Personas with Large Language Models, using the results of a Thematic Analysis of qualitative interviews. The proposed workflow uses improved prompting and a larger pool of Themes, compared to previous work conducted by the author for the same task. This is possible due to the capabilities of a recently released LLM which allows the processing of 16 thousand tokens (GPT3.5-Turbo-16k) and also due to the possibility to offer a refined prompting for the creation of Personas. The paper offers details of performing Phase 2 and 3 of Thematic Analysis, and then discusses the improved workflow for creating Personas. The paper also offers some reflections on the relationship between the proposed process and existing approaches to Personas such as the data-driven and qualitative Personas. Moreover, the paper offers reflections on the capacity of LLMs to capture user behaviours and personality traits, from the underlying dataset of qualitative interviews used for the analysis.
Rethinking Model Selection and Decoding for Keyphrase Generation with Pre-trained Sequence-to-Sequence Models
results: The study finds that, when selecting a PLM, merely increasing model size or performing task-specific adaptation is not parameter-efficient; for decoding, greedy search achieves strong F1 scores but lags behind sampling-based methods in recall. Based on these findings, the study proposes DeSel, a likelihood-based decode-select algorithm that improves on greedy search.Abstract
Keyphrase Generation (KPG) is a longstanding task in NLP with widespread applications. The advent of sequence-to-sequence (seq2seq) pre-trained language models (PLMs) has ushered in a transformative era for KPG, yielding promising performance improvements. However, many design decisions remain unexplored and are often made arbitrarily. This paper undertakes a systematic analysis of the influence of model selection and decoding strategies on PLM-based KPG. We begin by elucidating why seq2seq PLMs are apt for KPG, anchored by an attention-driven hypothesis. We then establish that conventional wisdom for selecting seq2seq PLMs lacks depth: (1) merely increasing model size or performing task-specific adaptation is not parameter-efficient; (2) although combining in-domain pre-training with task adaptation benefits KPG, it does partially hinder generalization. Regarding decoding, we demonstrate that while greedy search achieves strong F1 scores, it lags in recall compared with sampling-based methods. Based on these insights, we propose DeSel, a likelihood-based decode-select algorithm for seq2seq PLMs. DeSel improves greedy search by an average of 4.7% semantic F1 across five datasets. Our collective findings pave the way for deeper future investigations into PLM-based KPG.
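A minimal sketch of the decode-select loop follows, assuming `sample_candidates` and `sequence_logprob` interfaces over a seq2seq PLM; DeSel's actual selection rule is richer than this plain likelihood argmax.

```python
# Sketch: sample several candidate keyphrase sequences, keep the most likely.
# `sample_candidates` and `sequence_logprob` are assumed interfaces.
def decode_select(src, sample_candidates, sequence_logprob, n=10):
    candidates = sample_candidates(src, n=n)          # e.g. nucleus sampling
    scored = [(sequence_logprob(src, c), c) for c in candidates]
    return max(scored, key=lambda t: t[0])            # (logprob, best sequence)
```

The design intuition is the one the abstract states: sampling broadens recall over a single greedy pass, and selecting by likelihood recovers precision at the cost of scoring n samples.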
Multi-Modal Knowledge Graph Transformer Framework for Multi-Modal Entity Alignment
results: Extensive experiments on multiple benchmark datasets show excellent entity alignment performance, outperforming strong competitors.Abstract
Multi-Modal Entity Alignment (MMEA) is a critical task that aims to identify equivalent entity pairs across multi-modal knowledge graphs (MMKGs). However, this task faces challenges due to the presence of different types of information, including neighboring entities, multi-modal attributes, and entity types. Directly incorporating the above information (e.g., concatenation or attention) can lead to an unaligned information space. To address these challenges, we propose a novel MMEA transformer, called MoAlign, that hierarchically introduces neighbor features, multi-modal attributes, and entity types to enhance the alignment task. Taking advantage of the transformer's ability to better integrate multiple information, we design a hierarchical modifiable self-attention block in a transformer encoder to preserve the unique semantics of different information. Furthermore, we design two entity-type prefix injection methods to integrate entity-type information using type prefixes, which help to restrict the global information of entities not present in the MMKGs. Our extensive experiments on benchmark datasets demonstrate that our approach outperforms strong competitors and achieves excellent entity alignment performance.
InfoCL: Alleviating Catastrophic Forgetting in Continual Text Classification from An Information Theoretic Perspective
results: Our method effectively mitigates forgetting and achieves state-of-the-art performance on three text classification tasks.Abstract
Continual learning (CL) aims to constantly learn new knowledge over time while avoiding catastrophic forgetting on old tasks. We focus on continual text classification under the class-incremental setting. Recent CL studies have identified the severe performance decrease on analogous classes as a key factor for catastrophic forgetting. In this paper, through an in-depth exploration of the representation learning process in CL, we discover that the compression effect of the information bottleneck leads to confusion on analogous classes. To enable the model learn more sufficient representations, we propose a novel replay-based continual text classification method, InfoCL. Our approach utilizes fast-slow and current-past contrastive learning to perform mutual information maximization and better recover the previously learned representations. In addition, InfoCL incorporates an adversarial memory augmentation strategy to alleviate the overfitting problem of replay. Experimental results demonstrate that InfoCL effectively mitigates forgetting and achieves state-of-the-art performance on three text classification tasks. The code is publicly available at https://github.com/Yifan-Song793/InfoCL.
A Semantic Invariant Robust Watermark for Large Language Models
results: The study shows that our method achieves strong attack robustness in semantically invariant settings, while also possessing adequate security robustness.Abstract
Watermark algorithms for large language models (LLMs) have achieved extremely high accuracy in detecting text generated by LLMs. Such algorithms typically involve adding extra watermark logits to the LLM's logits at each generation step. However, prior algorithms face a trade-off between attack robustness and security robustness. This is because the watermark logits for a token are determined by a certain number of preceding tokens; a small number leads to low security robustness, while a large number results in insufficient attack robustness. In this work, we propose a semantic invariant watermarking method for LLMs that provides both attack robustness and security robustness. The watermark logits in our work are determined by the semantics of all preceding tokens. Specifically, we utilize another embedding LLM to generate semantic embeddings for all preceding tokens, and then these semantic embeddings are transformed into the watermark logits through our trained watermark model. Subsequent analyses and experiments demonstrated the attack robustness of our method in semantically invariant settings: synonym substitution and text paraphrasing settings. Finally, we also show that our watermark possesses adequate security robustness. Our code and data are available at https://github.com/THU-BPM/Robust_Watermark.
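A minimal PyTorch sketch of the generation-side mechanism follows, assuming a mean-pooled semantic summary of the prefix and a small trained network mapping it to a logit bias; the real system uses a dedicated embedding LLM and a trained watermark model, so both the pooling and the delta scale are simplifications.

```python
# Sketch: semantics-conditioned watermark logits added to the LM's logits.
import torch
import torch.nn as nn

class WatermarkNet(nn.Module):
    def __init__(self, emb_dim, vocab_size):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(emb_dim, 256), nn.Tanh(),
                                 nn.Linear(256, vocab_size))

    def forward(self, prefix_emb):
        # Map the semantic summary of the prefix to per-token watermark logits.
        return torch.tanh(self.net(prefix_emb))

def watermarked_logits(lm_logits, prefix_token_embs, wm_net, delta=2.0):
    prefix_emb = prefix_token_embs.mean(dim=0)      # summary of all prior tokens
    return lm_logits + delta * wm_net(prefix_emb)

vocab, emb_dim = 32000, 768
wm_net = WatermarkNet(emb_dim, vocab)
lm_logits = torch.randn(vocab)
prefix_token_embs = torch.randn(12, emb_dim)        # from the embedding LLM
biased = watermarked_logits(lm_logits, prefix_token_embs, wm_net)
```

Because the bias depends on prefix semantics rather than exact token identities, synonym substitution or paraphrasing leaves the watermark signal approximately unchanged, which is the robustness property claimed above.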
Selective Demonstrations for Cross-domain Text-to-SQL
results: Experiments on two cross-domain text-to-SQL datasets show that ODIS outperforms baseline methods by 1.1 and 11.8 points in execution accuracy, respectively.Abstract
Large language models (LLMs) with in-context learning have demonstrated impressive generalization capabilities in the cross-domain text-to-SQL task, without the use of in-domain annotations. However, incorporating in-domain demonstration examples has been found to greatly enhance LLMs' performance. In this paper, we delve into the key factors within in-domain examples that contribute to the improvement and explore whether we can harness these benefits without relying on in-domain annotations. Based on our findings, we propose a demonstration selection framework ODIS which utilizes both out-of-domain examples and synthetically generated in-domain examples to construct demonstrations. By retrieving demonstrations from hybrid sources, ODIS leverages the advantages of both, showcasing its effectiveness compared to baseline methods that rely on a single data source. Furthermore, ODIS outperforms state-of-the-art approaches on two cross-domain text-to-SQL datasets, with improvements of 1.1 and 11.8 points in execution accuracy, respectively.
An experiment on an automated literature survey of data-driven speech enhancement methods
paper_authors: Arthur dos Santos, Jayr Pereira, Rodrigo Nogueira, Bruno Masiero, Shiva Sander-Tavallaey, Elias Zea
for: automating a literature survey of 116 articles on data-driven speech enhancement methods
methods: using a generative pre-trained transformer (GPT) model to conduct the literature survey automatically
results: evaluating the capabilities and limitations of the GPT model in providing accurate responses to specific queries about papers selected from a reference human-based survey.Abstract
The increasing number of scientific publications in acoustics, in general, presents difficulties in conducting traditional literature surveys. This work explores the use of a generative pre-trained transformer (GPT) model to automate a literature survey of 116 articles on data-driven speech enhancement methods. The main objective is to evaluate the capabilities and limitations of the model in providing accurate responses to specific queries about the papers selected from a reference human-based survey. While we see great potential to automate literature surveys in acoustics, improvements are needed to address technical questions more clearly and accurately.
GeoLLM: Extracting Geospatial Knowledge from Large Language Models
methods: The study proposes a new method called GeoLLM that effectively extracts geospatial knowledge from LLMs and combines it with auxiliary map data from OpenStreetMap.
results: Experimental results show that GeoLLM improves performance by 70% (measured using Pearson's $r^2$) over baselines, matching or exceeding existing satellite-based benchmarks, and that LLMs are remarkably sample-efficient and rich in spatial information.Abstract
The application of machine learning (ML) in a range of geospatial tasks is increasingly common but often relies on globally available covariates such as satellite imagery that can either be expensive or lack predictive power. Here we explore the question of whether the vast amounts of knowledge found in Internet language corpora, now compressed within large language models (LLMs), can be leveraged for geospatial prediction tasks. We first demonstrate that LLMs embed remarkable spatial information about locations, but naively querying LLMs using geographic coordinates alone is ineffective in predicting key indicators like population density. We then present GeoLLM, a novel method that can effectively extract geospatial knowledge from LLMs with auxiliary map data from OpenStreetMap. We demonstrate the utility of our approach across multiple tasks of central interest to the international community, including the measurement of population density and economic livelihoods. Across these tasks, our method demonstrates a 70% improvement in performance (measured using Pearson's $r^2$) relative to baselines that use nearest neighbors or use information directly from the prompt, and performance equal to or exceeding satellite-based benchmarks in the literature. With GeoLLM, we observe that GPT-3.5 outperforms Llama 2 and RoBERTa by 19% and 51% respectively, suggesting that the performance of our method scales well with the size of the model and its pretraining dataset. Our experiments reveal that LLMs are remarkably sample-efficient, rich in geospatial information, and robust across the globe. Crucially, GeoLLM shows promise in mitigating the limitations of existing geospatial covariates and complementing them well.
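A minimal sketch of how such a map-enriched prompt could be assembled follows, assuming `nearby_places` has already been queried from OpenStreetMap; the exact prompt wording used by GeoLLM is an assumption here.

```python
# Sketch: enriching raw coordinates with nearby map features before querying
# an LLM for a geospatial indicator. Prompt wording is illustrative.
def build_geo_prompt(lat, lon, address, nearby_places, indicator):
    places = "\n".join(f"- {name} ({dist_km:.1f} km)"
                       for name, dist_km in nearby_places)
    return (f"Coordinates: ({lat:.4f}, {lon:.4f})\n"
            f"Address: {address}\n"
            f"Nearby places:\n{places}\n\n"
            f"Rate the {indicator} at this location from 0.0 to 9.9:")

prompt = build_geo_prompt(
    -1.2921, 36.8219, "Nairobi, Kenya",
    [("market", 0.4), ("primary school", 0.9), ("bus station", 1.2)],
    "population density")
print(prompt)
```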
results: Experimental results show that density estimation using the PF ODE model is robust against high-complexity, high-likelihood attacks. Moreover, on the CIFAR-10 dataset, some adversarial samples are semantically meaningful, as expected from a robust estimator.Abstract
Beyond their impressive sampling capabilities, score-based diffusion models offer a powerful analysis tool in the form of unbiased density estimation of a query sample under the training data distribution. In this work, we investigate the robustness of density estimation using the probability flow (PF) neural ordinary differential equation (ODE) model against gradient-based likelihood maximization attacks and the relation to sample complexity, where the compressed size of a sample is used as a measure of its complexity. We introduce and evaluate six gradient-based log-likelihood maximization attacks, including a novel reverse integration attack. Our experimental evaluations on CIFAR-10 show that density estimation using the PF ODE is robust against high-complexity, high-likelihood attacks, and that in some cases adversarial samples are semantically meaningful, as expected from a robust estimator.
Taking the human out of decomposition-based optimization via artificial intelligence: Part II. Learning to initialize
for: solving large-scale optimization problems frequently encountered in process systems engineering tasks
methods: using machine learning to learn the optimal initialization of decomposition-based solution algorithms, reducing computational time
results: the proposed method leads to a significant reduction in solution time, and active learning reduces the amount of data required for learning.Abstract
The repeated solution of large-scale optimization problems arises frequently in process systems engineering tasks. Decomposition-based solution methods have been widely used to reduce the corresponding computational time, yet their implementation has multiple steps that are difficult to configure. We propose a machine learning approach to learn the optimal initialization of such algorithms which minimizes the computational time. Active and supervised learning is used to learn a surrogate model that predicts the computational performance for a given initialization. We apply this approach to the initialization of Generalized Benders Decomposition for the solution of mixed integer model predictive control problems. The surrogate models are used to find the optimal number of initial cuts that should be added in the master problem. The results show that the proposed approach can lead to a significant reduction in solution time, and active learning can reduce the data required for learning.
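A minimal sketch of the surrogate-based initialization follows, assuming instance features, a random forest surrogate for solve time, and a grid search over the number of initial cuts; all data below is synthetic.

```python
# Sketch: surrogate model mapping (instance features, n_cuts) -> solve time,
# then picking the predicted minimizer. Features and data are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
feats = rng.normal(size=(200, 4))                  # per-instance problem features
cuts = rng.integers(0, 50, size=(200, 1))          # initial cuts tried
X = np.hstack([feats, cuts])
solve_time = 5 + 0.1 * (cuts[:, 0] - 20) ** 2 + rng.normal(scale=2, size=200)
surrogate = RandomForestRegressor(n_estimators=100).fit(X, solve_time)

def best_n_cuts(instance_feats, max_cuts=50):
    grid = np.hstack([np.tile(instance_feats, (max_cuts + 1, 1)),
                      np.arange(max_cuts + 1)[:, None]])
    return int(np.argmin(surrogate.predict(grid)))

print(best_n_cuts(rng.normal(size=4)))
```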
results: With the reputation mechanism, fast model convergence and high accuracy are achieved even with 30% malicious clients.Abstract
Federated Learning (FL) is a well-known paradigm of distributed machine learning on mobile and IoT devices, which preserves data privacy and optimizes communication efficiency. To avoid the single point of failure problem in FL, decentralized federated learning (DFL) has been proposed to use peer-to-peer communication for model aggregation, which has been considered an attractive solution for machine learning tasks on distributed personal devices. However, this process is vulnerable to attackers who share false models and data. If there exists a group of malicious clients, they might harm the performance of the model by carrying out a poisoning attack. In addition, in DFL, clients often lack the incentives to contribute their computing powers to do model training. In this paper, we proposed Blockchain-based Decentralized Federated Learning (BDFL), which leverages a blockchain for decentralized model verification and auditing. BDFL includes an auditor committee for model verification, an incentive mechanism to encourage the participation of clients, a reputation model to evaluate the trustworthiness of clients, and a protocol suite for dynamic network updates. Evaluation results show that, with the reputation mechanism, BDFL achieves fast model convergence and high accuracy on real datasets even if there exist 30\% malicious clients in the system.
Taking the human out of decomposition-based optimization via artificial intelligence: Part I. Learning when to decompose
results: The approach is used to develop a classifier that determines whether a convex mixed-integer nonlinear program is best solved with branch and bound or with the outer approximation algorithm. The learned classifier can also be integrated into existing mixed-integer optimization solvers.Abstract
In this paper, we propose a graph classification approach for automatically determining whether to use a monolithic or a decomposition-based solution method. In this approach, an optimization problem is represented as a graph that captures the structural and functional coupling among the variables and constraints of the problem via an appropriate set of features. Given this representation, a graph classifier is built to determine the best solution method for a given problem. The proposed approach is used to develop a classifier that determines whether a convex Mixed Integer Nonlinear Programming problem should be solved using branch and bound or the outer approximation algorithm. Finally, it is shown how the learned classifier can be incorporated into existing mixed integer optimization solvers.
Acoustic Model Fusion for End-to-end Speech Recognition
paper_authors: Zhihong Lei, Mingbin Xu, Shiyi Han, Leo Liu, Zhen Huang, Tim Ng, Yuanyuan Zhang, Ernest Pusateri, Mirko Hannemann, Yaqiao Deng, Man-Hung Siu
for: improving the accuracy and named entity recognition performance of ASR systems
methods: proposing a method that integrates an external acoustic model into an end-to-end ASR system to better address the domain mismatch problem
results: achieving word error rate reductions of up to 14.3% across varied test sets, along with clear improvements in named entity recognition.Abstract
Recent advances in deep learning and automatic speech recognition (ASR) have enabled the end-to-end (E2E) ASR system and boosted the accuracy to a new level. The E2E systems implicitly model all conventional ASR components, such as the acoustic model (AM) and the language model (LM), in a single network trained on audio-text pairs. Despite this simpler system architecture, fusing a separate LM, trained exclusively on text corpora, into the E2E system has proven to be beneficial. However, the application of LM fusion presents certain drawbacks, such as its inability to address the domain mismatch issue inherent to the internal AM. Drawing inspiration from the concept of LM fusion, we propose the integration of an external AM into the E2E system to better address the domain mismatch. By implementing this novel approach, we have achieved a significant reduction in the word error rate, with an impressive drop of up to 14.3% across varied test sets. We also discovered that this AM fusion approach is particularly beneficial in enhancing named entity recognition.
Spiral-Elliptical automated galaxy morphology classification from telescope images
results: Using galaxy image data from the Sloan Digital Sky Survey, the study demonstrates that the proposed image statistics effectively detect elliptical and spiral galaxies when used as features of a random forest classifier.Abstract
The classification of galaxy morphologies is an important step in the investigation of theories of hierarchical structure formation. While human expert visual classification remains quite effective and accurate, it cannot keep up with the massive influx of data from emerging sky surveys. A variety of approaches have been proposed to classify large numbers of galaxies; these approaches include crowdsourced visual classification, and automated and computational methods, such as machine learning methods based on designed morphology statistics and deep learning. In this work, we develop two novel galaxy morphology statistics, descent average and descent variance, which can be efficiently extracted from telescope galaxy images. We further propose simplified versions of the existing image statistics concentration, asymmetry, and clumpiness, which have been widely used in the literature of galaxy morphologies. We utilize the galaxy image data from the Sloan Digital Sky Survey to demonstrate the effective performance of our proposed image statistics at accurately detecting spiral and elliptical galaxies when used as features of a random forest classifier.
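The extract-then-classify pattern is easy to sketch; since the exact definitions of the descent statistics are given in the paper, the sketch below implements only the standard asymmetry statistic alongside two generic image features, with synthetic stand-in data.

```python
# Sketch: simple morphology statistics fed to a random forest classifier.
# Only the standard asymmetry statistic is implemented; the paper's descent
# average/variance statistics follow the same extract-then-classify pattern.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def asymmetry(img):
    """Rotational asymmetry: normalized residual under a 180-degree rotation."""
    rotated = np.rot90(img, 2)
    return np.abs(img - rotated).sum() / np.abs(img).sum()

rng = np.random.default_rng(2)
images = rng.random(size=(100, 64, 64))            # stand-in galaxy cutouts
labels = rng.integers(0, 2, size=100)              # 0 = elliptical, 1 = spiral
features = np.array([[asymmetry(im), im.std(), im.max()] for im in images])
clf = RandomForestClassifier(n_estimators=200).fit(features, labels)
```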
FedMFS: Federated Multimodal Fusion Learning with Selective Modality Communication
methods: The study proposes a new method, Federated Multimodal Fusion learning with Selective modality communication (FedMFS), which uses Shapley values to quantify each modality's contribution and modality model size to gauge communication overhead, so that each client can selectively upload modality models to the server for aggregation.
results: Experimental results show that FedMFS achieves accuracy comparable to baselines while reducing communication overhead to one twentieth on real-world multimodal datasets.Abstract
Federated learning (FL) is a distributed machine learning (ML) paradigm that enables clients to collaborate without accessing, infringing upon, or leaking original user data by sharing only model parameters. In the Internet of Things (IoT), edge devices are increasingly leveraging multimodal data compositions and fusion paradigms to enhance model performance. However, in FL applications, two main challenges remain open: (i) addressing the issues caused by heterogeneous clients lacking specific modalities and (ii) devising an optimal modality upload strategy to minimize communication overhead while maximizing learning performance. In this paper, we propose Federated Multimodal Fusion learning with Selective modality communication (FedMFS), a new multimodal fusion FL methodology that can tackle the above mentioned challenges. The key idea is to utilize Shapley values to quantify each modality's contribution and modality model size to gauge communication overhead, so that each client can selectively upload the modality models to the server for aggregation. This enables FedMFS to flexibly balance performance against communication costs, depending on resource constraints and applications. Experiments on real-world multimodal datasets demonstrate the effectiveness of FedMFS, achieving comparable accuracy while reducing communication overhead by one twentieth compared to baselines.
A predict-and-optimize approach to profit-driven churn prevention
results: On 12 customer churn prediction datasets, the proposed strategy achieves the best average profit, outperforming other well-established strategies.Abstract
In this paper, we introduce a novel predict-and-optimize method for profit-driven churn prevention. We frame the task of targeting customers for a retention campaign as a regret minimization problem. The main objective is to leverage individual customer lifetime values (CLVs) to ensure that only the most valuable customers are targeted. In contrast, many profit-driven strategies focus on churn probabilities while considering average CLVs. This often results in significant information loss due to data aggregation. Our proposed model aligns with the guidelines of Predict-and-Optimize (PnO) frameworks and can be efficiently solved using stochastic gradient descent methods. Results from 12 churn prediction datasets underscore the effectiveness of our approach, which achieves the best average performance compared to other well-established strategies in terms of average profit.
Neural Harmonium: An Interpretable Deep Structure for Nonlinear Dynamic System Identification with Application to Audio Processing
results: The proposed method is demonstrated on nonlinear system identification problems. In an acoustic echo cancellation scenario, crowd-sourced experiments comparing it with other state-of-the-art solutions confirm its effectiveness for real-life applications.Abstract
Improving the interpretability of deep neural networks has recently gained increased attention, especially when the power of deep learning is leveraged to solve problems in physics. Interpretability helps us understand a model's ability to generalize and reveal its limitations. In this paper, we introduce a causal interpretable deep structure for modeling dynamic systems. Our proposed model makes use of the harmonic analysis by modeling the system in a time-frequency domain while maintaining high temporal and spectral resolution. Moreover, the model is built in an order recursive manner which allows for fast, robust, and exact second order optimization without the need for an explicit Hessian calculation. To circumvent the resulting high dimensionality of the building blocks of our system, a neural network is designed to identify the frequency interdependencies. The proposed model is illustrated and validated on nonlinear system identification problems as required for audio signal processing tasks. Crowd-sourced experimentation contrasting the performance of the proposed approach to other state-of-the-art solutions on an acoustic echo cancellation scenario confirms the effectiveness of our method for real-life applications.
Neural Relational Inference with Fast Modular Meta-learning
methods: The paper uses modular meta-learning, training neural modules that can be composed in different ways to solve many tasks.
results: The modular meta-learning approach increases inference capacity, uses observational data more efficiently, and can estimate the state of entities that are not observed directly.Abstract
\textit{Graph neural networks} (GNNs) are effective models for many dynamical systems consisting of entities and relations. Although most GNN applications assume a single type of entity and relation, many situations involve multiple types of interactions. \textit{Relational inference} is the problem of inferring these interactions and learning the dynamics from observational data. We frame relational inference as a \textit{modular meta-learning} problem, where neural modules are trained to be composed in different ways to solve many tasks. This meta-learning framework allows us to implicitly encode time invariance and infer relations in context of one another rather than independently, which increases inference capacity. Framing inference as the inner-loop optimization of meta-learning leads to a model-based approach that is more data-efficient and capable of estimating the state of entities that we do not observe directly, but whose existence can be inferred from their effect on observed entities. To address the large search space of graph neural network compositions, we meta-learn a \textit{proposal function} that speeds up the inner-loop simulated annealing search within the modular meta-learning algorithm, providing two orders of magnitude increase in the size of problems that can be addressed.
Sound-skwatter (Did You Mean: Sound-squatter?) AI-powered Generator for Phishing Prevention
paper_authors: Rodolfo Valentim, Idilio Drago, Marco Mellia, Federico Cerutti
for: defending against sound-squatting attacks by using AI to generate sound-squatting candidates
methods: combining Transformer networks with acoustic models to learn sound similarity
results: automatically listing known homophones and thousands of high-quality candidates, while also supporting cross-language sound-squatting.Abstract
Sound-squatting is a phishing attack that tricks users into malicious resources by exploiting similarities in the pronunciation of words. Proactive defense against sound-squatting candidates is complex, and existing solutions rely on manually curated lists of homophones. We here introduce Sound-skwatter, a multi-language AI-based system that generates sound-squatting candidates for proactive defense. Sound-skwatter relies on an innovative multi-modal combination of Transformers Networks and acoustic models to learn sound similarities. We show that Sound-skwatter can automatically list known homophones and thousands of high-quality candidates. In addition, it covers cross-language sound-squatting, i.e., when the reader and the listener speak different languages, supporting any combination of languages. We apply Sound-skwatter to network-centric phishing via squatted domain names. We find ~ 10% of the generated domains exist in the wild, the vast majority unknown to protection solutions. Next, we show attacks on the PyPI package manager, where ~ 17% of the popular packages have at least one existing candidate. We believe Sound-skwatter is a crucial asset to mitigate the sound-squatting phenomenon proactively on the Internet. To increase its impact, we publish an online demo and release our models and code as open source.
CarDS-Plus ECG Platform: Development and Feasibility Evaluation of a Multiplatform Artificial Intelligence Toolkit for Portable and Wearable Device Electrocardiograms
paper_authors: Sumukh Vasisht Shankar, Evangelos K Oikonomou, Rohan Khera
for: The study develops a multiplatform system for the rapid deployment of AI-based single-lead electrocardiogram (ECG) solutions for clinical investigation and care delivery.
methods: The study examines design considerations aligned with specific applications, optimizes data flows from diverse wearable devices into a centralized data lake, and supports real-time inference through AI models for ECG interpretation.
results: The platform achieves a mean duration of 33.0-35.7 seconds from acquisition to reporting, with no substantial differences across two commercially available devices (Apple Watch and KardiaMobile), demonstrating the feasibility of translating the design principles into a rapidly deployable strategy with clinical impact.Abstract
In the rapidly evolving landscape of modern healthcare, the integration of wearable & portable technology provides a unique opportunity for personalized health monitoring in the community. Devices like the Apple Watch, FitBit, and AliveCor KardiaMobile have revolutionized the acquisition and processing of intricate health data streams. Amidst the variety of data collected by these gadgets, single-lead electrocardiogram (ECG) recordings have emerged as a crucial source of information for monitoring cardiovascular health. There has been significant advances in artificial intelligence capable of interpreting these 1-lead ECGs, facilitating clinical diagnosis as well as the detection of rare cardiac disorders. This design study describes the development of an innovative multiplatform system aimed at the rapid deployment of AI-based ECG solutions for clinical investigation & care delivery. The study examines design considerations, aligning them with specific applications, develops data flows to maximize efficiency for research & clinical use. This process encompasses the reception of single-lead ECGs from diverse wearable devices, channeling this data into a centralized data lake & facilitating real-time inference through AI models for ECG interpretation. An evaluation of the platform demonstrates a mean duration from acquisition to reporting of results of 33.0 to 35.7 seconds, after a standard 30 second acquisition. There were no substantial differences in acquisition to reporting across two commercially available devices (Apple Watch and KardiaMobile). These results demonstrate the successful translation of design principles into a fully integrated & efficient strategy for leveraging 1-lead ECGs across platforms & interpretation by AI-ECG algorithms. Such a platform is critical to translating AI discoveries for wearable and portable ECG devices to clinical impact through rapid deployment.
Federated Quantum Machine Learning with Differential Privacy
results: Using a quantum-classical machine learning model for binary classification on the Cats vs Dogs dataset, the study achieves a test accuracy of over 0.98 while maintaining epsilon values below 1.3, validating federated differentially private training as a viable privacy-preservation method for quantum machine learning on Noisy Intermediate-Scale Quantum (NISQ) devices.Abstract
The preservation of privacy is a critical concern in the implementation of artificial intelligence on sensitive training data. There are several techniques to preserve data privacy, but quantum computation is inherently more secure due to the no-cloning theorem, making it a highly desirable computational platform on top of its potential quantum advantages. Prior works have studied protecting data privacy through Quantum Federated Learning (QFL) and Quantum Differential Privacy (QDP) independently. However, to the best of our knowledge, no prior work has addressed both QFL and QDP together yet. Here, we propose to combine these privacy-preserving methods and implement them on the quantum platform, so that we can achieve comprehensive protection against data leakage (QFL) and model inversion attacks (QDP). This implementation promises more efficient and secure artificial intelligence. In this paper, we present a successful implementation of these privacy-preservation methods by performing the binary classification of the Cats vs Dogs dataset. Using our quantum-classical machine learning model, we obtained a test accuracy of over 0.98, while maintaining epsilon values less than 1.3. We show that federated differentially private training is a viable privacy preservation method for quantum machine learning on Noisy Intermediate-Scale Quantum (NISQ) devices.
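A minimal PyTorch sketch of a differentially private client update in the federated round follows, with an ordinary linear model standing in for the quantum circuit; the clipping norm and noise scale are illustrative, and the privacy accounting is omitted.

```python
# Sketch: per-example gradient clipping + Gaussian noise in one client round.
# A linear model stands in for the quantum circuit; hyperparameters assumed.
import torch

def dp_client_update(model, data, targets, clip=1.0, sigma=0.8, lr=0.1):
    grads = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in zip(data, targets):                 # per-example gradients
        model.zero_grad()
        loss = torch.nn.functional.binary_cross_entropy_with_logits(
            model(x.unsqueeze(0)).squeeze(), y)
        loss.backward()
        norm = torch.sqrt(sum(p.grad.norm() ** 2 for p in model.parameters()))
        scale = min(1.0, clip / (norm + 1e-12))     # clip the gradient norm
        for g, p in zip(grads, model.parameters()):
            g += p.grad * scale
    with torch.no_grad():
        for g, p in zip(grads, model.parameters()):
            noisy = (g + sigma * clip * torch.randn_like(g)) / len(data)
            p -= lr * noisy                          # noisy averaged step

model = torch.nn.Linear(8, 1)
data = torch.randn(16, 8)
targets = torch.randint(0, 2, (16,)).float()
dp_client_update(model, data, targets)
```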
Flood and Echo: Algorithmic Alignment of GNNs with Distributed Computing
results: The study shows that the framework is provably more efficient in terms of message complexity than conventional execution frameworks in many settings, and that it supports effective information exchange and extrapolation to larger graphs.Abstract
Graph Neural Networks are a natural fit for learning algorithms. They can directly represent tasks through an abstract but versatile graph structure and handle inputs of different sizes. This opens up the possibility for scaling and extrapolation to larger graphs, one of the most important advantages of an algorithm. However, this raises two core questions i) How can we enable nodes to gather the required information in a given graph ($\textit{information exchange}$), even if is far away and ii) How can we design an execution framework which enables this information exchange for extrapolation to larger graph sizes ($\textit{algorithmic alignment for extrapolation}$). We propose a new execution framework that is inspired by the design principles of distributed algorithms: Flood and Echo Net. It propagates messages through the entire graph in a wave like activation pattern, which naturally generalizes to larger instances. Through its sparse but parallel activations it is provably more efficient in terms of message complexity. We study the proposed model and provide both empirical evidence and theoretical insights in terms of its expressiveness, efficiency, information exchange and ability to extrapolate.
Positivity-free Policy Learning with Observational Data
results: The paper provides theoretical guarantees for policy learning and validates the finite-sample performance of the proposed framework through comprehensive numerical experiments, ensuring that the identification of causal effects from observational data is both robust and reliable.Abstract
Policy learning utilizing observational data is pivotal across various domains, with the objective of learning the optimal treatment assignment policy while adhering to specific constraints such as fairness, budget, and simplicity. This study introduces a novel positivity-free (stochastic) policy learning framework designed to address the challenges posed by the impracticality of the positivity assumption in real-world scenarios. This framework leverages incremental propensity score policies to adjust propensity score values instead of assigning fixed values to treatments. We characterize these incremental propensity score policies and establish identification conditions, employing semiparametric efficiency theory to propose efficient estimators capable of achieving rapid convergence rates, even when integrated with advanced machine learning algorithms. This paper provides a thorough exploration of the theoretical guarantees associated with policy learning and validates the proposed framework's finite-sample performance through comprehensive numerical experiments, ensuring the identification of causal effects from observational data is both robust and reliable.
Diffusion Prior Regularized Iterative Reconstruction for Low-dose CT
paper_authors: Wenjun Xia, Yongyi Shi, Chuang Niu, Wenxiang Cong, Ge Wang
for: reducing X-ray radiation dose while preserving computed tomography (CT) image quality
methods: introducing an iterative reconstruction algorithm that merges a denoising diffusion probabilistic model (DDPM) prior with a data-fidelity-driven reconstruction procedure, accelerated with Nesterov momentum
results: achieving high-definition CT image reconstruction with minimized radiation dose.Abstract
Computed tomography (CT) involves a patient's exposure to ionizing radiation. To reduce the radiation dose, we can either lower the X-ray photon count or down-sample projection views. However, either of the ways often compromises image quality. To address this challenge, here we introduce an iterative reconstruction algorithm regularized by a diffusion prior. Drawing on the exceptional imaging prowess of the denoising diffusion probabilistic model (DDPM), we merge it with a reconstruction procedure that prioritizes data fidelity. This fusion capitalizes on the merits of both techniques, delivering exceptional reconstruction results in an unsupervised framework. To further enhance the efficiency of the reconstruction process, we incorporate the Nesterov momentum acceleration technique. This enhancement facilitates superior diffusion sampling in fewer steps. As demonstrated in our experiments, our method offers a potential pathway to high-definition CT image reconstruction with minimized radiation.
A Variational Autoencoder Framework for Robust, Physics-Informed Cyberattack Recognition in Industrial Cyber-Physical Systems
results: A realistic simulation study on a networked power transmission system demonstrates the applicability and efficacy of the proposed method.
Abstract
Cybersecurity of Industrial Cyber-Physical Systems is drawing significant concern as data communication increasingly leverages wireless networks. Many data-driven methods have been developed for detecting cyberattacks, but few focus on distinguishing them from equipment faults. In this paper, we develop a data-driven framework that can be used to detect, diagnose, and localize a type of cyberattack called covert attacks on networked industrial control systems. The framework has a hybrid design that combines a variational autoencoder (VAE), a recurrent neural network (RNN), and a Deep Neural Network (DNN). It considers the temporal behavior of a generic physical system and extracts features from the time series of sensor measurements that can be used for detecting covert attacks, distinguishing them from equipment faults, as well as localizing the attack/fault. We evaluate the performance of the proposed method through a realistic simulation study on a networked power transmission system as a typical example of ICS. We compare the proposed method with a traditional model-based method to show its applicability and efficacy.
LLMs Killed the Script Kiddie: How Agents Supported by Large Language Models Change the Landscape of Network Threat Testing
results: The authors assess the extent of the LLM's cyber-specific knowledge by demonstrating a short threat campaign and provide guidelines for prompt design. They also discuss the potential impact of LLMs on accelerating threat actor capabilities and the associated ethical considerations. The results show that LLMs can generate useful information and automate cyber campaigns, but their capabilities on more complex networks, sophisticated vulnerabilities, and the sensitivity of prompts remain open questions.
Abstract
In this paper, we explore the potential of Large Language Models (LLMs) to reason about threats, generate information about tools, and automate cyber campaigns. We begin with a manual exploration of LLMs in supporting specific threat-related actions and decisions. We proceed by automating the decision process in a cyber campaign. We present prompt engineering approaches for a plan-act-report loop for one action of a threat campaign, and a prompt chaining design that directs the sequential decision process of a multi-action campaign. We assess the extent of the LLM's cyber-specific knowledge w.r.t. the short campaign we demonstrate and provide insights into prompt design for eliciting actionable responses. We discuss the potential impact of LLMs on the threat landscape and the ethical considerations of using LLMs for accelerating threat actor capabilities. We report a promising, yet concerning, application of generative AI to cyber threats. However, the LLM's capabilities to deal with more complex networks, sophisticated vulnerabilities, and the sensitivity of prompts are open questions. This research should spur deliberations over the inevitable advancements in the LLM-supported cyber adversarial landscape.
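A hedged sketch of the plan-act-report loop for a single campaign action; `llm` and `run_tool` are hypothetical stand-ins passed in by the caller, and the paper's actual prompts and tooling are not reproduced here.

```python
# Sketch of a plan-act-report loop: the model proposes an action, the
# action is executed, and the observation is summarized back into a report
# that seeds the next planning step.
def plan_act_report(llm, run_tool, objective, max_steps=5):
    report = ""
    for _ in range(max_steps):
        plan = llm(f"Objective: {objective}\nPrior findings: {report}\n"
                   "Propose the single next action as a tool command.")
        observation = run_tool(plan)          # execute the proposed action
        report = llm(f"Action: {plan}\nObservation: {observation}\n"
                     "Summarize findings and state whether the objective is met.")
        if "objective met" in report.lower():
            break
    return report
```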
Quantum Shadow Gradient Descent for Quantum Learning
results: The study shows that using quantum shadows reduces the computational cost and extends to more general non-product ansatzes. Theoretical proofs, convergence analysis, and numerical experiments support the conclusions.
Abstract
This paper proposes a new procedure called quantum shadow gradient descent (QSGD) that addresses key challenges in gradient estimation for quantum learning models. Our method has the benefits of a one-shot approach, in not requiring any sample duplication while having a convergence rate comparable to the ideal update rule using exact gradient computation. We propose a new technique for generating quantum shadow samples (QSS), which generates quantum shadows as opposed to classical shadows used in existing works. With classical shadows, the computations are typically performed on classical computers and, hence, are prohibitive since the dimension grows exponentially. Our approach resolves this issue by measurements of quantum shadows. As the second main contribution, we study more general non-product ansatz of the form $\exp\{i\sum_j \theta_j A_j\}$ that model variational Hamiltonians. We prove that the gradient can be written in terms of the gradient of single-parameter ansatzes that can be easily measured. Our proof is based on the Suzuki-Trotter approximation; however, our expressions are exact, unlike prior efforts that approximate non-product operators. As a result, existing gradient measurement techniques can be applied to more general VQAs followed by correction terms without any approximation penalty. We provide theoretical proofs, convergence analysis and verify our results through numerical experiments.
results: The predicted prosody attributes correlate better with human audiobook readings than a state-of-the-art commercial TTS system, and a human evaluation study shows that people prefer audiobook readings generated with this method.
Abstract
Recent advances in text-to-speech have made it possible to generate natural-sounding audio from text. However, audiobook narrations involve dramatic vocalizations and intonations by the reader, with greater reliance on emotions, dialogues, and descriptions in the narrative. Using our dataset of 93 aligned book-audiobook pairs, we present improved models for prosody prediction properties (pitch, volume, and rate of speech) from narrative text using language modeling. Our predicted prosody attributes correlate much better with human audiobook readings than results from a state-of-the-art commercial TTS system: our predicted pitch shows a higher correlation with human reading for 22 out of the 24 books, while our predicted volume attribute proves more similar to human reading for 23 out of the 24 books. Finally, we present a human evaluation study to quantify the extent that people prefer prosody-enhanced audiobook readings over commercial text-to-speech systems.
Stochastic Super-resolution of Cosmological Simulations with Denoising Diffusion Models
results: Denoising diffusion models generate convincing super-resolution images and power spectra consistent at the percent level, and reproduce the diversity of small-scale features consistent with a given low-resolution simulation. This enables uncertainty quantification for the generated small-scale features, making such super-resolution models viable surrogates for cosmic structure formation.
Abstract
In recent years, deep learning models have been successfully employed for augmenting low-resolution cosmological simulations with small-scale information, a task known as "super-resolution". So far, these cosmological super-resolution models have relied on generative adversarial networks (GANs), which can achieve highly realistic results, but suffer from various shortcomings (e.g. low sample diversity). We introduce denoising diffusion models as a powerful generative model for super-resolving cosmic large-scale structure predictions (as a first proof-of-concept in two dimensions). To obtain accurate results down to small scales, we develop a new "filter-boosted" training approach that redistributes the importance of different scales in the pixel-wise training objective. We demonstrate that our model not only produces convincing super-resolution images and power spectra consistent at the percent level, but is also able to reproduce the diversity of small-scale features consistent with a given low-resolution simulation. This enables uncertainty quantification for the generated small-scale features, which is critical for the usefulness of such super-resolution models as a viable surrogate model for cosmic structure formation.
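One plausible reading of the "filter-boosted" objective, sketched below: the pixel-wise residual is reweighted across spatial frequencies so that chosen scales contribute more to the loss. The specific weighting function is an assumption, not the paper's exact filter.

```python
# Sketch of a frequency-reweighted pixel-wise loss: residuals are moved to
# Fourier space and each mode is weighted by a function of its wavenumber.
import numpy as np

def filter_boosted_mse(pred, target, weight_of_k):
    """pred, target: 2D arrays; weight_of_k: scalar function of |k|."""
    residual_k = np.fft.fft2(pred - target)
    kx = np.fft.fftfreq(pred.shape[0])[:, None]
    ky = np.fft.fftfreq(pred.shape[1])[None, :]
    w = weight_of_k(np.sqrt(kx**2 + ky**2))  # boost the chosen scales
    return np.mean(w * np.abs(residual_k) ** 2)

x, y = np.random.rand(64, 64), np.random.rand(64, 64)
print(filter_boosted_mse(x, y, weight_of_k=lambda k: 1.0 + 4.0 * k))
```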
Inverse Factorized Q-Learning for Cooperative Multi-agent Imitation Learning
results: Extensive experiments on challenging competitive and cooperative multi-agent game environments demonstrate the effectiveness of the algorithm, which outperforms existing state-of-the-art multi-agent IL algorithms.
Abstract
This paper concerns imitation learning (IL) (i.e., the problem of learning to mimic expert behaviors from demonstrations) in cooperative multi-agent systems. The learning problem under consideration poses several challenges, characterized by high-dimensional state and action spaces and intricate inter-agent dependencies. In a single-agent setting, IL can be performed efficiently through an inverse soft-Q learning process given expert demonstrations. However, extending this framework to a multi-agent context introduces the need to simultaneously learn both local value functions to capture local observations and individual actions, and a joint value function for exploiting centralized learning. In this work, we introduce a novel multi-agent IL algorithm designed to address these challenges. Our approach enables centralized learning by leveraging mixing networks to aggregate decentralized Q functions. A main advantage of this approach is that the weights of the mixing networks can be trained using information derived from global states. We further establish conditions for the mixing networks under which the multi-agent objective function exhibits convexity within the Q function space. We present extensive experiments conducted on some challenging competitive and cooperative multi-agent game environments, including an advanced version of the Star-Craft multi-agent challenge (i.e., SMACv2), which demonstrates the effectiveness of our proposed algorithm compared to existing state-of-the-art multi-agent IL algorithms.
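A QMIX-style sketch of a mixing network whose weights are generated from the global state; the `abs()` non-negativity constraint is a common simplification, whereas the paper instead derives conditions under which the objective is convex in the Q-function space.

```python
# Per-agent Q values are mixed by weights produced from the global state.
import torch
import torch.nn as nn

class Mixer(nn.Module):
    def __init__(self, n_agents, state_dim, hidden=32):
        super().__init__()
        self.w1 = nn.Linear(state_dim, n_agents * hidden)
        self.b1 = nn.Linear(state_dim, hidden)
        self.w2 = nn.Linear(state_dim, hidden)
        self.b2 = nn.Linear(state_dim, 1)
        self.n_agents, self.hidden = n_agents, hidden

    def forward(self, agent_qs, state):  # (B, n_agents), (B, state_dim)
        B = agent_qs.shape[0]
        w1 = self.w1(state).abs().view(B, self.n_agents, self.hidden)
        h = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1).squeeze(1) + self.b1(state))
        w2 = self.w2(state).abs().unsqueeze(-1)          # (B, hidden, 1)
        q_tot = torch.bmm(h.unsqueeze(1), w2).squeeze(-1).squeeze(-1)
        return q_tot + self.b2(state).squeeze(-1)

mixer = Mixer(n_agents=3, state_dim=8)
print(mixer(torch.randn(4, 3), torch.randn(4, 8)).shape)  # torch.Size([4])
```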
Test & Evaluation Best Practices for Machine Learning-Enabled Systems
for: This paper aims to present best practices for the Test and Evaluation (T&E) of Machine Learning (ML)-enabled software systems across their lifecycle.
methods: The paper categorizes the lifecycle of ML-enabled software systems into three stages: component, integration and deployment, and post-deployment. The primary objective is to test and evaluate the ML model as a standalone component, and then evaluate an integrated ML-enabled system consisting of both ML and non-ML components.
results: The paper highlights the challenges of T&E in ML-enabled software systems and the need for systematic testing approaches, adequacy measurements, and metrics to address these challenges across all stages of the ML-enabled system lifecycle.
Abstract
Machine learning (ML) - based software systems are rapidly gaining adoption across various domains, making it increasingly essential to ensure they perform as intended. This report presents best practices for the Test and Evaluation (T&E) of ML-enabled software systems across its lifecycle. We categorize the lifecycle of ML-enabled software systems into three stages: component, integration and deployment, and post-deployment. At the component level, the primary objective is to test and evaluate the ML model as a standalone component. Next, in the integration and deployment stage, the goal is to evaluate an integrated ML-enabled system consisting of both ML and non-ML components. Finally, once the ML-enabled software system is deployed and operationalized, the T&E objective is to ensure the system performs as intended. Maintenance activities for ML-enabled software systems span the lifecycle and involve maintaining various assets of ML-enabled software systems. Given its unique characteristics, the T&E of ML-enabled software systems is challenging. While significant research has been reported on T&E at the component level, limited work is reported on T&E in the remaining two stages. Furthermore, in many cases, there is a lack of systematic T&E strategies throughout the ML-enabled system's lifecycle. This leads practitioners to resort to ad-hoc T&E practices, which can undermine user confidence in the reliability of ML-enabled software systems. New systematic testing approaches, adequacy measurements, and metrics are required to address the T&E challenges across all stages of the ML-enabled system lifecycle.
Spectral Entry-wise Matrix Estimation for Low-Rank Reinforcement Learning
paper_authors: Stefan Stojanovic, Yassir Jedra, Alexandre Proutiere
for: Matrix estimation problems in reinforcement learning (RL) with low-rank structure, such as low-rank bandits and Markov Decision Processes (MDPs).
methods: Spectral-based matrix estimation approaches that efficiently recover the singular subspaces of the matrix and exhibit nearly-minimal entry-wise error.
results: State-of-the-art performance guarantees for two examples of algorithms: a regret minimization algorithm for low-rank bandit problems, and a best policy identification algorithm for reward-free RL in low-rank MDPs.
Abstract
We study matrix estimation problems arising in reinforcement learning (RL) with low-rank structure. In low-rank bandits, the matrix to be recovered specifies the expected arm rewards, and for low-rank Markov Decision Processes (MDPs), it may for example characterize the transition kernel of the MDP. In both cases, each entry of the matrix carries important information, and we seek estimation methods with low entry-wise error. Importantly, these methods further need to accommodate for inherent correlations in the available data (e.g. for MDPs, the data consists of system trajectories). We investigate the performance of simple spectral-based matrix estimation approaches: we show that they efficiently recover the singular subspaces of the matrix and exhibit nearly-minimal entry-wise error. These new results on low-rank matrix estimation make it possible to devise reinforcement learning algorithms that fully exploit the underlying low-rank structure. We provide two examples of such algorithms: a regret minimization algorithm for low-rank bandit problems, and a best policy identification algorithm for reward-free RL in low-rank MDPs. Both algorithms yield state-of-the-art performance guarantees.
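A minimal illustration of the spectral estimators analyzed here: observe a noisy low-rank matrix and truncate its SVD. The paper's contribution is showing that this kind of truncation achieves nearly-minimal entry-wise error even with correlated data; the toy noise model below is illustrative.

```python
# Spectral (truncated SVD) estimation of a noisy low-rank matrix.
import numpy as np

def spectral_estimate(M_noisy, rank):
    U, s, Vt = np.linalg.svd(M_noisy, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

rng = np.random.default_rng(0)
M = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 40))  # rank-2 ground truth
M_hat = spectral_estimate(M + 0.1 * rng.normal(size=M.shape), rank=2)
print(np.max(np.abs(M_hat - M)))  # entry-wise (sup-norm) error
```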
Enhancing Predictive Capabilities in Data-Driven Dynamical Modeling with Automatic Differentiation: Koopman and Neural ODE Approaches
results: The proposed framework significantly outperforms EDMD-DL. The state space approach (neural ODEs) outperforms the 'pure' Koopman approach, and when the Koopman evolution alternates between state and observable spaces at each time step (no longer satisfying the linearity of the true Koopman operator), its predictions become comparable to the state space approach.
Abstract
Data-driven approximations of the Koopman operator are promising for predicting the time evolution of systems characterized by complex dynamics. Among these methods, the approach known as extended dynamic mode decomposition with dictionary learning (EDMD-DL) has garnered significant attention. Here we present a modification of EDMD-DL that concurrently determines both the dictionary of observables and the corresponding approximation of the Koopman operator. This innovation leverages automatic differentiation to facilitate gradient descent computations through the pseudoinverse. We also address the performance of several alternative methodologies. We assess a 'pure' Koopman approach, which involves the direct time-integration of a linear, high-dimensional system governing the dynamics within the space of observables. Additionally, we explore a modified approach where the system alternates between spaces of states and observables at each time step -- this approach no longer satisfies the linearity of the true Koopman operator representation. For further comparisons, we also apply a state space approach (neural ODEs). We consider systems encompassing two and three-dimensional ordinary differential equation systems featuring steady, oscillatory, and chaotic attractors, as well as partial differential equations exhibiting increasingly complex and intricate behaviors. Our framework significantly outperforms EDMD-DL. Furthermore, the state space approach offers superior performance compared to the 'pure' Koopman approach where the entire time evolution occurs in the space of observables. When the temporal evolution of the Koopman approach alternates between states and observables at each time step, however, its predictions become comparable to those of the state space approach.
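A sketch of the joint dictionary-plus-Koopman training idea with automatic differentiation through the pseudoinverse; the network size and loss are illustrative choices, and in practice additional terms (e.g. state reconstruction) are typically needed to rule out trivial dictionaries.

```python
# EDMD-DL style training: a learned dictionary psi maps states to
# observables, the Koopman matrix is the least-squares solution via a
# differentiable pseudoinverse, and gradients flow through both.
import torch
import torch.nn as nn

psi = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 10))
opt = torch.optim.Adam(psi.parameters(), lr=1e-3)

def edmd_dl_loss(x, x_next):
    Phi, Phi_next = psi(x), psi(x_next)        # (N, 10) snapshot matrices
    K = torch.linalg.pinv(Phi) @ Phi_next      # differentiable least squares
    return ((Phi @ K - Phi_next) ** 2).mean()  # one-step prediction error

x = torch.randn(256, 2)
x_next = x + 0.01 * torch.randn_like(x)        # stand-in trajectory pairs
loss = edmd_dl_loss(x, x_next)
loss.backward()
opt.step()
```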
results: The information-theoretic reward induces efficient exploration and outperforms alternatives in various games, including Montezuma's Revenge, a known difficult reinforcement learning task. An extension that maximizes information content in a discretely compressed latent space further boosts sample efficiency and generalizes to continuous state spaces.
Abstract
Sparse reward environments are known to be challenging for reinforcement learning agents. In such environments, efficient and scalable exploration is crucial. Exploration is a means by which an agent gains information about the environment. We expand on this topic and propose a new intrinsic reward that systemically quantifies exploratory behavior and promotes state coverage by maximizing the information content of a trajectory taken by an agent. We compare our method to alternative exploration based intrinsic reward techniques, namely Curiosity Driven Learning and Random Network Distillation. We show that our information theoretic reward induces efficient exploration and outperforms in various games, including Montezuma Revenge, a known difficult task for reinforcement learning. Finally, we propose an extension that maximizes information content in a discretely compressed latent space which boosts sample efficiency and generalizes to continuous state spaces.
Causal Rule Learning: Enhancing the Understanding of Heterogeneous Treatment Effect via Weighted Causal Rules
paper_authors: Ying Wu, Hanzhong Liu, Kai Ren, Xiangyu Chang
for: Estimating heterogeneous treatment effects with machine learning methods, with a focus on interpretability for healthcare applications.
methods: The proposed method, causal rule learning, involves three phases: rule discovery, rule selection, and rule analysis. It uses a causal forest to discover candidate causal rules and a D-learning method to deconstruct individual-level treatment effects as a linear combination of subgroup-level effects, answering a question ignored by previous literature: what if an individual simultaneously belongs to multiple groups with different average treatment effects?
results: Causal rule learning outperforms other methods in the interpretable estimation of heterogeneous treatment effects when the ground truth is complex and the sample size is sufficient, and provides insights into the treatment effects of different subgroups and the weight of each rule in the linear combination.
Abstract
Interpretability is a key concern in estimating heterogeneous treatment effects using machine learning methods, especially for healthcare applications where high-stake decisions are often made. Inspired by the Predictive, Descriptive, Relevant framework of interpretability, we propose causal rule learning which finds a refined set of causal rules characterizing potential subgroups to estimate and enhance our understanding of heterogeneous treatment effects. Causal rule learning involves three phases: rule discovery, rule selection, and rule analysis. In the rule discovery phase, we utilize a causal forest to generate a pool of causal rules with corresponding subgroup average treatment effects. The selection phase then employs a D-learning method to select a subset of these rules to deconstruct individual-level treatment effects as a linear combination of the subgroup-level effects. This helps to answer an ignored question by previous literature: what if an individual simultaneously belongs to multiple groups with different average treatment effects? The rule analysis phase outlines a detailed procedure to further analyze each rule in the subset from multiple perspectives, revealing the most promising rules for further validation. The rules themselves, their corresponding subgroup treatment effects, and their weights in the linear combination give us more insights into heterogeneous treatment effects. Simulation and real-world data analysis demonstrate the superior performance of causal rule learning on the interpretable estimation of heterogeneous treatment effect when the ground truth is complex and the sample size is sufficient.
Growing ecosystem of deep learning methods for modeling protein–protein interactions
results: The review highlights recent successes in using representation learning to capture complex features of protein interactions, geometric deep learning to reason over protein structures and predict complex structures, and generative modeling to design de novo protein assemblies. These advances push protein interaction modeling forward and open new directions for elucidating the physical mechanisms of protein interactions and engineering them.
Abstract
Numerous cellular functions rely on protein–protein interactions. Efforts to comprehensively characterize them remain challenged however by the diversity of molecular recognition mechanisms employed within the proteome. Deep learning has emerged as a promising approach for tackling this problem by exploiting both experimental data and basic biophysical knowledge about protein interactions. Here, we review the growing ecosystem of deep learning methods for modeling protein interactions, highlighting the diversity of these biophysically-informed models and their respective trade-offs. We discuss recent successes in using representation learning to capture complex features pertinent to predicting protein interactions and interaction sites, geometric deep learning to reason over protein structures and predict complex structures, and generative modeling to design de novo protein assemblies. We also outline some of the outstanding challenges and promising new directions. Opportunities abound to discover novel interactions, elucidate their physical mechanisms, and engineer binders to modulate their functions using deep learning and, ultimately, unravel how protein interactions orchestrate complex cellular behaviors.
Improving Pseudo-Time Stepping Convergence for CFD Simulations With Neural Networks
results: Pseudo-transient continuation is employed to improve the nonlinear convergence of the Navier-Stokes equations. A neural network model predicts the local pseudo-time step separately on each element, using only local information on a patch of adjacent elements. Numerical results for standard benchmark problems, such as flow through a backward facing step geometry and Couette flow, show the performance of the machine-learning-enhanced globalization approach.
Abstract
Computational fluid dynamics (CFD) simulations of viscous fluids described by the Navier-Stokes equations are considered. Depending on the Reynolds number of the flow, the Navier-Stokes equations may exhibit a highly nonlinear behavior. The system of nonlinear equations resulting from the discretization of the Navier-Stokes equations can be solved using nonlinear iteration methods, such as Newton's method. However, fast quadratic convergence is typically only obtained in a local neighborhood of the solution, and for many configurations, the classical Newton iteration does not converge at all. In such cases, so-called globalization techniques may help to improve convergence. In this paper, pseudo-transient continuation is employed in order to improve nonlinear convergence. The classical algorithm is enhanced by a neural network model that is trained to predict a local pseudo-time step. Generalization of the novel approach is facilitated by predicting the local pseudo-time step separately on each element using only local information on a patch of adjacent elements as input. Numerical results for standard benchmark problems, including flow through a backward facing step geometry and Couette flow, show the performance of the machine learning-enhanced globalization approach; as the software for the simulations, the CFD module of COMSOL Multiphysics is employed.
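A sketch of pseudo-transient continuation with a learned local step: each nonlinear iteration solves a pseudo-time-damped Newton system, with per-element pseudo-time steps supplied by a predictor. `predict_dtau` stands in for the trained network; the heuristic used in the demo is not the paper's model.

```python
# Pseudo-transient continuation: solve (diag(1/dtau) + J(u)) du = -F(u),
# where large local dtau recovers a pure Newton step and small dtau damps it.
import numpy as np

def ptc_solve(F, J, u0, predict_dtau, n_iters=20):
    u = u0.copy()
    for _ in range(n_iters):
        r = F(u)
        if np.linalg.norm(r) < 1e-10:
            break
        dtau = predict_dtau(u, r)            # per-element pseudo-time steps
        A = np.diag(1.0 / dtau) + J(u)
        u = u + np.linalg.solve(A, -r)
    return u

# Tiny demo: solve u^3 - 1 = 0 component-wise with a crude heuristic step.
F = lambda u: u**3 - 1.0
J = lambda u: np.diag(3.0 * u**2)
print(ptc_solve(F, J, np.full(3, 2.0), lambda u, r: 1.0 / (np.abs(r) + 1e-3)))
```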
S4Sleep: Elucidating the design space of deep-learning-based sleep stage classification models
results: The identified architectures achieve statistically significant performance improvements on the extensive SHHS dataset, assessed through both statistical and systematic error estimations.
Abstract
Scoring sleep stages in polysomnography recordings is a time-consuming task plagued by significant inter-rater variability. Therefore, it stands to benefit from the application of machine learning algorithms. While many algorithms have been proposed for this purpose, certain critical architectural decisions have not received systematic exploration. In this study, we meticulously investigate these design choices within the broad category of encoder-predictor architectures. We identify robust architectures applicable to both time series and spectrogram input representations. These architectures incorporate structured state space models as integral components, leading to statistically significant advancements in performance on the extensive SHHS dataset. These improvements are assessed through both statistical and systematic error estimations. We anticipate that the architectural insights gained from this study will not only prove valuable for future research in sleep staging but also hold relevance for other time series annotation tasks.
Interpretable Traffic Event Analysis with Bayesian Networks
results: Through a concrete case study, the method predicts traffic accidents with competitive accuracy and analyzes the relationships between traffic and weather events, providing an interpretable approach to traffic accident prediction.
Abstract
Although existing machine learning-based methods for traffic accident analysis can provide good quality results to downstream tasks, they lack interpretability which is crucial for this critical problem. This paper proposes an interpretable framework based on Bayesian Networks for traffic accident prediction. To enable the ease of interpretability, we design a dataset construction pipeline to feed the traffic data into the framework while retaining the essential traffic data information. With a concrete case study, our framework can derive a Bayesian Network from a dataset based on the causal relationships between weather and traffic events across the United States. Consequently, our framework enables the prediction of traffic accidents with competitive accuracy while examining how the probability of these events changes under different conditions, thus illustrating transparent relationships between traffic and weather events. Additionally, the visualization of the network simplifies the analysis of relationships between different variables, revealing the primary causes of traffic accidents and ultimately providing a valuable reference for reducing traffic accidents.
results: The effectiveness of the algorithm is demonstrated in three environments ranging in difficulty and in the type of transfer knowledge required.
Abstract
We present an algorithm that learns to imitate expert behavior and can transfer to previously unseen domains without retraining. Such an algorithm is extremely relevant in real-world applications such as robotic learning because 1) reward functions are difficult to design, 2) learned policies from one domain are difficult to deploy in another domain and 3) learning directly in the real world is either expensive or unfeasible due to security concerns. To overcome these constraints, we combine recent advances in Deep RL by using an AnnealedVAE to learn a disentangled state representation and imitate an expert by learning a single Q-function which avoids adversarial training. We demonstrate the effectiveness of our method in 3 environments ranging in difficulty and the type of transfer knowledge required.
results: The paper generalizes both the cumulant and Wick decompositions to new decompositions in which the product function is replaced by an arbitrary function.
Abstract
We review the cumulant decomposition (a way of decomposing the expectation of a product of random variables (e.g. $\mathbb{E}[XYZ]$) into a sum of terms corresponding to partitions of these variables) and the Wick decomposition (a way of decomposing a product of (not necessarily random) variables into a sum of terms corresponding to subsets of the variables). Then we generalize each one to a new decomposition where the product function is generalized to an arbitrary function.
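As a worked example, the third moment expands over the five partitions of $\{X, Y, Z\}$, with $\kappa$ denoting joint cumulants:

```latex
% Cumulant decomposition of E[XYZ]: one term per partition of {X, Y, Z},
% where kappa(X) = E[X] and kappa(X, Y) = Cov(X, Y).
\mathbb{E}[XYZ] = \kappa(X,Y,Z)
  + \kappa(X)\,\kappa(Y,Z) + \kappa(Y)\,\kappa(X,Z) + \kappa(Z)\,\kappa(X,Y)
  + \kappa(X)\,\kappa(Y)\,\kappa(Z).
```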
Enhanced Graph Neural Networks with Ego-Centric Spectral Subgraph Embeddings Augmentation
results: Evaluations on seven datasets and eight baseline models show that ESGEA improves AUC by up to 10% in graph classification tasks and accuracy by up to 7% in node classification tasks, compared to the baselines.
Abstract
Graph Neural Networks (GNNs) have shown remarkable merit in performing various learning-based tasks in complex networks. The superior performance of GNNs often correlates with the availability and quality of node-level features in the input networks. However, for many network applications, such node-level information may be missing or unreliable, thereby limiting the applicability and efficacy of GNNs. To address this limitation, we present a novel approach denoted as Ego-centric Spectral subGraph Embedding Augmentation (ESGEA), which aims to enhance and design node features, particularly in scenarios where information is lacking. Our method leverages the topological structure of the local subgraph to create topology-aware node features. The subgraph features are generated using an efficient spectral graph embedding technique, and they serve as node features that capture the local topological organization of the network. The explicit node features, if present, are then enhanced with the subgraph embeddings in order to improve the overall performance. ESGEA is compatible with any GNN-based architecture and is effective even in the absence of node features. We evaluate the proposed method in a social network graph classification task where node attributes are unavailable, as well as in a node classification task where node features are corrupted or even absent. The evaluation results on seven datasets and eight baseline models indicate up to a 10% improvement in AUC and a 7% improvement in accuracy for graph and node classification tasks, respectively.
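A hedged sketch of ego-centric spectral features: for each node, take its k-hop ego subgraph and use the leading normalized-Laplacian eigenvalues as a topology-aware feature vector. The exact spectral embedding and the enhancement step in ESGEA may differ from this simplification.

```python
# Ego-subgraph spectral features from the normalized Laplacian spectrum.
import networkx as nx
import numpy as np

def ego_spectral_features(G, radius=2, dim=8):
    feats = {}
    for v in G.nodes:
        sub = nx.ego_graph(G, v, radius=radius)
        lam = np.linalg.eigvalsh(nx.normalized_laplacian_matrix(sub).toarray())
        vec = np.sort(lam)[::-1][:dim]               # leading spectrum
        feats[v] = np.pad(vec, (0, dim - len(vec)))  # pad small subgraphs
    return feats

G = nx.karate_club_graph()
print(ego_spectral_features(G)[0])
```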
On the importance of catalyst-adsorbate 3D interactions for relaxed energy predictions
results: While removing binding site information impairs accuracy as expected, the modified models can still predict a system's relaxed energy with remarkably decent MAE on the OC20 dataset.
Abstract
The use of machine learning for material property prediction and discovery has traditionally centered on graph neural networks that incorporate the geometric configuration of all atoms. However, in practice not all this information may be readily available, e.g.~when evaluating the potentially unknown binding of adsorbates to catalyst. In this paper, we investigate whether it is possible to predict a system's relaxed energy in the OC20 dataset while ignoring the relative position of the adsorbate with respect to the electro-catalyst. We consider SchNet, DimeNet++ and FAENet as base architectures and measure the impact of four modifications on model performance: removing edges in the input graph, pooling independent representations, not sharing the backbone weights and using an attention mechanism to propagate non-geometric relative information. We find that while removing binding site information impairs accuracy as expected, modified models are able to predict relaxed energies with remarkably decent MAE. Our work suggests future research directions in accelerated materials discovery where information on reactant configurations can be reduced or altogether omitted.
Machine Learning Quantum Systems with Magnetic p-bits
results: The study shows that such probabilistic computers enable scalable and energy-efficient computation, particularly suited to an emerging field combining machine learning and quantum physics.
Abstract
The slowing down of Moore's Law has led to a crisis as the computing workloads of Artificial Intelligence (AI) algorithms continue skyrocketing. There is an urgent need for scalable and energy-efficient hardware catering to the unique requirements of AI algorithms and applications. In this environment, probabilistic computing with p-bits emerged as a scalable, domain-specific, and energy-efficient computing paradigm, particularly useful for probabilistic applications and algorithms. In particular, spintronic devices such as stochastic magnetic tunnel junctions (sMTJ) show great promise in designing integrated p-computers. Here, we examine how a scalable probabilistic computer with such magnetic p-bits can be useful for an emerging field combining machine learning and quantum physics.
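For intuition, a software emulation of p-bit dynamics using the standard update rule $m_i = \mathrm{sgn}(\tanh(I_i) + r)$ with $r$ uniform on $(-1, 1)$; hardware sMTJs realize this fluctuation physically, and the two-p-bit coupling matrix below is an illustrative example, not a circuit from the paper.

```python
# Sequential p-bit updates sample a Boltzmann-like distribution over
# binary states, with inputs I_i from couplings J and biases h.
import numpy as np

def pbit_sweeps(J, h, n_sweeps=1000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(h)
    m = rng.choice([-1.0, 1.0], size=n)
    samples = []
    for _ in range(n_sweeps):
        for i in rng.permutation(n):          # asynchronous updates
            I = J[i] @ m + h[i]
            m[i] = np.sign(np.tanh(I) + rng.uniform(-1, 1))
        samples.append(m.copy())
    return np.array(samples)

J = np.array([[0.0, 1.0], [1.0, 0.0]])        # two coupled p-bits
s = pbit_sweeps(J, h=np.zeros(2))
print((s[:, 0] * s[:, 1]).mean())             # strong positive correlation
```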
Tertiary Lymphoid Structures Generation through Graph-based Diffusion
results: The authors demonstrate the utility of the learned generative model for data augmentation in a TLS classification task. To their knowledge, this is the first work to leverage graph-based diffusion models to generate biologically meaningful cell graphs.
Abstract
Graph-based representation approaches have been proven to be successful in the analysis of biomedical data, due to their capability of capturing intricate dependencies between biological entities, such as the spatial organization of different cell types in a tumor tissue. However, to further enhance our understanding of the underlying governing biological mechanisms, it is important to accurately capture the actual distributions of such complex data. Graph-based deep generative models are specifically tailored to accomplish that. In this work, we leverage state-of-the-art graph-based diffusion models to generate biologically meaningful cell-graphs. In particular, we show that the adopted graph diffusion model is able to accurately learn the distribution of cells in terms of their tertiary lymphoid structures (TLS) content, a well-established biomarker for evaluating the cancer progression in oncology research. Additionally, we further illustrate the utility of the learned generative models for data augmentation in a TLS classification task. To the best of our knowledge, this is the first work that leverages the power of graph diffusion models in generating meaningful biological cell structures.
results: Experiments on datasets of deforming 3D shapes, single-class encoding, and multi-class encoding demonstrate the efficacy, generalizability, and scalability of the method across a wide range of applications.
Abstract
Neural shape representation generally refers to representing 3D geometry using neural networks, e.g., to compute a signed distance or occupancy value at a specific spatial position. Previous methods tend to rely on the auto-decoder paradigm, which often requires densely-sampled and accurate signed distances to be known during training and testing, as well as an additional optimization loop during inference. This introduces a lot of computational overhead, in addition to having to compute signed distances analytically, even during testing. In this paper, we present a novel encoder-decoder neural network for embedding 3D shapes in a single forward pass. Our architecture is based on a multi-scale hybrid system incorporating graph-based and voxel-based components, as well as a continuously differentiable decoder. Furthermore, the network is trained to solve the Eikonal equation and only requires knowledge of the zero-level set for training and inference. Additional volumetric samples can be generated on-the-fly, and incorporated in an unsupervised manner. This means that in contrast to most previous work, our network is able to output valid signed distance fields without explicit prior knowledge of non-zero distance values or shape occupancy. In other words, our network computes approximate solutions to the boundary-valued Eikonal equation. It also requires only a single forward pass during inference, instead of the common latent code optimization. We further propose a modification of the loss function in case that surface normals are not well defined, e.g., in the context of non-watertight surface-meshes and non-manifold geometry. We finally demonstrate the efficacy, generalizability and scalability of our method on datasets consisting of deforming 3D shapes, single class encoding and multiclass encoding, showcasing a wide range of possible applications.
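A sketch of the Eikonal training objective the abstract refers to: surface samples should map to zero while the gradient norm of the predicted field is driven to one everywhere ($|\nabla f| = 1$). The sampling scheme, network, and weighting below are illustrative stand-ins.

```python
# Eikonal loss: zero-level-set fit on surface points plus a unit-gradient
# penalty on free-space points, using autograd for the spatial gradient.
import torch

def eikonal_loss(f, surface_pts, space_pts, w=0.1):
    on_surf = f(surface_pts).abs().mean()     # zero-level-set term
    space_pts = space_pts.requires_grad_(True)
    grad, = torch.autograd.grad(f(space_pts).sum(), space_pts, create_graph=True)
    return on_surf + w * ((grad.norm(dim=-1) - 1.0) ** 2).mean()

f = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.Softplus(),
                        torch.nn.Linear(64, 1))
loss = eikonal_loss(f, torch.randn(128, 3), torch.randn(256, 3))
loss.backward()
```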
Implicit Variational Inference for High-Dimensional Posteriors
paper_authors: Anshuk Uppal, Kristoffer Stensbo-Smidt, Wouter K. Boomsma, Jes Frellsen
for: Advancing variational inference in Bayesian neural networks by approximating complex multimodal and correlated posteriors with neural samplers that specify implicit distributions.
methods: Novel bounds obtained by locally linearizing the neural sampler, distinct from existing methods that rely on additional discriminator networks and unstable adversarial objectives; a new sampler architecture that, for the first time, enables implicit distributions over millions of latent variables, addressing computational concerns through differentiable numerical approximations.
results: The method recovers correlations across layers in large Bayesian neural networks, a property that is crucial for a network's performance but notoriously challenging to achieve, and the resulting expressive posteriors outperform state-of-the-art uncertainty quantification methods in downstream tasks, validating the effectiveness of the training algorithm and the quality of the learned implicit approximation.
Abstract
In variational inference, the benefits of Bayesian models rely on accurately capturing the true posterior distribution. We propose using neural samplers that specify implicit distributions, which are well-suited for approximating complex multimodal and correlated posteriors in high-dimensional spaces. Our approach advances inference using implicit distributions by introducing novel bounds that come about by locally linearising the neural sampler. This is distinct from existing methods that rely on additional discriminator networks and unstable adversarial objectives. Furthermore, we present a new sampler architecture that, for the first time, enables implicit distributions over millions of latent variables, addressing computational concerns by using differentiable numerical approximations. Our empirical analysis indicates our method is capable of recovering correlations across layers in large Bayesian neural networks, a property that is crucial for a network's performance but notoriously challenging to achieve. To the best of our knowledge, no other method has been shown to accomplish this task for such large models. Through experiments in downstream tasks, we demonstrate that our expressive posteriors outperform state-of-the-art uncertainty quantification methods, validating the effectiveness of our training algorithm and the quality of the learned implicit approximation.
The Lattice Overparametrization Paradigm for the Machine Learning of Lattice Operators
results: The stochastic lattice gradient descent algorithm effectively learns lattice operators, and their properties can be deduced by computing their basis from the overparametrization. The learning paradigm offers control, transparency, and interpretability, properties that modern neural-network-based methods lack.
Abstract
The machine learning of lattice operators has three possible bottlenecks. From a statistical standpoint, it is necessary to design a constrained class of operators based on prior information with low bias, and low complexity relative to the sample size. From a computational perspective, there should be an efficient algorithm to minimize an empirical error over the class. From an understanding point of view, the properties of the learned operator need to be derived, so its behavior can be theoretically understood. The statistical bottleneck can be overcome due to the rich literature about the representation of lattice operators, but there is no general learning algorithm for them. In this paper, we discuss a learning paradigm in which, by overparametrizing a class via elements in a lattice, an algorithm for minimizing functions in a lattice is applied to learn. We present the stochastic lattice gradient descent algorithm as a general algorithm to learn on constrained classes of operators as long as a lattice overparametrization of it is fixed, and we discuss previous works which are proves of concept. Moreover, if there are algorithms to compute the basis of an operator from its overparametrization, then its properties can be deduced and the understanding bottleneck is also overcome. This learning paradigm has three properties that modern methods based on neural networks lack: control, transparency and interpretability. Nowadays, there is an increasing demand for methods with these characteristics, and we believe that mathematical morphology is in a unique position to supply them. The lattice overparametrization paradigm could be a missing piece for it to achieve its full potential within modern machine learning.
iTransformer: Inverted Transformers Are Effective for Time Series Forecasting
results: Achieves consistent state-of-the-art performance on several real-world datasets, improving the Transformer family's performance and generalization across different variates while making better use of arbitrary lookback windows.
Abstract
The recent boom of linear forecasting models questions the ongoing passion for architectural modifications of Transformer-based forecasters. These forecasters leverage Transformers to model the global dependencies over temporal tokens of time series, with each token formed by multiple variates of the same timestamp. However, Transformer is challenged in forecasting series with larger lookback windows due to performance degradation and computation explosion. Besides, the unified embedding for each temporal token fuses multiple variates with potentially unaligned timestamps and distinct physical measurements, which may fail in learning variate-centric representations and result in meaningless attention maps. In this work, we reflect on the competent duties of Transformer components and repurpose the Transformer architecture without any adaptation on the basic components. We propose iTransformer that simply inverts the duties of the attention mechanism and the feed-forward network. Specifically, the time points of individual series are embedded into variate tokens which are utilized by the attention mechanism to capture multivariate correlations; meanwhile, the feed-forward network is applied for each variate token to learn nonlinear representations. The iTransformer model achieves consistent state-of-the-art on several real-world datasets, which further empowers the Transformer family with promoted performance, generalization ability across different variates, and better utilization of arbitrary lookback windows, making it a nice alternative as the fundamental backbone of time series forecasting.
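A minimal sketch of the inverted layout: each variate's full lookback window is embedded as one token, self-attention mixes variates rather than time steps, and a feed-forward head decodes each variate token into the forecast. Dimensions and layer choices are illustrative.

```python
# Inverted Transformer block: tokens are variates, not time steps.
import torch
import torch.nn as nn

class InvertedBlock(nn.Module):
    def __init__(self, lookback, d_model=64, horizon=24):
        super().__init__()
        self.embed = nn.Linear(lookback, d_model)    # time axis -> token
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                                 nn.Linear(d_model, d_model))
        self.head = nn.Linear(d_model, horizon)

    def forward(self, x):                 # x: (batch, time, variates)
        tokens = self.embed(x.transpose(1, 2))        # (batch, variates, d_model)
        mixed, _ = self.attn(tokens, tokens, tokens)  # attention across variates
        out = self.ffn(mixed + tokens)                # per-variate nonlinearity
        return self.head(out).transpose(1, 2)         # (batch, horizon, variates)

model = InvertedBlock(lookback=96)
print(model(torch.randn(8, 96, 7)).shape)  # torch.Size([8, 24, 7])
```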
Robustness May be More Brittle than We Think under Different Degrees of Distribution Shifts
results: the authors find that model robustness can be brittle and inconsistent under different degrees of distribution shift, with hidden risks at larger shift degrees; moreover, large-scale pre-trained models such as CLIP are sensitive to even minute distribution shifts in novel downstream tasks.Abstract
Out-of-distribution (OOD) generalization is a complicated problem due to the idiosyncrasies of possible distribution shifts between training and test domains. Most benchmarks employ diverse datasets to address this issue; however, the degree of the distribution shift between the training domains and the test domains of each dataset remains largely fixed. This may lead to biased conclusions that either underestimate or overestimate the actual OOD performance of a model. Our study delves into a more nuanced evaluation setting that covers a broad range of shift degrees. We show that the robustness of models can be quite brittle and inconsistent under different degrees of distribution shifts, and therefore one should be more cautious when drawing conclusions from evaluations under a limited range of degrees. In addition, we observe that large-scale pre-trained models, such as CLIP, are sensitive to even minute distribution shifts of novel downstream tasks. This indicates that while pre-trained representations may help improve downstream in-distribution performance, they could have minimal or even adverse effects on generalization in certain OOD scenarios of the downstream task if not used properly. In light of these findings, we encourage future research to conduct evaluations across a broader range of shift degrees whenever possible.
Discovering Interpretable Physical Models Using Symbolic Regression and Discrete Exterior Calculus
methods: the approach combines Symbolic Regression (SR) with Discrete Exterior Calculus (DEC), using a natural, general-purpose discrete mathematical language to derive and analyze physical models. DEC supplies topological building blocks for discrete field theories and enables a strongly-typed SR procedure that guarantees the mathematical consistency of expressions and reduces the search space.
results: using this approach, the authors successfully re-discover three models of continuum physics: the Poisson equation, Euler's Elastica, and the equations of Linear Elasticity. Owing to its general-purpose nature, the method can be applied to a wide range of physical modeling problems.Abstract
Computational modeling is a key resource to gather insight into physical systems in modern scientific research and engineering. While access to large amount of data has fueled the use of Machine Learning (ML) to recover physical models from experiments and increase the accuracy of physical simulations, purely data-driven models have limited generalization and interpretability. To overcome these limitations, we propose a framework that combines Symbolic Regression (SR) and Discrete Exterior Calculus (DEC) for the automated discovery of physical models starting from experimental data. Since these models consist of mathematical expressions, they are interpretable and amenable to analysis, and the use of a natural, general-purpose discrete mathematical language for physics favors generalization with limited input data. Importantly, DEC provides building blocks for the discrete analogue of field theories, which are beyond the state-of-the-art applications of SR to physical problems. Further, we show that DEC allows to implement a strongly-typed SR procedure that guarantees the mathematical consistency of the recovered models and reduces the search space of symbolic expressions. Finally, we prove the effectiveness of our methodology by re-discovering three models of Continuum Physics from synthetic experimental data: Poisson equation, the Euler's Elastica and the equations of Linear Elasticity. Thanks to their general-purpose nature, the methods developed in this paper may be applied to diverse contexts of physical modeling.
results: the identified timeframe of 21 April 2019 to 9 August 2019 loses only 0.75% accuracy compared with the full time series, and the LRP-derived important timesteps also reveal small details in the input values that distinguish the different classes.Abstract
We propose an approach for early crop classification through identifying important timesteps with eXplainable AI (XAI) methods. Our approach consists of training a baseline crop classification model to carry out layer-wise relevance propagation (LRP) so that the salient time steps can be identified. We chose a selected number of such important time indices to create the bounding region of the shortest possible classification timeframe. We identified the period 21st April 2019 to 9th August 2019 as having the best trade-off in terms of accuracy and earliness. This timeframe only suffers a 0.75% loss in accuracy as compared to using the full timeseries. We observed that the LRP-derived important timesteps also highlight small details in input values that differentiate between the different classes.
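A small sketch of the window-selection step follows, assuming per-timestep LRP relevance scores have already been aggregated into a single array; the coverage threshold and array length are hypothetical.

```python
import numpy as np

def earliest_window(relevance: np.ndarray, coverage: float = 0.9):
    """Shortest contiguous timestep window whose summed LRP relevance
    reaches the requested share of the total relevance."""
    target = coverage * relevance.sum()
    best = (0, len(relevance) - 1)
    for start in range(len(relevance)):
        csum = np.cumsum(relevance[start:])
        hit = int(np.searchsorted(csum, target))
        if hit < len(csum) and hit < best[1] - best[0]:
            best = (start, start + hit)
    return best  # (first, last) timestep indices

rel = np.random.rand(36)      # e.g. relevance for 36 acquisition dates
print(earliest_window(rel))
```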
Deep Learning reconstruction with uncertainty estimation for $γ$ photon interaction in fast scintillator detectors
results: the results demonstrate the effectiveness and reliability of the method and highlight the importance of estimating the uncertainty of its predictions. The authors discuss its potential impact on improving PET imaging quality, show how the results can be used to improve the model and its use in applications, and note that the approach extends to use cases beyond PET imaging.Abstract
This article presents a physics-informed deep learning method for the quantitative estimation of the spatial coordinates of gamma interactions within a monolithic scintillator, with a focus on Positron Emission Tomography (PET) imaging. A Density Neural Network approach is designed to estimate the 2-dimensional gamma photon interaction coordinates in a fast lead tungstate (PbWO4) monolithic scintillator detector. We introduce a custom loss function to estimate the inherent uncertainties associated with the reconstruction process and to incorporate the physical constraints of the detector. This unique combination allows for more robust and reliable position estimations and the obtained results demonstrate the effectiveness of the proposed approach and highlights the significant benefits of the uncertainties estimation. We discuss its potential impact on improving PET imaging quality and show how the results can be used to improve the exploitation of the model, to bring benefits to the application and how to evaluate the validity of the given prediction and the associated uncertainties. Importantly, our proposed methodology extends beyond this specific use case, as it can be generalized to other applications beyond PET imaging.
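One common way to realize such a density head is a heteroscedastic Gaussian output trained with a negative log-likelihood loss; the sketch below follows that pattern and is an assumption, not the paper's exact custom loss (which also encodes detector constraints).

```python
import torch
import torch.nn as nn

class DensityHead(nn.Module):
    """Predicts 2-D interaction coordinates plus a per-coordinate variance."""
    def __init__(self, in_features: int):
        super().__init__()
        self.mu = nn.Linear(in_features, 2)       # (x, y) estimate
        self.log_var = nn.Linear(in_features, 2)  # log-variance for stability

    def forward(self, h):
        return self.mu(h), self.log_var(h)

def gaussian_nll(mu, log_var, target):
    # Large predicted variance down-weights the residual but is itself
    # penalised, so the network reports honest uncertainty.
    return (0.5 * (log_var + (target - mu) ** 2 / log_var.exp())).mean()

h = torch.randn(32, 128)                          # backbone features (assumed)
mu, log_var = DensityHead(128)(h)
loss = gaussian_nll(mu, log_var, torch.rand(32, 2))
```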
Statistical properties and privacy guarantees of an original distance-based fully synthetic data generation method
results: the computed metrics show a satisfactory level of privacy protection for every synthetic dataset, especially against attribute disclosure attacks, when the full framework is used. Membership disclosure attacks are formally prevented without significantly altering the data, machine learning approaches show a low success rate for simulated singling-out and linkability attacks, and distributional and inferential similarity to the original data is high across datasets.Abstract
Introduction: The amount of data generated by original research is growing exponentially. Publicly releasing them is recommended to comply with the Open Science principles. However, data collected from human participants cannot be released as-is without raising privacy concerns. Fully synthetic data represent a promising answer to this challenge. This approach is explored by the French Centre de Recherche en {\'E}pid{\'e}miologie et Sant{\'e} des Populations in the form of a synthetic data generation framework based on Classification and Regression Trees and an original distance-based filtering. The goal of this work was to develop a refined version of this framework and to assess its risk-utility profile with empirical and formal tools, including novel ones developed for the purpose of this evaluation.Materials and Methods: Our synthesis framework consists of four successive steps, each of which is designed to prevent specific risks of disclosure. We assessed its performance by applying two or more of these steps to a rich epidemiological dataset. Privacy and utility metrics were computed for each of the resulting synthetic datasets, which were further assessed using machine learning approaches.Results: Computed metrics showed a satisfactory level of protection against attribute disclosure attacks for each synthetic dataset, especially when the full framework was used. Membership disclosure attacks were formally prevented without significantly altering the data. Machine learning approaches showed a low risk of success for simulated singling out and linkability attacks. Distributional and inferential similarity with the original data were high with all datasets.Discussion: This work showed the technical feasibility of generating publicly releasable synthetic data using a multi-step framework. Formal and empirical tools specifically developed for this demonstration are a valuable contribution to this field. Further research should focus on the extension and validation of these tools, in an effort to specify the intrinsic qualities of alternative data synthesis methods.Conclusion: By successfully assessing the quality of data produced using a novel multi-step synthetic data generation framework, we showed the technical and conceptual soundness of the Open-CESP initiative, which seems ripe for full-scale implementation.
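The distance-based filtering step can be pictured as discarding synthetic records that sit too close to any real record; a minimal sketch with scikit-learn follows, where the distance threshold and feature scaling are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def distance_filter(real, synthetic, min_dist):
    """Keep only synthetic rows whose nearest real record lies at least
    min_dist away (Euclidean, on suitably scaled features)."""
    nn_index = NearestNeighbors(n_neighbors=1).fit(real)
    dist, _ = nn_index.kneighbors(synthetic)
    return synthetic[dist[:, 0] >= min_dist]

real = np.random.rand(1000, 8)       # placeholder for the real records
fake = np.random.rand(1500, 8)       # placeholder for CART-synthesised records
safe = distance_filter(real, fake, min_dist=0.05)
```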
An Edge-Aware Graph Autoencoder Trained on Scale-Imbalanced Data for Travelling Salesman Problems
results: experiments on 50,000 TSP instances demonstrate that the method achieves highly competitive performance across different problem scales.Abstract
Recent years have witnessed a surge in research on machine learning for combinatorial optimization since learning-based approaches can outperform traditional heuristics and approximate exact solvers at a lower computation cost. However, most existing work on supervised neural combinatorial optimization focuses on TSP instances with a fixed number of cities and requires large amounts of training samples to achieve a good performance, making them less practical to be applied to realistic optimization scenarios. This work aims to develop a data-driven graph representation learning method for solving travelling salesman problems (TSPs) with various numbers of cities. To this end, we propose an edge-aware graph autoencoder (EdgeGAE) model that can learn to solve TSPs after being trained on solution data of various sizes with an imbalanced distribution. We formulate the TSP as a link prediction task on sparse connected graphs. A residual gated encoder is trained to learn latent edge embeddings, followed by an edge-centered decoder to output link predictions in an end-to-end manner. To improve the model's generalization capability of solving large-scale problems, we introduce an active sampling strategy into the training process. In addition, we generate a benchmark dataset containing 50,000 TSP instances with a size from 50 to 500 cities, following an extremely scale-imbalanced distribution, making it ideal for investigating the model's performance for practical applications. We conduct experiments using different amounts of training data with various scales, and the experimental results demonstrate that the proposed data-driven approach achieves a highly competitive performance among state-of-the-art learning-based methods for solving TSPs.
Data-level hybrid strategy selection for disk fault prediction model based on multivariate GAN
results: the study shows that combining GAN-synthesized data with a genetic algorithm improves disk fault classification accuracy and better handles the class imbalance in the data.Abstract
Data class imbalance is a common problem in classification problems, where minority class samples are often more important and more costly to misclassify in a classification task. Therefore, it is very important to solve the data class imbalance classification problem. The SMART dataset exhibits an evident class imbalance, comprising a substantial quantity of healthy samples and a comparatively limited number of defective samples. This dataset serves as a reliable indicator of the disc's health status. In this paper, we obtain the best balanced disk SMART dataset for a specific classification model by mixing and integrating the data synthesised by multivariate generative adversarial networks (GAN) to balance the disk SMART dataset at the data level; and combine it with genetic algorithms to obtain higher disk fault classification prediction accuracy on a specific classification model.
Disk failure prediction based on multi-layer domain adaptive learning
results: improves the ability to predict disk failures, especially for disk data with few failure samples. In more detail:
for: The paper is written for predicting disk failures, which is an important task in large-scale data storage systems.
methods: The paper proposes a novel method for predicting disk failures by leveraging multi-layer domain adaptive learning techniques. This method involves selecting disk data with numerous faults as the source domain and disk data with fewer faults as the target domain, and training a feature extraction network with the selected origin and destination domains.
results: The proposed technique is demonstrated to be effective in generating a reliable prediction model and improving the ability to predict failures on disk data with few failure samples.Abstract
Large scale data storage is susceptible to failure. As disks are damaged and replaced, traditional machine learning models, which rely on historical data to make predictions, struggle to accurately predict disk failures. This paper presents a novel method for predicting disk failures by leveraging multi-layer domain adaptive learning techniques. First, disk data with numerous faults is selected as the source domain, and disk data with fewer faults is selected as the target domain. The feature extraction network is then trained on the selected source and target domains. The contrast between the two domains facilitates the transfer of diagnostic knowledge from the source domain to the target domain. The experimental findings demonstrate that the proposed technique can generate a reliable prediction model and improve the ability to predict failures on disk data with few failure samples.
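The abstract does not spell out the alignment loss; a common choice for contrasting two domains in multi-layer domain adaptation is Maximum Mean Discrepancy (MMD), sketched below as an assumed stand-in for the paper's mechanism.

```python
import torch

def rbf_mmd(x, y, sigma=1.0):
    """Maximum Mean Discrepancy with an RBF kernel: small when the source-
    and target-domain feature distributions are similar."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

src = torch.randn(64, 32)   # features of fault-rich source disks (placeholder)
tgt = torch.randn(64, 32)   # features of fault-scarce target disks (placeholder)
alignment_loss = rbf_mmd(src, tgt)   # added to the task loss, layer by layer
```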
AttributionLab: Faithfulness of Feature Attribution Under Controllable Environments
results: within the AttributionLab synthetic environment, manually setting the network weights together with the designed data pins down exactly which input features the network uses, which makes it possible to test the faithfulness of attribution methods and filter out unreliable ones.Abstract
Feature attribution explains neural network outputs by identifying relevant input features. How do we know if the identified features are indeed relevant to the network? This notion is referred to as faithfulness, an essential property that reflects the alignment between the identified (attributed) features and the features used by the model. One recent trend to test faithfulness is to design the data such that we know which input features are relevant to the label and then train a model on the designed data. Subsequently, the identified features are evaluated by comparing them with these designed ground truth features. However, this idea has the underlying assumption that the neural network learns to use all and only these designed features, while there is no guarantee that the learning process trains the network in this way. In this paper, we solve this missing link by explicitly designing the neural network by manually setting its weights, along with designing data, so we know precisely which input features in the dataset are relevant to the designed network. Thus, we can test faithfulness in AttributionLab, our designed synthetic environment, which serves as a sanity check and is effective in filtering out attribution methods. If an attribution method is not faithful in a simple controlled environment, it can be unreliable in more complex scenarios. Furthermore, the AttributionLab environment serves as a laboratory for controlled experiments through which we can study feature attribution methods, identify issues, and suggest potential improvements.
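The design principle can be illustrated with a toy network whose weights are set by hand, so the ground-truth relevant features are known exactly; this is an editorial toy far smaller than the AttributionLab environment itself.

```python
import torch
import torch.nn as nn

# Toy "designed" model: only the first two inputs can influence the output,
# so a faithful attribution method must give the last two features zero
# relevance. Weights are set by hand, never learned.
net = nn.Linear(4, 1, bias=False)
with torch.no_grad():
    net.weight.copy_(torch.tensor([[1.0, -2.0, 0.0, 0.0]]))

x = torch.tensor([[0.5, 0.3, 0.9, -0.7]], requires_grad=True)
net(x).sum().backward()
print(x.grad)   # gradient attribution; the last two entries must be 0
```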
Self-Supervised Dataset Distillation for Transfer Learning
results: experiments validate the effectiveness of the method, which also reduces computational cost and admits a closed-form kernel ridge regression solution for the linear head.Abstract
Dataset distillation methods have achieved remarkable success in distilling a large dataset into a small set of representative samples. However, they are not designed to produce a distilled dataset that can be effectively used for facilitating self-supervised pre-training. To this end, we propose a novel problem of distilling an unlabeled dataset into a set of small synthetic samples for efficient self-supervised learning (SSL). We first prove that a gradient of synthetic samples with respect to a SSL objective in naive bilevel optimization is \textit{biased} due to the randomness originating from data augmentations or masking. To address this issue, we propose to minimize the mean squared error (MSE) between a model's representations of the synthetic examples and their corresponding learnable target feature representations for the inner objective, which does not introduce any randomness. Our primary motivation is that the model obtained by the proposed inner optimization can mimic the \textit{self-supervised target model}. To achieve this, we also introduce the MSE between representations of the inner model and the self-supervised target model on the original full dataset for outer optimization. Lastly, assuming that a feature extractor is fixed, we only optimize a linear head on top of the feature extractor, which allows us to reduce the computational cost and obtain a closed-form solution of the head with kernel ridge regression. We empirically validate the effectiveness of our method on various applications involving transfer learning.
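The closed-form head mentioned at the end is standard kernel ridge regression on frozen features; a minimal sketch follows, with a linear kernel as a simplifying assumption.

```python
import numpy as np

def krr_head(features, targets, lam=1e-3):
    """Closed-form kernel ridge regression on frozen features:
    alpha = (K + lam I)^{-1} Y with linear kernel K = F F^T."""
    K = features @ features.T
    alpha = np.linalg.solve(K + lam * np.eye(len(K)), targets)
    return lambda new: new @ features.T @ alpha

F = np.random.randn(100, 64)    # extractor outputs on distilled samples (placeholder)
Y = np.random.randn(100, 10)    # learnable target representations (placeholder)
predict = krr_head(F, Y)
print(predict(np.random.randn(5, 64)).shape)   # (5, 10)
```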
Runway Sign Classifier: A DAL C Certifiable Machine Learning System
paper_authors: Konstantin Dmitriev, Johann Schumann, Islam Bostanov, Mostafa Abdelhamid, Florian Holzapfel
For: This paper aims to address the certification challenges of Machine Learning (ML) based systems for medium criticality airborne applications.* Methods: The authors use a Deep Neural Network (DNN) for airport sign detection and classification, and employ an established architectural mitigation technique involving two redundant and dissimilar DNNs. They also use novel ML-specific data management techniques to enhance this approach.* Results: The authors demonstrate compliance with Design Assurance Level (DAL) C, which is a more stringent requirement than their previous work that achieved DAL D.Abstract
In recent years, the remarkable progress of Machine Learning (ML) technologies within the domain of Artificial Intelligence (AI) systems has presented unprecedented opportunities for the aviation industry, paving the way for further advancements in automation, including the potential for single pilot or fully autonomous operation of large commercial airplanes. However, ML technology faces major incompatibilities with existing airborne certification standards, such as ML model traceability and explainability issues or the inadequacy of traditional coverage metrics. Certification of ML-based airborne systems using current standards is problematic due to these challenges. This paper presents a case study of an airborne system utilizing a Deep Neural Network (DNN) for airport sign detection and classification. Building upon our previous work, which demonstrates compliance with Design Assurance Level (DAL) D, we upgrade the system to meet the more stringent requirements of Design Assurance Level C. To achieve DAL C, we employ an established architectural mitigation technique involving two redundant and dissimilar Deep Neural Networks. The application of novel ML-specific data management techniques further enhances this approach. This work is intended to illustrate how the certification challenges of ML-based systems can be addressed for medium criticality airborne applications.
Variance Reduced Online Gradient Descent for Kernelized Pairwise Learning with Limited Memory
results: the theoretical analysis shows that variance-reduced online gradients lead to an improved sublinear regret bound, and experiments on real-world data show the algorithm outperforms both kernelized and linear online pairwise learning algorithms.Abstract
Pairwise learning is essential in machine learning, especially for problems involving loss functions defined on pairs of training examples. Online gradient descent (OGD) algorithms have been proposed to handle online pairwise learning, where data arrives sequentially. However, the pairwise nature of the problem makes scalability challenging, as the gradient computation for a new sample involves all past samples. Recent advancements in OGD algorithms have aimed to reduce the complexity of calculating online gradients, achieving complexities less than $O(T)$ and even as low as $O(1)$. However, these approaches are primarily limited to linear models and have induced variance. In this study, we propose a limited memory OGD algorithm that extends to kernel online pairwise learning while improving the sublinear regret. Specifically, we establish a clear connection between the variance of online gradients and the regret, and construct online gradients using the most recent stratified samples with a limited buffer of size of $s$ representing all past data, which have a complexity of $O(sT)$ and employs $O(\sqrt{T}\log{T})$ random Fourier features for kernel approximation. Importantly, our theoretical results demonstrate that the variance-reduced online gradients lead to an improved sublinear regret bound. The experiments on real-world datasets demonstrate the superiority of our algorithm over both kernelized and linear online pairwise learning algorithms.
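The kernel-approximation ingredient is standard random Fourier features; a sketch under the usual RBF-kernel convention k(x, y) = exp(-gamma * ||x - y||^2) follows, with the feature dimension D as a free parameter.

```python
import numpy as np

def random_fourier_features(X, D, gamma=1.0, seed=0):
    """z(x) = sqrt(2/D) * cos(W x + b) with W ~ N(0, 2*gamma*I) and
    b ~ U[0, 2*pi], so z(x) . z(y) ~ exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(X.shape[1], D))
    b = rng.uniform(0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

X = np.random.randn(200, 16)
Z = random_fourier_features(X, D=128)
# Z @ Z.T approximates the RBF Gram matrix, so pairwise gradients can be
# formed against a small buffer of s past samples in this feature space.
```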
An improved CTGAN for data processing method of imbalanced disk failure
for: solves the problem of insufficient failure data and imbalance between normal and failure data in disk failure diagnosis.
methods: uses an improved Conditional Tabular Generative Adversarial Network (CTGAN) with a residual network, a classifier for specific category discrimination, and a residual-network-based discriminator.
results: the synthesized data can further improve the fault diagnosis accuracy of the classifier, as demonstrated by the experimental results.Abstract
Disk failure diagnosis suffers from insufficient failure data and an imbalance between the numbers of normal and failure samples. Existing Conditional Tabular Generative Adversarial Network (CTGAN) deep learning methods have been proven effective in handling imbalanced disk failure data, but CTGAN cannot learn the internal information of disk failure data very well. In this paper, we propose a fault diagnosis method based on an improved CTGAN, adding a classifier for specific category discrimination and a discriminator based on a residual network. We name it Residual Conditional Tabular Generative Adversarial Networks (RCTGAN). Firstly, a residual network is utilized to enhance the stability of the system. RCTGAN uses a small amount of real failure data to synthesize fake fault data; then, the synthesized data is mixed with the real data to balance the amounts of normal and failure data; finally, four classifier models (multilayer perceptron, support vector machine, decision tree, random forest) are trained using the balanced data set, and their performance is evaluated using G-mean. The experimental results show that the data synthesized by the RCTGAN can further improve the fault diagnosis accuracy of the classifier.
Asynchronous Federated Learning with Incentive Mechanism Based on Contract Theory
results: experiments on the MNIST dataset show test accuracy 3.12% and 5.84% higher than FedAvg and FedProx without attacks, respectively, and a 1.35% improvement over the ideal local SGD under attacks; for the same target accuracy, the framework also requires notably less computation time.Abstract
To address the challenges posed by the heterogeneity inherent in federated learning (FL) and to attract high-quality clients, various incentive mechanisms have been employed. However, existing incentive mechanisms are typically utilized in conventional synchronous aggregation, resulting in significant straggler issues. In this study, we propose a novel asynchronous FL framework that integrates an incentive mechanism based on contract theory. Within the incentive mechanism, we strive to maximize the utility of the task publisher by adaptively adjusting clients' local model training epochs, taking into account factors such as time delay and test accuracy. In the asynchronous scheme, considering client quality, we devise aggregation weights and an access control algorithm to facilitate asynchronous aggregation. Through experiments conducted on the MNIST dataset, the simulation results demonstrate that the test accuracy achieved by our framework is 3.12% and 5.84% higher than that achieved by FedAvg and FedProx without any attacks, respectively. The framework exhibits a 1.35% accuracy improvement over the ideal Local SGD under attacks. Furthermore, aiming for the same target accuracy, our framework demands notably less computation time than both FedAvg and FedProx.
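Asynchronous aggregation typically discounts stale updates; the sketch below shows one simple staleness- and quality-weighted mixing rule, which is an illustrative assumption rather than the paper's contract-derived weights.

```python
import numpy as np

def async_aggregate(global_w, client_w, staleness, quality, base_lr=0.5):
    """Mix one client's update into the global model, discounting stale
    updates and scaling by a client-quality score in [0, 1]."""
    alpha = base_lr * quality / (1.0 + staleness)
    return (1 - alpha) * global_w + alpha * client_w

w = np.zeros(10)
# A client trained on a 3-round-old model with quality score 0.8 arrives:
w = async_aggregate(w, np.ones(10), staleness=3, quality=0.8)
```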
for: This paper is written for developers who need to continually update or correct machine learning models to ensure high prediction accuracy, particularly in complex systems or software.
methods: The paper proposes a correction rule mining approach to acquire a comprehensive list of rules that describe inaccurate subpopulations and how to correct them. The proposed algorithm combines frequent itemset mining and a unique pruning technique for correction rules.
results: The paper found that the proposed algorithm discovered various rules that help collect data insufficiently learned, directly correct model outputs, and analyze concept drift.Abstract
Machine learning models need to be continually updated or corrected to ensure that the prediction accuracy remains consistently high. In this study, we consider scenarios where developers should be careful to change the prediction results by the model correction, such as when the model is part of a complex system or software. In such scenarios, the developers want to control the specification of the corrections. To achieve this, the developers need to understand which subpopulations of the inputs get inaccurate predictions by the model. Therefore, we propose correction rule mining to acquire a comprehensive list of rules that describe inaccurate subpopulations and how to correct them. We also develop an efficient correction rule mining algorithm that is a combination of frequent itemset mining and a unique pruning technique for correction rules. We observed that the proposed algorithm found various rules which help to collect data insufficiently learned, directly correct model outputs, and analyze concept drift.
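A simplified stand-in for correction rule mining is sketched below: enumerate small (feature, value) itemsets and keep those that are frequent and markedly over-represented among misclassified rows. The thresholds, the size-at-most-2 itemsets, and the toy data are assumptions; the paper's algorithm additionally applies a correction-rule-specific pruning technique.

```python
from itertools import combinations
from collections import Counter

def mine_inaccurate_rules(rows, wrong, min_support=0.05, min_lift=2.0):
    """Frequent (feature, value) itemsets that are over-represented among
    misclassified rows -- a simplified stand-in for correction rule mining."""
    base_err = sum(wrong) / len(rows)
    counts, errs = Counter(), Counter()
    for row, is_wrong in zip(rows, wrong):
        for k in (1, 2):                       # itemsets of size 1 and 2
            for combo in combinations(sorted(row.items()), k):
                counts[combo] += 1
                errs[combo] += is_wrong
    rules = [(dict(c), errs[c] / n) for c, n in counts.items()
             if n / len(rows) >= min_support
             and errs[c] / n >= min_lift * base_err]
    return sorted(rules, key=lambda r: -r[1])

rows = [{"region": "EU", "device": "mobile"},
        {"region": "US", "device": "web"}] * 50
wrong = [1, 0] * 50                            # model errors (placeholder)
print(mine_inaccurate_rules(rows, wrong)[:3])
```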
Deep reinforcement learning uncovers processes for separating azeotropic mixtures without prior knowledge
results: the paper demonstrates a flowsheet design agent that learns without prior knowledge and generalizes across multiple chemical systems, separating more than 99% of the involved materials into pure components, which highlights the agent's planning flexibility and reliability.Abstract
Process synthesis in chemical engineering is a complex planning problem due to vast search spaces, continuous parameters and the need for generalization. Deep reinforcement learning agents, trained without prior knowledge, have shown to outperform humans in various complex planning problems in recent years. Existing work on reinforcement learning for flowsheet synthesis shows promising concepts, but focuses on narrow problems in a single chemical system, limiting its practicality. We present a general deep reinforcement learning approach for flowsheet synthesis. We demonstrate the adaptability of a single agent to the general task of separating binary azeotropic mixtures. Without prior knowledge, it learns to craft near-optimal flowsheets for multiple chemical systems, considering different feed compositions and conceptual approaches. On average, the agent can separate more than 99% of the involved materials into pure components, while autonomously learning fundamental process engineering paradigms. This highlights the agent's planning flexibility, an encouraging step toward true generality.
Adversarial Robustness in Graph Neural Networks: A Hamiltonian Approach
paper_authors: Kai Zhao, Qiyu Kang, Yang Song, Rui She, Sijie Wang, Wee Peng Tay
For: This work studies the adversarial robustness of graph neural networks (GNNs) derived from diverse neural flows, in particular their connection to stability notions such as BIBO stability, Lyapunov stability, structural stability, and conservative stability.* Methods: The paper proposes physics-inspired conservative Hamiltonian neural flows for constructing adversarially robust GNNs, and empirically compares the robustness of different neural-flow GNNs under a variety of adversarial attacks on benchmark datasets.* Results: Experiments show that GNNs leveraging conservative Hamiltonian flows are substantially more robust to adversarial perturbations, while Lyapunov stability alone does not necessarily guarantee adversarial robustness.Abstract
Graph neural networks (GNNs) are vulnerable to adversarial perturbations, including those that affect both node features and graph topology. This paper investigates GNNs derived from diverse neural flows, concentrating on their connection to various stability notions such as BIBO stability, Lyapunov stability, structural stability, and conservative stability. We argue that Lyapunov stability, despite its common use, does not necessarily ensure adversarial robustness. Inspired by physics principles, we advocate for the use of conservative Hamiltonian neural flows to construct GNNs that are robust to adversarial attacks. The adversarial robustness of different neural flow GNNs is empirically compared on several benchmark datasets under a variety of adversarial attacks. Extensive numerical experiments demonstrate that GNNs leveraging conservative Hamiltonian flows with Lyapunov stability substantially improve robustness against adversarial perturbations. The implementation code of experiments is available at https://github.com/zknus/NeurIPS-2023-HANG-Robustness.
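The conservative-flow idea can be illustrated with a leapfrog integrator, which preserves phase-space volume and approximately conserves the Hamiltonian; the toy potential and step size below are assumptions, not the paper's HANG architecture.

```python
import torch

def leapfrog_step(q, p, grad_V, dt=0.1):
    """One volume-preserving leapfrog step for H(q, p) = |p|^2 / 2 + V(q);
    (near-)conservation of H is what limits the growth of perturbations."""
    p = p - 0.5 * dt * grad_V(q)
    q = q + dt * p
    p = p - 0.5 * dt * grad_V(q)
    return q, p

q = torch.randn(5, 8)                 # node features (placeholder)
p = torch.zeros_like(q)               # conjugate momenta
q, p = leapfrog_step(q, p, grad_V=lambda z: z)   # toy V(q) = |q|^2 / 2
```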
Harnessing Administrative Data Inventories to Create a Reliable Transnational Reference Database for Crop Type Monitoring
results: the work delivers a reference dataset named EuroCrops for crop type classification.Abstract
Leaps in machine learning techniques and their application to Earth observation challenges have unlocked unprecedented performance across the domain. While the further development of these methods was previously limited by the availability and volume of sensor data and computing resources, the lack of adequate reference data now constitutes a new bottleneck. Since creating such ground-truth information is an expensive and error-prone task, new ways must be devised to source reliable, high-quality reference data on large scales. As an example, we showcase EuroCrops, a reference dataset for crop type classification that aggregates and harmonizes administrative data surveyed in different countries with the goal of transnational interoperability.
CAST: Cluster-Aware Self-Training for Tabular Data
results: extensive empirical evaluations on up to 20 real-world datasets confirm that CAST not only performs better but is also robust across various self-training setups.Abstract
Self-training has gained attraction because of its simplicity and versatility, yet it is vulnerable to noisy pseudo-labels. Several studies have proposed successful approaches to tackle this issue, but they have diminished the advantages of self-training because they require specific modifications in self-training algorithms or model architectures. Furthermore, most of them are incompatible with gradient boosting decision trees, which dominate the tabular domain. To address this, we revisit the cluster assumption, which states that data samples that are close to each other tend to belong to the same class. Inspired by the assumption, we propose Cluster-Aware Self-Training (CAST) for tabular data. CAST is a simple and universally adaptable approach for enhancing existing self-training algorithms without significant modifications. Concretely, our method regularizes the confidence of the classifier, which represents the value of the pseudo-label, forcing the pseudo-labels in low-density regions to have lower confidence by leveraging prior knowledge for each class within the training data. Extensive empirical evaluations on up to 20 real-world datasets confirm not only the superior performance of CAST but also its robustness in various setups in self-training contexts.
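One simplified reading of the cluster assumption is to damp pseudo-label confidence where the training data is sparse; the density estimate and scaling rule below are illustrative choices, not CAST's exact regularizer.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def density_scaled_confidence(X_train, X_unlabeled, probs, bandwidth=0.5):
    """Scale classifier confidences by the local training-data density, so
    pseudo-labels in low-density regions carry lower confidence."""
    kde = KernelDensity(bandwidth=bandwidth).fit(X_train)
    dens = np.exp(kde.score_samples(X_unlabeled))
    return probs * (dens / dens.max())[:, None]

X_tr = np.random.randn(200, 4)
X_un = np.random.randn(50, 4)
probs = np.random.dirichlet(np.ones(3), size=50)   # softmax outputs (placeholder)
adjusted = density_scaled_confidence(X_tr, X_un, probs)
```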
Initialization Bias of Fourier Neural Operator: Revisiting the Edge of Chaos
results: proposes an FNO version of the He initialization scheme that mitigates the initialization bias; experiments show it enables stable training of a 32-layer FNO without additional techniques or significant performance degradation.Abstract
This paper investigates the initialization bias of the Fourier neural operator (FNO). A mean-field theory for FNO is established, analyzing the behavior of the random FNO from an ``edge of chaos'' perspective. We uncover that the forward and backward propagation behaviors exhibit characteristics unique to FNO, induced by mode truncation, while also showcasing similarities to those of densely connected networks. Building upon this observation, we also propose a FNO version of the He initialization scheme to mitigate the negative initialization bias leading to training instability. Experimental results demonstrate the effectiveness of our initialization scheme, enabling stable training of a 32-layer FNO without the need for additional techniques or significant performance degradation.
Partition-based differentially private synthetic data generation
results: experimental results show the method outperforms previous approaches, generating higher-quality synthetic data while achieving better privacy guarantees.Abstract
Private synthetic data sharing is preferred as it keeps the distribution and nuances of original data compared to summary statistics. The state-of-the-art methods adopt a select-measure-generate paradigm, but measuring large domain marginals still results in much error and allocating privacy budget iteratively is still difficult. To address these issues, our method employs a partition-based approach that effectively reduces errors and improves the quality of synthetic data, even with a limited privacy budget. Results from our experiments demonstrate the superiority of our method over existing approaches. The synthetic data produced using our approach exhibits improved quality and utility, making it a preferable choice for private synthetic data sharing.
DrugCLIP: Contrastive Protein-Molecule Representation Learning for Virtual Screening
results: experiments show that DrugCLIP performs efficiently on diverse virtual screening benchmarks, especially in the zero-shot setting, outperforming traditional docking and supervised learning methods with greatly reduced computation time.Abstract
Virtual screening, which identifies potential drugs from vast compound databases to bind with a particular protein pocket, is a critical step in AI-assisted drug discovery. Traditional docking methods are highly time-consuming, and can only work with a restricted search library in real-life applications. Recent supervised learning approaches using scoring functions for binding-affinity prediction, although promising, have not yet surpassed docking methods due to their strong dependency on limited data with reliable binding-affinity labels. In this paper, we propose a novel contrastive learning framework, DrugCLIP, by reformulating virtual screening as a dense retrieval task and employing contrastive learning to align representations of binding protein pockets and molecules from a large quantity of pairwise data without explicit binding-affinity scores. We also introduce a biological-knowledge inspired data augmentation strategy to learn better protein-molecule representations. Extensive experiments show that DrugCLIP significantly outperforms traditional docking and supervised learning methods on diverse virtual screening benchmarks with highly reduced computation time, especially in zero-shot setting.
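The contrastive alignment can be sketched as a symmetric InfoNCE loss over in-batch pocket/molecule pairs, in the spirit of CLIP; embedding sizes and the temperature below are illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce(pocket_emb, mol_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched pocket/molecule pairs:
    each pocket should retrieve its own molecule among the in-batch ones."""
    p = F.normalize(pocket_emb, dim=-1)
    m = F.normalize(mol_emb, dim=-1)
    logits = p @ m.T / temperature          # (B, B) similarity matrix
    labels = torch.arange(len(p))
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.T, labels))

loss = info_nce(torch.randn(16, 256), torch.randn(16, 256))
```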
Core-Intermediate-Peripheral Index: Factor Analysis of Neighborhood and Shortest Paths-based Centrality Metrics
methods: the study applies factor analysis with varimax-based rotation of the Eigenvectors to the transpose of the centrality-metrics data matrix, under the hypothesis that two factors (core and peripheral) drive the centrality values incurred by the nodes.
results: the approach is tested on a diverse suite of 12 complex real-world networks, where the CIP index captures the extent to which a node acts as a core versus a peripheral node.Abstract
We perform factor analysis on the raw data of the four major neighborhood and shortest paths-based centrality metrics (Degree, Eigenvector, Betweeenness and Closeness) and propose a novel quantitative measure called the Core-Intermediate-Peripheral (CIP) Index to capture the extent with which a node could play the role of a core node (nodes at the center of a network with larger values for any centrality metric) vis-a-vis a peripheral node (nodes that exist at the periphery of a network with lower values for any centrality metric). We conduct factor analysis (varimax-based rotation of the Eigenvectors) on the transpose matrix of the raw centrality metrics dataset, with the node ids as features, under the hypothesis that there are two factors (core and peripheral) that drive the values incurred by the nodes with respect to the centrality metrics. We test our approach on a diverse suite of 12 complex real-world networks.
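A minimal sketch of the two-factor analysis with varimax rotation follows, using scikit-learn on a placeholder metrics-by-nodes matrix; the final score shown is an illustrative contrast of loadings, not necessarily the paper's exact CIP formula.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Rows are the four centrality metrics, columns are node ids (the transposed
# layout described above); random values stand in for real measurements.
X = np.random.rand(4, 100)

fa = FactorAnalysis(n_components=2, rotation="varimax").fit(X)
loadings = fa.components_        # (2, 100): per-node core / peripheral loadings
cip = loadings[0] - loadings[1]  # one illustrative CIP-style contrast score
```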
Boosting Continuous Control with Consistency Policy
results: experiments show that CPQL achieves policy improvement with accurate guidance, setting new state-of-the-art results on 11 offline and 21 online tasks and improving inference speed by nearly 45 times compared with Diffusion-QL.Abstract
Due to its training stability and strong expression, the diffusion model has attracted considerable attention in offline reinforcement learning. However, several challenges have also come with it: 1) The demand for a large number of diffusion steps makes the diffusion-model-based methods time inefficient and limits their applications in real-time control; 2) How to achieve policy improvement with accurate guidance for diffusion model-based policy is still an open problem. Inspired by the consistency model, we propose a novel time-efficiency method named Consistency Policy with Q-Learning (CPQL), which derives action from noise by a single step. By establishing a mapping from the reverse diffusion trajectories to the desired policy, we simultaneously address the issues of time efficiency and inaccurate guidance when updating diffusion model-based policy with the learned Q-function. We demonstrate that CPQL can achieve policy improvement with accurate guidance for offline reinforcement learning, and can be seamlessly extended for online RL tasks. Experimental results indicate that CPQL achieves new state-of-the-art performance on 11 offline and 21 online tasks, significantly improving inference speed by nearly 45 times compared to Diffusion-QL. We will release our code later.
Federated Learning with Reduced Information Leakage and Computation
paper_authors: Tongxin Yin, Xueru Zhang, Mohammad Mahdi Khalili, Mingyan Liu
for: This paper proposes a privacy-preserving federated learning framework that allows multiple decentralized clients to collaboratively learn a common model without directly disclosing local data.
methods: The paper applies a first-order approximation at every even iteration, so that half of the FL updates incur no information leakage and require much less computation.
results: Experiments show that Upcycled-FL consistently outperforms existing methods over heterogeneous data, significantly improving the privacy-accuracy trade-off while reducing training time by 48% on average.Abstract
Federated learning (FL) is a distributed learning paradigm that allows multiple decentralized clients to collaboratively learn a common model without sharing local data. Although local data is not exposed directly, privacy concerns nonetheless exist as clients' sensitive information can be inferred from intermediate computations. Moreover, such information leakage accumulates substantially over time as the same data is repeatedly used during the iterative learning process. As a result, it can be particularly difficult to balance the privacy-accuracy trade-off when designing privacy-preserving FL algorithms. In this paper, we introduce Upcycled-FL, a novel federated learning framework with first-order approximation applied at every even iteration. Under this framework, half of the FL updates incur no information leakage and require much less computation. We first conduct the theoretical analysis on the convergence (rate) of Upcycled-FL, and then apply perturbation mechanisms to preserve privacy. Experiments on real-world data show that Upcycled-FL consistently outperforms existing methods over heterogeneous data, and significantly improves privacy-accuracy trade-off while reducing 48% of the training time on average.
Automatic nodule identification and differentiation in ultrasound videos to facilitate per-nodule examination
paper_authors: Siyuan Jiang, Yan Ding, Yuling Wang, Lei Xu, Wenli Dai, Wanru Chang, Jianfeng Zhang, Jie Yu, Jianqiao Zhou, Chunquan Zhang, Ping Liang, Dexing Kong
for: This paper aims to address the problem of identifying and differentiating nodules in breast ultrasound videos, which is a challenging task due to the heterogeneous appearances of nodules in different cross-sectional views.
methods: The authors collected hundreds of breast ultrasound videos and built a nodule reidentification system that consists of two parts: an extractor based on a deep learning model and a real-time clustering algorithm.
results: The system obtained satisfactory results and was able to differentiate ultrasound videos. This is the first attempt to apply re-identification technique in the ultrasonic field.Abstract
Ultrasound is a vital diagnostic technique in health screening, with the advantages of being non-invasive, cost-effective, and radiation-free, and is therefore widely applied in the diagnosis of nodules. However, it relies heavily on the expertise and clinical experience of the sonographer. In ultrasound images, a single nodule might present heterogeneous appearances in different cross-sectional views, which makes it hard to perform per-nodule examination. Sonographers usually discriminate different nodules by examining the nodule features and the surrounding structures like gland and duct, which is cumbersome and time-consuming. To address this problem, we collected hundreds of breast ultrasound videos and built a nodule reidentification system that consists of two parts: an extractor based on a deep learning model that extracts feature vectors from the input video clips, and a real-time clustering algorithm that automatically groups feature vectors by nodules. The system obtains satisfactory results and exhibits the capability to differentiate ultrasound videos. As far as we know, this is the first attempt to apply the re-identification technique in the ultrasonic field.
Learning bounded-degree polytrees with known skeleton
paper_authors: Davin Choo, Joy Qiping Yang, Arnab Bhattacharyya, Clément L. Canonne
for: efficient proper learning of bounded-degree polytrees
methods: polynomial-time algorithm and information-theoretic sample complexity lower bound
results: finite-sample guarantees for learning $d$-polytrees in polynomial time and sample complexity for any bounded $d$ when the underlying undirected graph (skeleton) is known.Abstract
We establish finite-sample guarantees for efficient proper learning of bounded-degree polytrees, a rich class of high-dimensional probability distributions and a subclass of Bayesian networks, a widely-studied type of graphical model. Recently, Bhattacharyya et al. (2021) obtained finite-sample guarantees for recovering tree-structured Bayesian networks, i.e., 1-polytrees. We extend their results by providing an efficient algorithm which learns $d$-polytrees in polynomial time and sample complexity for any bounded $d$ when the underlying undirected graph (skeleton) is known. We complement our algorithm with an information-theoretic sample complexity lower bound, showing that the dependence on the dimension and target accuracy parameters are nearly tight.
Exploit the antenna response consistency to define the alignment criteria for CSI data
results: experiments demonstrate the effectiveness of ARC in improving the performance of self-supervised learning for WiFi-based HAR.Abstract
Self-supervised learning (SSL) for WiFi-based human activity recognition (HAR) holds great promise due to its ability to address the challenge of insufficient labeled data. However, directly transplanting SSL algorithms, especially contrastive learning, originally designed for other domains to CSI data, often fails to achieve the expected performance. We attribute this issue to the inappropriate alignment criteria, which disrupt the semantic distance consistency between the feature space and the input space. To address this challenge, we introduce \textbf{A}ntenna \textbf{R}esponse \textbf{C}onsistency (ARC) as a solution to define proper alignment criteria. ARC is designed to retain semantic information from the input space while introducing robustness to real-world noise. We analyze ARC from the perspective of CSI data structure, demonstrating that its optimal solution leads to a direct mapping from input CSI data to action vectors in the feature map. Furthermore, we provide extensive experimental evidence to validate the effectiveness of ARC in improving the performance of self-supervised learning for WiFi-based HAR.
Transfer learning-based physics-informed convolutional neural network for simulating flow in porous media with time-varying controls
results: the model accurately predicts oil pressure and water saturation at each timestep and trains quickly with transfer learning; its computational efficiency and accuracy are demonstrated across varying reservoir gridblocks and compare favorably against corresponding numerical approaches.Abstract
A physics-informed convolutional neural network is proposed to simulate two phase flow in porous media with time-varying well controls. While most of PICNNs in existing literatures worked on parameter-to-state mapping, our proposed network parameterizes the solution with time-varying controls to establish a control-to-state regression. Firstly, finite volume scheme is adopted to discretize flow equations and formulate loss function that respects mass conservation laws. Neumann boundary conditions are seamlessly incorporated into the semi-discretized equations so no additional loss term is needed. The network architecture comprises two parallel U-Net structures, with network inputs being well controls and outputs being the system states. To capture the time-dependent relationship between inputs and outputs, the network is well designed to mimic discretized state space equations. We train the network progressively for every timestep, enabling it to simultaneously predict oil pressure and water saturation at each timestep. After training the network for one timestep, we leverage transfer learning techniques to expedite the training process for subsequent timestep. The proposed model is used to simulate oil-water porous flow scenarios with varying reservoir gridblocks and aspects including computation efficiency and accuracy are compared against corresponding numerical approaches. The results underscore the potential of PICNN in effectively simulating systems with numerous grid blocks, as computation time does not scale with model dimensionality. We assess the temporal error using 10 different testing controls with variation in magnitude and another 10 with higher alternation frequency with proposed control-to-state architecture. Our observations suggest the need for a more robust and reliable model when dealing with controls that exhibit significant variations in magnitude or frequency.
Discovering Mixtures of Structural Causal Models from Time Series Data
results: Extensive experiments on both synthetic and real-world datasets show that the method excels at causal discovery, particularly when the data originate from diverse causal graphs. Additionally, the paper proves the identifiability of the model under some mild assumptions.
Abstract
In fields such as finance, climate science, and neuroscience, inferring causal relationships from time series data poses a formidable challenge. While contemporary techniques can handle nonlinear relationships between variables and flexible noise distributions, they rely on the simplifying assumption that data originates from the same underlying causal model. In this work, we relax this assumption and perform causal discovery from time series data originating from mixtures of different causal models. We infer both the underlying structural causal models and the posterior probability for each sample belonging to a specific mixture component. Our approach employs an end-to-end training process that maximizes an evidence-lower bound for data likelihood. Through extensive experimentation on both synthetic and real-world datasets, we demonstrate that our method surpasses state-of-the-art benchmarks in causal discovery tasks, particularly when the data emanates from diverse underlying causal graphs. Theoretically, we prove the identifiability of such a model under some mild assumptions.
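To make the mixture idea concrete, a small sketch of the two quantities at the heart of such a model: the posterior probability that a sample belongs to each mixture component, and the (negative) evidence bound being maximized. The per-component log-likelihoods would come from the learned structural causal models; here they are random placeholders, and the paper's full end-to-end training is not reproduced.

```python
import torch

def mixture_responsibilities(log_lik, log_pi):
    """Posterior p(component k | sample i) from per-component log-likelihoods.
    log_lik: (n_samples, K) values of log p(x_i | SCM_k); log_pi: (K,) log priors."""
    return torch.softmax(log_lik + log_pi, dim=1)

def neg_evidence(log_lik, log_pi):
    """Negative mixture log-likelihood; the ELBO attains this value when the
    variational posterior equals the true posterior over components."""
    return -torch.logsumexp(log_lik + log_pi, dim=1).mean()

# toy: 5 samples scored under 2 candidate causal models
log_lik = torch.randn(5, 2)
log_pi = torch.log(torch.tensor([0.5, 0.5]))
print(mixture_responsibilities(log_lik, log_pi))
print(neg_evidence(log_lik, log_pi))
```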
Ensemble Active Learning by Contextual Bandits for AI Incubation in Manufacturing
methods: An ensemble active learning method is proposed that uses contextual bandits to actively select samples for annotation, maintaining the exploration-exploitation balance and improving AI model performance.
results: Experiments show that the proposed method reduces annotation effort while maintaining data quality, thereby improving AI model performance.
Abstract
It is challenging but important to save annotation effort in streaming data acquisition while maintaining data quality for supervised base learners. We propose an ensemble active learning method that actively acquires samples for annotation via contextual bandits, which enforces the exploration-exploitation balance and leads to improved AI modeling performance.
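A sketch of one standard contextual-bandit algorithm (LinUCB) applied to the annotation decision, to illustrate the exploration-exploitation balance the abstract refers to. The arm/reward semantics (arms as base learners or query actions, reward as downstream accuracy gain) are assumptions, not the paper's exact ensemble design.

```python
import numpy as np

class LinUCB:
    """LinUCB contextual bandit: score = exploitation term + exploration bonus."""
    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]     # per-arm Gram matrices
        self.b = [np.zeros(dim) for _ in range(n_arms)]

    def select(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                             # ridge-regression estimate
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

bandit = LinUCB(n_arms=3, dim=5)
x = np.random.randn(5)                 # context of an incoming streaming sample
arm = bandit.select(x)                 # which learner/query action to use
bandit.update(arm, x, reward=1.0)      # e.g. reward = model improvement after labeling
```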
Gem5Pred: Predictive Approaches For Gem5 Simulation Time
results: The best regression model achieves a Mean Absolute Error (MAE) of 0.546, and the most accurate classification model reaches an Accuracy of 0.696. These models provide a foundation for future research and serve as benchmarks against which subsequent models can be compared.
Abstract
Gem5, an open-source, flexible, and cost-effective simulator, is widely recognized and utilized in both academia and industry for hardware simulation. However, the typically time-consuming nature of simulating programs on Gem5 underscores the need for a predictive model that can estimate simulation time. As of now, no such dataset or model exists. In response to this gap, this paper makes a novel contribution by introducing a unique dataset specifically created for this purpose. We also analyzed the effects of different instruction types on simulation time in Gem5. We then employ three distinct models leveraging CodeBERT to execute the prediction task based on the developed dataset. Our best regression model achieves a Mean Absolute Error (MAE) of 0.546, while our top-performing classification model records an Accuracy of 0.696. Our models establish a foundation for future investigations on this topic, serving as benchmarks against which subsequent models can be compared. We hope that our contribution can stimulate further research in this field. The dataset we used is available at https://github.com/XueyangLiOSU/Gem5Pred.
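A sketch of what a CodeBERT-based regression model for simulation time could look like, using the public microsoft/codebert-base checkpoint. The pooling choice, head size, and the idea of regressing (log) simulation time are assumptions about the general recipe, not the paper's actual architecture.

```python
import torch
from transformers import AutoTokenizer, AutoModel

class Gem5TimeRegressor(torch.nn.Module):
    """CodeBERT encoder with a linear head predicting simulation time."""
    def __init__(self):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("microsoft/codebert-base")
        self.head = torch.nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]        # first-token pooled representation
        return self.head(cls).squeeze(-1)

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = Gem5TimeRegressor()
batch = tokenizer(["int main() { return 0; }"], return_tensors="pt",
                  truncation=True, padding=True)
pred = model(batch["input_ids"], batch["attention_mask"])  # train with MAE/MSE loss
```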
Better and Simpler Lower Bounds for Differentially Private Statistical Estimation
results: The two analyses yield the following conclusions:
+ For covariance estimation, $\tilde{\Omega}\left(\frac{d^{3/2}}{\alpha \varepsilon} + \frac{d}{\alpha^2}\right)$ samples are needed; this improves on prior work and is also simpler.
+ For mean estimation of heavy-tailed distributions with bounded $k$th moments, $\tilde{\Omega}\left(\frac{d}{\alpha^{k/(k-1)} \varepsilon} + \frac{d}{\alpha^2}\right)$ samples are needed, matching known upper bounds and improving over the best previous lower bounds, which held only for pure differential privacy or for $k = 2$.
Abstract
We provide improved lower bounds for two well-known high-dimensional private estimation tasks. First, we prove that for estimating the covariance of a Gaussian up to spectral error $\alpha$ with approximate differential privacy, one needs $\tilde{\Omega}\left(\frac{d^{3/2}}{\alpha \varepsilon} + \frac{d}{\alpha^2}\right)$ samples for any $\alpha \le O(1)$, which is tight up to logarithmic factors. This improves over previous work which established this for $\alpha \le O\left(\frac{1}{\sqrt{d}}\right)$, and is also simpler than previous work. Next, we prove that for estimating the mean of a heavy-tailed distribution with bounded $k$th moments with approximate differential privacy, one needs $\tilde{\Omega}\left(\frac{d}{\alpha^{k/(k-1)} \varepsilon} + \frac{d}{\alpha^2}\right)$ samples. This matches known upper bounds and improves over the best known lower bound for this problem, which only holds for pure differential privacy, or when $k = 2$. Our techniques follow the method of fingerprinting and are generally quite simple. Our lower bound for heavy-tailed estimation is based on a black-box reduction from privately estimating identity-covariance Gaussians. Our lower bound for covariance estimation utilizes a Bayesian approach to show that, under an Inverse Wishart prior distribution for the covariance matrix, no private estimator can be accurate even in expectation, without sufficiently many samples.
Bi-Level Offline Policy Optimization with Limited Exploration
results: Evaluations on a blend of synthetic, benchmark, and real-world datasets show that the model performs competitively with state-of-the-art methods.
Abstract
We study offline reinforcement learning (RL) which seeks to learn a good policy based on a fixed, pre-collected dataset. A fundamental challenge behind this task is the distributional shift due to the dataset lacking sufficient exploration, especially under function approximation. To tackle this issue, we propose a bi-level structured policy optimization algorithm that models a hierarchical interaction between the policy (upper-level) and the value function (lower-level). The lower level focuses on constructing a confidence set of value estimates that maintain sufficiently small weighted average Bellman errors, while controlling uncertainty arising from distribution mismatch. Subsequently, at the upper level, the policy aims to maximize a conservative value estimate from the confidence set formed at the lower level. This novel formulation preserves the maximum flexibility of the implicitly induced exploratory data distribution, enabling the power of model extrapolation. In practice, it can be solved through a computationally efficient, penalized adversarial estimation procedure. Our theoretical regret guarantees do not rely on any data-coverage and completeness-type assumptions, only requiring realizability. These guarantees also demonstrate that the learned policy represents the "best effort" among all policies, as no other policies can outperform it. We evaluate our model using a blend of synthetic, benchmark, and real-world datasets for offline RL, showing that it performs competitively with state-of-the-art methods.
A Unified View on Solving Objective Mismatch in Model-Based Reinforcement Learning
results: The survey finds that existing MBRL methods commonly suffer from the "objective mismatch" problem: the predictive accuracy of the learned dynamics model is often not correlated with the quality of the resulting policy. It organizes the related solution categories, including:
+ adapting and updating the environment model;
+ using different policy optimization methods;
+ using different evaluation criteria.
Abstract
Model-based Reinforcement Learning (MBRL) aims to make agents more sample-efficient, adaptive, and explainable by learning an explicit model of the environment. While the capabilities of MBRL agents have significantly improved in recent years, how to best learn the model is still an unresolved question. The majority of MBRL algorithms aim at training the model to make accurate predictions about the environment and subsequently using the model to determine the most rewarding actions. However, recent research has shown that model predictive accuracy is often not correlated with action quality, tracing the root cause to the \emph{objective mismatch} between accurate dynamics model learning and policy optimization of rewards. A number of interrelated solution categories to the objective mismatch problem have emerged as MBRL continues to mature as a research area. In this work, we provide an in-depth survey of these solution categories and propose a taxonomy to foster future research.
results: The paper shows how applying layers of transformations yields a set of attributes (or features) to which probabilistic statistical methods can then be applied for prediction. This achieves both scalable prediction rules and uncertainty quantification.
Abstract
Our goal is to provide a review of deep learning methods which provide insight into structured high-dimensional data. Rather than using shallow additive architectures common to most statistical models, deep learning uses layers of semi-affine input transformations to provide a predictive rule. Applying these layers of transformations leads to a set of attributes (or, features) to which probabilistic statistical methods can be applied. Thus, the best of both worlds can be achieved: scalable prediction rules fortified with uncertainty quantification, where sparse regularization finds the features.
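A tiny sketch of the "layers of semi-affine input transformations" view described above: each layer applies a nonlinearity to an affine map, and the final activations are the features fed to downstream probabilistic methods. The dimensions and nonlinearity are arbitrary illustrations.

```python
import numpy as np

def semi_affine_layer(x, W, b, f=np.tanh):
    """One semi-affine transformation: a nonlinearity applied to an affine map."""
    return f(W @ x + b)

# a depth-3 predictive rule: z = f(W3 f(W2 f(W1 x + b1) + b2) + b3)
rng = np.random.default_rng(0)
x = rng.normal(size=8)
for d_in, d_out in [(8, 16), (16, 16), (16, 4)]:
    W, b = rng.normal(size=(d_out, d_in)), rng.normal(size=d_out)
    x = semi_affine_layer(x, W, b)
print(x)   # the extracted features, to which probabilistic methods can be applied
```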
Sample-Efficient Multi-Agent RL: An Optimization Perspective
results: The algorithm achieves sample-efficient learning of Nash equilibria, coarse correlated equilibria, and correlated equilibria, with sublinear regret comparable to existing algorithms. Moreover, it avoids solving constrained optimization problems within data-dependent constraints, making it more amenable to practical implementation.
Abstract
We study multi-agent reinforcement learning (MARL) for the general-sum Markov Games (MGs) under the general function approximation. In order to find the minimum assumption for sample-efficient learning, we introduce a novel complexity measure called the Multi-Agent Decoupling Coefficient (MADC) for general-sum MGs. Using this measure, we propose the first unified algorithmic framework that ensures sample efficiency in learning Nash Equilibrium, Coarse Correlated Equilibrium, and Correlated Equilibrium for both model-based and model-free MARL problems with low MADC. We also show that our algorithm provides comparable sublinear regret to the existing works. Moreover, our algorithm combines an equilibrium-solving oracle with a single objective optimization subprocedure that solves for the regularized payoff of each deterministic joint policy, which avoids solving constrained optimization problems within data-dependent constraints (Jin et al. 2020; Wang et al. 2023) or executing sampling procedures with complex multi-objective optimization problems (Foster et al. 2023), thus being more amenable to empirical implementation.
A Bayesian framework for discovering interpretable Lagrangian of dynamical systems from data
results: The feasibility of the proposed approach is demonstrated on six different examples covering both discrete and continuous systems.
Abstract
Learning and predicting the dynamics of physical systems requires a profound understanding of the underlying physical laws. Recent works on learning physical laws involve generalizing the equation discovery frameworks to the discovery of Hamiltonian and Lagrangian of physical systems. While the existing methods parameterize the Lagrangian using neural networks, we propose an alternate framework for learning interpretable Lagrangian descriptions of physical systems from limited data using the sparse Bayesian approach. Unlike existing neural network-based approaches, the proposed approach (a) yields an interpretable description of Lagrangian, (b) exploits Bayesian learning to quantify the epistemic uncertainty due to limited data, (c) automates the distillation of Hamiltonian from the learned Lagrangian using Legendre transformation, and (d) provides ordinary (ODE) and partial differential equation (PDE) based descriptions of the observed systems. Six different examples involving both discrete and continuous system illustrates the efficacy of the proposed approach.
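To illustrate the discovery step, a sketch of selecting a sparse, interpretable expression from a library of candidate terms. Sequentially thresholded least squares is used here only as a simple sparsity-promoting stand-in; the paper's sparse Bayesian approach additionally quantifies epistemic uncertainty over the discovered terms, which this sketch omits. The toy target mimics a harmonic-oscillator Lagrangian.

```python
import numpy as np

def stls(Theta, y, threshold=0.1, iters=10):
    """Sequentially thresholded least squares over a candidate-term library."""
    xi = np.linalg.lstsq(Theta, y, rcond=None)[0]
    for _ in range(iters):
        small = np.abs(xi) < threshold
        xi[small] = 0.0
        big = ~small
        if big.any():                            # refit on the surviving terms
            xi[big] = np.linalg.lstsq(Theta[:, big], y, rcond=None)[0]
    return xi

# toy: recover 0.5*qdot^2 - 0.5*q^2 from a library of candidate terms
rng = np.random.default_rng(1)
q, qdot = rng.normal(size=200), rng.normal(size=200)
library = np.column_stack([q, qdot, q**2, qdot**2, q * qdot])
y = 0.5 * qdot**2 - 0.5 * q**2 + 0.01 * rng.normal(size=200)
print(stls(library, y))   # nonzero weights select the interpretable terms
```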
paper_authors: Tatsuki Koga, Kamalika Chaudhuri, David Page
for: This paper aims to provide a federated analytics approach for estimating the average treatment effect (ATE) in healthcare applications while ensuring differential privacy (DP) guarantees at each site.
methods: The proposed method uses a class of per-site estimation algorithms that report the ATE estimate and its variance as a quality measure, and an aggregation algorithm on the server side that minimizes the overall variance of the final ATE estimate.
results: The authors' experiments on real and synthetic data show that their method reliably aggregates private statistics across sites and provides a better privacy-utility tradeoff under site heterogeneity than baselines.
Patient privacy is a major barrier to healthcare AI. For confidentiality reasons, most patient data remains in silo in separate hospitals, preventing the design of data-driven healthcare AI systems that need large volumes of patient data to make effective decisions. A solution to this is collective learning across multiple sites through federated learning with differential privacy. However, literature in this space typically focuses on differentially private statistical estimation and machine learning, which is different from the causal inference-related problems that arise in healthcare. In this work, we take a fresh look at federated learning with a focus on causal inference; specifically, we look at estimating the average treatment effect (ATE), an important task in causal inference for healthcare applications, and provide a federated analytics approach to enable ATE estimation across multiple sites along with differential privacy (DP) guarantees at each site. The main challenge comes from site heterogeneity -- different sites have different sample sizes and privacy budgets. We address this through a class of per-site estimation algorithms that reports the ATE estimate and its variance as a quality measure, and an aggregation algorithm on the server side that minimizes the overall variance of the final ATE estimate. Our experiments on real and synthetic data show that our method reliably aggregates private statistics across sites and provides better privacy-utility tradeoff under site heterogeneity than baselines.
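A minimal sketch of the two-stage recipe the abstract describes: each site releases a noisy ATE plus a variance as its quality measure, and the server combines sites by inverse-variance weighting, which minimizes the variance of the combined estimate. The clipping and Laplace calibration below are simplified assumptions, not the paper's exact mechanisms.

```python
import numpy as np

def dp_site_estimate(y_treat, y_ctrl, epsilon, clip=1.0):
    """Per-site ATE (difference of clipped means) released with Laplace noise.
    Sensitivity/noise calibration here is a simplified assumption."""
    ate = np.clip(y_treat, -clip, clip).mean() - np.clip(y_ctrl, -clip, clip).mean()
    n = min(len(y_treat), len(y_ctrl))
    scale = 2 * clip / (n * epsilon)               # Laplace mechanism scale
    noisy_ate = ate + np.random.laplace(0.0, scale)
    var = (y_treat.var() / len(y_treat) + y_ctrl.var() / len(y_ctrl)
           + 2 * scale**2)                         # sampling + privacy noise variance
    return noisy_ate, var

def aggregate(estimates, variances):
    """Inverse-variance weighting minimizes the variance of the combined ATE."""
    w = 1.0 / np.asarray(variances)
    return float(np.sum(w * np.asarray(estimates)) / np.sum(w))

# heterogeneous sites: different sample sizes and privacy budgets
sites = [dp_site_estimate(np.random.randn(n), np.random.randn(n), eps)
         for n, eps in [(500, 1.0), (2000, 0.5), (100, 2.0)]]
print(aggregate(*zip(*sites)))
```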
Low-Rank Tensor Completion via Novel Sparsity-Inducing Regularizers
paper_authors: Zhi-Yong Wang, Hing Cheung So, Abdelhak M. Zoubir
for: Inducing sparsity in the low-rank tensor completion problem while avoiding the bias of the l1-norm.
methods: Nonconvex surrogates/regularizers with closed-form thresholding functions, together with efficient algorithms based on the alternating direction method of multipliers.
results: Experiments show that the proposed methods outperform prior approaches in restoration performance on synthetic and real-world data.
Abstract
To alleviate the bias generated by the l1-norm in the low-rank tensor completion problem, nonconvex surrogates/regularizers have been suggested to replace the tensor nuclear norm, although both can achieve sparsity. However, the thresholding functions of these nonconvex regularizers may not have closed-form expressions and thus iterations are needed, which increases the computational loads. To solve this issue, we devise a framework to generate sparsity-inducing regularizers with closed-form thresholding functions. These regularizers are applied to low-tubal-rank tensor completion, and efficient algorithms based on the alternating direction method of multipliers are developed. Furthermore, convergence of our methods is analyzed and it is proved that the generated sequences are bounded and any limit point is a stationary point. Experimental results using synthetic and real-world datasets show that the proposed algorithms outperform the state-of-the-art methods in terms of restoration performance.
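To make the role of closed-form thresholding concrete, two classical examples of such operators: soft thresholding (the proximal operator of the l1 norm, which introduces the bias the paper targets) and hard thresholding (the closed-form rule associated with l0-type penalties, which avoids shrinkage bias). The paper designs other nonconvex regularizers whose thresholding is likewise available in closed form; those specific regularizers are not reproduced here.

```python
import numpy as np

def soft_threshold(x, lam):
    """Closed-form prox of the l1 norm: shrinks every surviving entry by lam."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def hard_threshold(x, lam):
    """Closed-form l0-type thresholding: keeps large entries unshrunk."""
    return np.where(np.abs(x) > lam, x, 0.0)

x = np.array([-2.0, -0.3, 0.1, 0.8, 3.0])
print(soft_threshold(x, 0.5))   # [-1.5  0.   0.   0.3  2.5] -- biased toward zero
print(hard_threshold(x, 0.5))   # [-2.   0.   0.   0.8  3. ] -- large entries intact
```

Having a closed-form operator like these inside each ADMM iteration is what removes the inner iterative loop (and its computational load) that generic nonconvex regularizers would require.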
Exploring adversarial attacks in federated learning for medical imaging
results: The tests show that domain-specific configurations can significantly increase the attacker's success rate. The findings emphasize the urgent need for effective defense mechanisms and suggest a critical re-evaluation of current security protocols in federated medical image analysis systems.
Abstract
Federated learning offers a privacy-preserving framework for medical image analysis but exposes the system to adversarial attacks. This paper aims to evaluate the vulnerabilities of federated learning networks in medical image analysis against such attacks. Employing domain-specific MRI tumor and pathology imaging datasets, we assess the effectiveness of known threat scenarios in a federated learning environment. Our tests reveal that domain-specific configurations can increase the attacker's success rate significantly. The findings emphasize the urgent need for effective defense mechanisms and suggest a critical re-evaluation of current security protocols in federated medical image analysis systems.
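For readers unfamiliar with the threat model, a sketch of one canonical adversarial attack (the Fast Gradient Sign Method) of the general kind such vulnerability studies evaluate; the paper's exact threat scenarios may differ (e.g., poisoning during federated training). The toy model and data are placeholders.

```python
import torch

def fgsm_attack(model, images, labels, eps=0.03):
    """FGSM: perturb inputs one step along the sign of the loss gradient."""
    images = images.clone().detach().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(images), labels)
    loss.backward()
    adv = images + eps * images.grad.sign()   # small worst-case perturbation
    return adv.clamp(0, 1).detach()

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))
x, y = torch.rand(4, 1, 28, 28), torch.randint(0, 10, (4,))
x_adv = fgsm_attack(model, x, y)              # adversarial versions of the images
```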
Detecting and Learning Out-of-Distribution Data in the Open world: Algorithm and Theory
results: The thesis contributes a series of algorithms and theoretical foundations for building machine learning models that are both performant and reliable in the open world.
Abstract
This thesis makes considerable contributions to the realm of machine learning, specifically in the context of open-world scenarios where systems face previously unseen data and contexts. Traditional machine learning models are usually trained and tested within a fixed and known set of classes, a condition known as the closed-world setting. While this assumption works in controlled environments, it falls short in real-world applications where new classes or categories of data can emerge dynamically and unexpectedly. To address this, our research investigates two intertwined steps essential for open-world machine learning: Out-of-distribution (OOD) Detection and Open-world Representation Learning (ORL). OOD detection focuses on identifying instances from unknown classes that fall outside the model's training distribution. This process reduces the risk of making overly confident, erroneous predictions about unfamiliar inputs. Moving beyond OOD detection, ORL extends the capabilities of the model to not only detect unknown instances but also learn from and incorporate knowledge about these new classes. By delving into these research problems of open-world learning, this thesis contributes both algorithmic solutions and theoretical foundations, which pave the way for building machine learning models that are not only performant but also reliable in the face of the evolving complexities of the real world.
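As a concrete illustration of the OOD-detection step, two widely used post-hoc scoring rules (maximum softmax probability and the energy score). These are standard baselines from the literature, shown to ground the concept; they are not claimed to be the thesis's specific algorithms.

```python
import torch

def msp_score(logits):
    """Maximum softmax probability: low values suggest out-of-distribution input."""
    return torch.softmax(logits, dim=1).max(dim=1).values

def energy_score(logits):
    """Energy score -logsumexp(logits): higher values suggest OOD."""
    return -torch.logsumexp(logits, dim=1)

logits = torch.randn(4, 10)            # classifier outputs for 4 inputs
is_ood = energy_score(logits) > 0.0    # threshold tuned on validation data
```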
Federated Multi-Level Optimization over Decentralized Networks
results: The algorithm achieves state-of-the-art performance on various applications, including hyper-parameter tuning, decentralized reinforcement learning, and risk-averse optimization, with optimal sample complexity scaling linearly with the network size.
Abstract
Multi-level optimization has gained increasing attention in recent years, as it provides a powerful framework for solving complex optimization problems that arise in many fields, such as meta-learning, multi-player games, reinforcement learning, and nested composition optimization. In this paper, we study the problem of distributed multi-level optimization over a network, where agents can only communicate with their immediate neighbors. This setting is motivated by the need for distributed optimization in large-scale systems, where centralized optimization may not be practical or feasible. To address this problem, we propose a novel gossip-based distributed multi-level optimization algorithm that enables networked agents to solve optimization problems at different levels in a single timescale and share information through network propagation. Our algorithm achieves optimal sample complexity, scaling linearly with the network size, and demonstrates state-of-the-art performance on various applications, including hyper-parameter tuning, decentralized reinforcement learning, and risk-averse optimization.
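A minimal sketch of the basic gossip update that underlies such decentralized algorithms: each agent averages with its immediate neighbors through a doubly-stochastic mixing matrix, then takes a local gradient step. This shows only the communication pattern; the paper's multi-level structure and single-timescale coupling are not reproduced.

```python
import numpy as np

def gossip_step(X, W, grads, lr=0.1):
    """One decentralized round.
    X:     (n_agents, dim) current iterates
    W:     (n_agents, n_agents) doubly-stochastic mixing matrix whose sparsity
           pattern matches the communication graph (neighbors only)
    grads: (n_agents, dim) local (stochastic) gradients"""
    return W @ X - lr * grads

# ring of 4 agents, each averaging with its two neighbors
W = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])
X = np.random.randn(4, 3)
X = gossip_step(X, W, grads=2 * X)   # gradients of f_i(x) = ||x||^2
```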
paper_authors: Thomas H. M. Roos, Edwin Versteeg, Dennis W. J. Klomp, Jeroen C. W. Siero, Jannie P. Wijnen
for: This work aims to provide researchers with an open-source platform for developing and sharing new MRI sequences.
methods: A Pulseq interpreter for Philips MRI systems was created by modifying a few source code files. Validation experiments used simulations and phantom scans on a 7T Achieva MRI system.
results: The reconstructed images obtained through the Pulseq implementation are comparable to those of the native implementation, and profiling of the MRI spectrometer's resource utilization shows minimal overhead for certain sequences.
Abstract
Purpose: This work aims to address the limitations faced by researchers in developing and sharing new MRI sequences by implementing an interpreter for the open-source MRI pulse sequence format, Pulseq, on a Philips MRI scanner. Methods: The implementation involved modifying a few source code files to create a Pulseq interpreter for the Philips MRI system. Validation experiments were conducted using simulations and phantom scans performed on a 7T Achieva MRI system. The observed sequence and waveforms were compared to the intended ones, and the gradient waveforms produced by the scanner were verified using a field camera. Image reconstruction was performed using the raw k-space samples acquired from both the native vendor environment and the Pulseq interpreter. Results: The reconstructed images obtained through the Pulseq implementation were found to be comparable to those obtained through the native implementation. The performance of the Pulseq interpreter was assessed by profiling the CPU utilization of the MRI spectrometer, showing minimal resource utilization for certain sequences. Conclusion: The successful implementation of the Pulseq interpreter on the Philips MRI scanner demonstrates the feasibility of utilizing Pulseq sequences on Philips MRI scanners. This provides an open-source platform for MRI sequence development, facilitating collaboration among researchers and accelerating scientific progress in the field of MRI.
Compression Ratio Learning and Semantic Communications for Video Imaging
results: A policy-gradient method achieves an explicit trade-off between the compression (or transmission) rate and image distortion, improving imaging quality; numerical results show the superiority of the proposed methods over existing baselines.
Abstract
Camera sensors have been widely used in intelligent robotic systems. Developing camera sensors with high sensing efficiency has always been important to reduce power, memory, and other related resources. Inspired by recent successes with programmable sensors and deep optic methods, we design a novel video compressed sensing system with spatially-variant compression ratios, which achieves higher imaging quality than existing snapshot compressed imaging methods with the same sensing costs. In this article, we also investigate data transmission methods for programmable sensors, where the performance of communication systems is evaluated by the reconstructed images or videos rather than the transmission of sensor data itself. Usually, different reconstruction algorithms are designed for applications in high dynamic range imaging, video compressive sensing, or motion deblurring. This task-aware property inspires a semantic communication framework for programmable sensors. In this work, a policy-gradient based reinforcement learning method is introduced to achieve the explicit trade-off between the compression (or transmission) rate and the image distortion. Numerical results show the superiority of the proposed methods over existing baselines.
Domain Expansion via Network Adaptation for Solving Inverse Problems
paper_authors: Nebiyou Yismaw, Ulugbek S. Kamilov, M. Salman Asif
for: This paper addresses solving inverse problems in computational imaging with deep learning-based methods.
methods: Such methods fall into two categories: (1) learning a network that maps measurements to signal estimates, which is fragile under shifts in data distribution; (2) learning a signal prior and recovering the signal via optimization.
results: The paper studies the qualitative and quantitative effects of various domain shifts and proposes a flexible, parameter-efficient framework that adapts pretrained networks to such shifts, achieving significantly better performance and parameter efficiency on natural image, MRI, and CT reconstruction tasks.
Abstract
Deep learning-based methods deliver state-of-the-art performance for solving inverse problems that arise in computational imaging. These methods can be broadly divided into two groups: (1) learn a network to map measurements to the signal estimate, which is known to be fragile; (2) learn a prior for the signal to use in an optimization-based recovery. Despite the impressive results from the latter approach, many of these methods also lack robustness to shifts in data distribution, measurements, and noise levels. Such domain shifts result in a performance gap and in some cases introduce undesired artifacts in the estimated signal. In this paper, we explore the qualitative and quantitative effects of various domain shifts and propose a flexible and parameter efficient framework that adapt pretrained networks to such shifts. We demonstrate the effectiveness of our method for a number of natural image, MRI, and CT reconstructions tasks under domain, measurement model, and noise-level shifts. Our experiments demonstrate that our method provides significantly better performance and parameter efficiency compared to existing domain adaptation techniques.
results: Numerical results show that the proposed scheme achieves remarkable performance improvements for both uplink and downlink transmissions compared with other alternatives.
Abstract
Channel state information (CSI) estimation is a critical issue in the design of modern massive multiple-input multiple-output (mMIMO) networks. With the increasing number of users, assigning orthogonal pilots to everyone incurs a large overhead that strongly penalizes the system's spectral efficiency (SE). It thus becomes necessary to reuse pilots, giving rise to pilot contamination, a vital performance bottleneck of mMIMO networks. Reusing pilots among the users of the same cell is a desirable operating condition from the perspective of reducing training overheads; however, the intra-cell pilot contamination might worsen due to the users' proximity. Reconfigurable intelligent surfaces (RISs), capable of smartly controlling the wireless channel, can be leveraged for intra-cell pilot reuse. In this paper, our main contribution is a RIS-aided approach for intra-cell pilot reuse and the corresponding channel estimation method. Relying upon the knowledge of only statistical CSI, we optimize the RIS phase shifts based on a manifold optimization framework and the RIS positioning based on a deterministic approach. The extensive numerical results highlight the remarkable performance improvements the proposed scheme achieves (for both uplink and downlink transmissions) compared to other alternatives.
Longitudinal gOSNR Monitoring by Receiver-side Digital Signal Processing in Multi-Span Optical Transmission System
results: Experiments show that the method accurately estimates the gOSNR along the link, demonstrated in a 12-span link.
Abstract
We propose the world's first longitudinal gOSNR estimation using a correlation template method at the Rx, without any monitoring devices located in the middle of the link. The proposed method is experimentally demonstrated in a 12-span link with a commercial transceiver.
Joint Coding-Modulation for Digital Semantic Communications via Variational Autoencoder
results: Experimental results show that the proposed joint coding-modulation framework outperforms separate designs of semantic coding and modulation under various channel conditions, transmission rates, and modulation orders, and that its performance gap to analog semantic communication shrinks as the modulation order increases.
Abstract
Semantic communications have emerged as a new paradigm for improving communication efficiency by transmitting the semantic information of a source message that is most relevant to a desired task at the receiver. Most existing approaches typically utilize neural networks (NNs) to design end-to-end semantic communication systems, where NN-based semantic encoders output continuously distributed signals to be sent directly to the channel in an analog communication fashion. In this work, we propose a joint coding-modulation framework for digital semantic communications by using variational autoencoder (VAE). Our approach learns the transition probability from source data to discrete constellation symbols, thereby avoiding the non-differentiability problem of digital modulation. Meanwhile, by jointly designing the coding and modulation process together, we can match the obtained modulation strategy with the operating channel condition. We also derive a matching loss function with information-theoretic meaning for end-to-end training. Experiments conducted on image semantic communication validate that our proposed joint coding-modulation framework outperforms separate design of semantic coding and modulation under various channel conditions, transmission rates, and modulation orders. Furthermore, its performance gap to analog semantic communication reduces as the modulation order increases while enjoying the hardware implementation convenience.
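To illustrate the non-differentiability problem and one standard way around it, a sketch of a differentiable mapping from encoder logits to discrete constellation symbols via the Gumbel-softmax relaxation. The paper's VAE-based transition probabilities are a related but distinct design; the 16-QAM constellation and temperature below are assumptions.

```python
import torch
import torch.nn.functional as F

# 16-QAM constellation points (real/imag parts), one per discrete symbol
levels = torch.tensor([-3.0, -1.0, 1.0, 3.0])
constellation = torch.cartesian_prod(levels, levels)        # (16, 2)

def modulate(logits, tau=1.0, hard=True):
    """Map encoder logits over 16 symbols to constellation points while
    keeping gradients: Gumbel-softmax yields (nearly) one-hot weights, and
    the transmitted point is their convex combination."""
    weights = F.gumbel_softmax(logits, tau=tau, hard=hard)  # (batch, 16)
    return weights @ constellation                          # (batch, 2)

logits = torch.randn(8, 16, requires_grad=True)
tx = modulate(logits)
tx.sum().backward()       # gradients flow back to the semantic encoder
```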
Near and Far Field Model Mismatch: Implications on 6G Communications, Localization, and Sensing
results: The study shows that the mismatch between NF and FF models can degrade system performance metrics such as localization accuracy, sensing reliability, and communication efficiency.
Abstract
The upcoming 6G technology is expected to operate in near-field (NF) radiating conditions thanks to high-frequency and electrically large antenna arrays. While several studies have already addressed this possibility, it is worth noting that NF models introduce heightened complexity, the justification for which is not always evident in terms of performance improvements. Therefore, this paper delves into the implications of the disparity between NF and far-field (FF) models concerning communication, localization, and sensing systems. Such disparity might lead to a degradation of performance metrics like localization accuracy, sensing reliability, and communication efficiency. Through an exploration of the effects arising from the mismatches between NF and FF models, this study seeks to illuminate the challenges confronting system designers and offer valuable insights into the balance between model accuracy, which typically requires a high complexity and achievable performance. To substantiate our perspective, we also incorporate a numerical performance assessment confirming the repercussions of the mismatch between NF and FF models.
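For intuition on why 6G arrays land in the near field, the classic Fraunhofer (Rayleigh) distance 2D²/λ marks where the far-field model becomes adequate. This is a standard rule of thumb, not a result taken from the paper; the carrier and aperture values are illustrative.

```python
# Fraunhofer (Rayleigh) distance 2*D^2 / lambda: beyond it the far-field model
# is usually adequate; closer in, near-field wavefront curvature matters.
c = 3e8                        # speed of light, m/s
f = 100e9                      # 100 GHz carrier (illustrative)
D = 0.5                        # 0.5 m array aperture (illustrative)
wavelength = c / f
rayleigh = 2 * D**2 / wavelength
print(f"NF/FF boundary ~ {rayleigh:.0f} m")   # ~167 m: much of a cell is near field
```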
3D Non-Stationary Channel Measurement and Analysis for MaMIMO-UAV Communications
results: The paper measures and analyzes channel statistics in the space, time, and frequency domains based on power delay profiles (PDPs). It proposes the stationary angle (SA) as a supplementary metric to the stationary distance (SD) in the time domain, analyzes the coherence bandwidth and RMS delay spread for frequency stationarity, and examines spatial correlations between array elements to characterize the spatial stationarity of the MaMIMO array.
Abstract
Unmanned aerial vehicles (UAVs) have gained popularity in the communications research community because of their versatility in placement and potential to extend the functions of communication networks. However, there still remains a gap in existing works regarding detailed and measurement-verified air-to-ground (A2G) Massive Multi-Input Multi-Output (MaMIMO) channel characteristics, which play an important role in realistic deployment. In this paper, we first design a UAV MaMIMO communication platform for channel acquisition. We then use the testbed to measure uplink Channel State Information (CSI) between a rotary-wing drone and a 64-element MaMIMO base station (BS). For characterization, we focus on multidimensional channel stationarity, a fundamental metric in communication systems. Afterward, we present measurement results and analyze the channel statistics based on power delay profiles (PDPs) considering the space, time, and frequency domains. We propose the stationary angle (SA) as a supplementary metric to the stationary distance (SD) in the time domain. We analyze the coherence bandwidth and RMS delay spread for frequency stationarity. Finally, spatial correlations between elements are analyzed to indicate the spatial stationarity of the array. The space-time-frequency channel stationarity characterization will benefit the physical layer design of MaMIMO-UAV communications.
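For reference, the standard computation of mean excess delay and RMS delay spread from a measured PDP, the frequency-domain statistics analyzed above (coherence bandwidth is inversely related to the RMS delay spread). These are textbook definitions, not the paper's processing code; the delays and powers below are toy values.

```python
import numpy as np

def rms_delay_spread(pdp, delays):
    """Mean excess delay and RMS delay spread from a power delay profile."""
    p = pdp / pdp.sum()                       # normalize PDP to a distribution
    mean_delay = np.sum(p * delays)
    rms = np.sqrt(np.sum(p * delays**2) - mean_delay**2)
    return mean_delay, rms

delays = np.array([0.0, 50e-9, 120e-9, 300e-9])   # toy multipath delays (s)
pdp = np.array([1.0, 0.4, 0.15, 0.05])            # linear-scale tap powers
print(rms_delay_spread(pdp, delays))
```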
ChannelComp: A General Method for Computation by Communications
paper_authors: Saeed Razavikia, José Mairton Barros Da Silva Júnior, Carlo Fischione
for: The paper proposes a new digital channel computing method named ChannelComp, which can use digital as well as analog modulations, and can achieve arbitrary function computation over-the-air.
methods: The proposed method uses a feasibility optimization problem to ascertain the optimal modulation for computing arbitrary functions over-the-air, and proposes pre-coders to adapt existing digital modulation schemes for computing the function over the multiple access channel.
results: The simulation results show that ChannelComp outperforms AirComp, particularly for product functions, with more than 10 dB improvement of the computation error.
Over-the-air computation (AirComp) is a well-known technique by which several wireless devices transmit by analog amplitude modulation to achieve a sum of their transmit signals at a common receiver. The underlying physical principle is the superposition property of the radio waves. Since such superposition is analog and in amplitude, it is natural that AirComp uses analog amplitude modulations. Unfortunately, this is impractical because most wireless devices today use digital modulations. It would be highly desirable to use digital communications because of their numerous benefits, such as error correction, synchronization, acquisition of channel state information, and widespread use. However, when we use digital modulations for AirComp, a general belief is that the superposition property of the radio waves returns a meaningless overlapping of the digital signals. In this paper, we break through such beliefs and propose an entirely new digital channel computing method named ChannelComp, which can use digital as well as analog modulations. We propose a feasibility optimization problem that ascertains the optimal modulation for computing arbitrary functions over-the-air. Additionally, we propose pre-coders to adapt existing digital modulation schemes for computing the function over the multiple access channel. The simulation results verify the superior performance of ChannelComp compared to AirComp, particularly for the product functions, with more than 10 dB improvement of the computation error.
Plane Constraints Aided Multi-Vehicle Cooperative Positioning Using Factor Graph Optimization
results: Improved positioning performance, especially when inter-vehicle ranging measurements are interrupted.
Abstract
The development of vehicle-to-vehicle (V2V) communication facilitates the study of cooperative positioning (CP) techniques for vehicular applications. CP methods can improve positioning availability and accuracy through inter-vehicle ranging and data exchange between vehicles. However, inter-vehicle ranging can easily be interrupted by many factors, such as obstacles between two cars. Without inter-vehicle ranging, the other cooperative data, such as vehicle positions, are wasted, leading to performance degradation of range-based CP methods. To fully utilize the cooperative data and mitigate the impact of inter-vehicle ranging loss, a novel cooperative positioning method aided by plane constraints is proposed in this paper. The positioning results received from cooperative vehicles are used to construct the road plane for each vehicle. The plane parameters are then introduced into the CP scheme to impose constraints on positioning solutions. The state-of-the-art factor graph optimization (FGO) algorithm is employed to integrate the plane constraints with raw data of Global Navigation Satellite Systems (GNSS) as well as inter-vehicle ranging measurements. The proposed CP method can resist interruptions of inter-vehicle ranging, since the plane constraints are computed using only position-related data. A vehicle can still benefit from the position data of cooperative vehicles even if inter-vehicle ranging is unavailable. The experimental results indicate the superiority of the proposed CP method in positioning performance over existing methods, especially when inter-ranging interruptions occur.
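A sketch of the plane-constraint idea: fit a road plane to cooperative vehicles' reported positions by least squares, then use the ego vehicle's signed point-to-plane distance as the residual a factor-graph optimizer would drive toward zero. The FGO integration itself is not shown, and the toy positions are illustrative.

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane through 3D points via SVD: returns unit normal n
    and offset d with n.x + d = 0. Built from cooperating vehicles' reported
    positions only, so it needs no inter-vehicle ranging."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    n = vt[-1]                      # normal = direction of least variance
    return n, -n @ centroid

def plane_residual(x, n, d):
    """Signed point-to-plane distance: the plane-constraint factor for x."""
    return n @ x + d

coop_positions = np.array([[0.0, 0.0, 0.10], [10.0, 0.0, 0.12],
                           [0.0, 5.0, 0.09], [12.0, 6.0, 0.11]])
n, d = fit_plane(coop_positions)
print(plane_residual(np.array([3.0, 2.0, 1.5]), n, d))  # ego is ~1.4 m off-plane
```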
ISAC 4D Imaging System Based on 5G Downlink Millimeter Wave Signal
results: Simulations show that the proposed method provides better imaging results. The code is available at https://github.com/MrHaobolu/ISAC_4D_Imaging.git.
Abstract
Integrated Sensing and Communication (ISAC) has become a key technology for the 5th generation (5G) and 6th generation (6G) wireless communications due to its high spectrum utilization efficiency. Utilizing infrastructure such as 5G Base Stations (BS) to realize environmental imaging and reconstruction is important for promoting the construction of smart cities. Current 4D imaging methods utilizing Frequency Modulated Continuous Wave (FMCW) based Fast Fourier Transform (FFT) are not suitable for ISAC scenarios due to the higher bandwidth occupation and lower resolution. We propose a 4D (3D-Coordinates, Velocity) imaging method with higher sensing accuracy based on 2D-FFT with 2D-MUSIC utilizing standard 5G Downlink (DL) millimeter wave (mmWave) signals. To improve the sensing precision we also design a transceiver antenna array element arrangement scheme based on MIMO virtual aperture technique. We further propose a target detection algorithm based on multi-dimensional Constant False Alarm (CFAR) detection, which optimizes the ISAC imaging signal processing flow and reduces the computational pressure of signal processing. Simulation results show that our proposed method has better imaging results. The code is publicly available at https://github.com/MrHaobolu/ISAC_4D_IMaging.git.
Mutual Information Metrics for Uplink MIMO-OFDM Integrated Sensing and Communication System
results: Monte Carlo simulation results show that, compared with other waveform optimization schemes, the proposed ISAC scheme achieves the best overall performance.
Abstract
As uplink sensing has the advantage of easy implementation, it attracts great attention for integrated sensing and communication (ISAC) systems. This paper presents an uplink ISAC system based on multi-input multi-output orthogonal frequency division multiplexing (MIMO-OFDM) technology. Mutual information (MI) is introduced as a unified metric to evaluate the performance of communication and sensing. First, the upper and lower bounds of communication and sensing MI are derived in detail based on the interaction between communication and sensing, and the ISAC waveform is optimized by maximizing the weighted sum of sensing and communication MI. Monte Carlo simulation results show that, compared with other waveform optimization schemes, the proposed ISAC scheme has the best overall performance.
HoloFed: Environment-Adaptive Positioning via Multi-band Reconfigurable Holographic Surfaces and Federated Learning
results: 57% lower positioning error variance compared to a beam-scanning baseline, with effective adaptation to diverse environments.
Abstract
Positioning is an essential service for various applications and is expected to be integrated with existing communication infrastructures in 5G and 6G. Though current Wi-Fi and cellular base stations (BSs) can be used to support this integration, the resulting precision is unsatisfactory due to the lack of precise control of the wireless signals. Recently, BSs adopting reconfigurable holographic surfaces (RHSs) have been advocated for positioning as RHSs' large number of antenna elements enable generation of arbitrary and highly-focused signal beam patterns. However, existing designs face two major challenges: i) RHSs only have limited operating bandwidth, and ii) the positioning methods cannot adapt to the diverse environments encountered in practice. To overcome these challenges, we present HoloFed, a system providing high-precision environment-adaptive user positioning services by exploiting multi-band (MB) RHSs and federated learning (FL). For improving the positioning performance, a lower bound on the error variance is obtained and utilized for guiding the MB-RHS's digital and analog beamforming design. For better adaptability while preserving privacy, an FL framework is proposed for users to collaboratively train a position estimator, where we exploit the transfer learning technique to handle the lack of position labels of the users. Moreover, a scheduling algorithm for the BS to select which users train the position estimator is designed, jointly considering the convergence and efficiency of FL. Our simulation results confirm that HoloFed achieves a 57% lower positioning error variance compared to a beam-scanning baseline and can effectively adapt to diverse environments.
Streaming Probabilistic PCA for Missing Data with Heteroscedastic Noise
paper_authors: Kyle Gilman, David Hong, Jeffrey A. Fessler, Laura Balzano
for: This paper aims to develop a novel algorithm for principal component analysis (PCA) in streaming data with missing entries and heteroscedastic noise.
methods: The proposed algorithm, called Streaming Heteroscedastic Algorithm for PCA (SHASTA-PCA), uses a stochastic alternating expectation maximization approach to jointly learn the low-rank latent factors and the unknown noise variances from streaming data.
results: Numerical experiments show that SHASTA-PCA outperforms state-of-the-art streaming PCA algorithms in the heteroscedastic setting, and it is applied to highly-heterogeneous real data from astronomy.
Abstract
Streaming principal component analysis (PCA) is an integral tool in large-scale machine learning for rapidly estimating low-dimensional subspaces of very high dimensional and high arrival-rate data with missing entries and corrupting noise. However, modern trends increasingly combine data from a variety of sources, meaning they may exhibit heterogeneous quality across samples. Since standard streaming PCA algorithms do not account for non-uniform noise, their subspace estimates can quickly degrade. On the other hand, the recently proposed Heteroscedastic Probabilistic PCA Technique (HePPCAT) addresses this heterogeneity, but it was not designed to handle missing entries and streaming data, nor does it adapt to non-stationary behavior in time series data. This paper proposes the Streaming HeteroscedASTic Algorithm for PCA (SHASTA-PCA) to bridge this divide. SHASTA-PCA employs a stochastic alternating expectation maximization approach that jointly learns the low-rank latent factors and the unknown noise variances from streaming data that may have missing entries and heteroscedastic noise, all while maintaining a low memory and computational footprint. Numerical experiments validate the superior subspace estimation of our method compared to state-of-the-art streaming PCA algorithms in the heteroscedastic setting. Finally, we illustrate SHASTA-PCA applied to highly-heterogeneous real data from astronomy.
Multiscale information fusion for fault detection and localization of battery energy storage systems
results: Experimental results show that the proposed multiscale detection index can swiftly and accurately detect short-circuit abnormalities in battery energy storage systems and precisely localize the abnormal battery cells.Abstract
Battery energy storage system (BESS) has great potential to combat global warming. However, internal abnormalities in the BESS may develop into thermal runaway, causing serious safety incidents. In this study, the multiscale information fusion is proposed for thermal abnormality detection and localization in BESSs. We introduce the concept of dissimilarity entropy as a means to identify anomalies for lumped variables, whereas spatial and temporal entropy measures are presented for the detection of anomalies for distributed variables. Through appropriate parameter optimization, these three entropy functions are integrated into the comprehensive multiscale detection index, which outperforms traditional single-scale detection methods. The proposed multiscale statistic has good interpretability in terms of system energy concentration. Battery system internal short circuit (ISC) experiments have demonstrated that our proposed method can swiftly identify ISC abnormalities and accurately pinpoint the problematic battery cells.
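The entropy functions are not defined in the abstract; below is a minimal sketch under the assumption that each cell's dissimilarity is its squared deviation from the pack mean, normalized into a distribution whose Shannon entropy drops when deviation energy concentrates on a single faulty cell — consistent with the "system energy concentration" interpretation above.

```python
import numpy as np

def dissimilarity_entropy(values):
    """Shannon entropy of normalized per-cell dissimilarities.
    Low entropy => deviation energy concentrated on few cells (likely fault)."""
    d = (values - values.mean()) ** 2
    p = np.clip(d / (d.sum() + 1e-12), 1e-12, 1.0)
    return -np.sum(p * np.log(p)), int(np.argmax(d))  # entropy and suspect cell

rng = np.random.default_rng(2)
healthy = 3.7 + 0.005 * rng.normal(size=16)   # 16 cell voltages, normal spread
faulty = healthy.copy()
faulty[5] -= 0.15                             # cell 5 shows an ISC-like voltage drop

h0, _ = dissimilarity_entropy(healthy)
h1, cell = dissimilarity_entropy(faulty)
print(f"healthy entropy={h0:.2f}  faulty entropy={h1:.2f}  suspect cell={cell}")
```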
Rate Compatible LDPC Neural Decoding Network: A Multi-Task Learning Approach
paper_authors: Yukun Cheng, Wei Chen, Lun Li, Bo Ai
for: Improving the decoding performance of LDPC codes
methods: Multi-task learning exploiting the structure of raptor-like LDPC codes
results: Handles multiple code rates without sacrificing frame-error-rate performance.Abstract
Deep learning based decoding networks have shown significant improvement in decoding LDPC codes, but the neural decoders are limited by rate-matching operations such as puncturing or extending, thus needing to train multiple decoders with different code rates for a variety of channel conditions. In this correspondence, we propose a Multi-Task Learning based rate-compatible LDPC decoding network, which utilizes the structure of raptor-like LDPC codes and can deal with multiple code rates. In the proposed network, different portions of parameters are activated to deal with distinct code rates, which leads to parameter sharing among tasks. Numerical experiments demonstrate the effectiveness of the proposed method. Training the specially designed network under multiple code rates makes the decoder compatible with multiple code rates without sacrificing frame error rate performance.
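A minimal PyTorch sketch of the parameter-sharing idea: a shared decoder trunk in which a rate-specific binary mask activates a nested portion of the hidden units per code rate, loosely mimicking how raptor-like codes extend one another. Layer sizes, the masking rule, and the toy training loop are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class RateCompatibleDecoder(nn.Module):
    """Shared trunk; each code rate activates a nested subset of hidden units."""
    def __init__(self, n_llr=64, hidden=256, rates=(0.8, 0.5, 0.2)):
        super().__init__()
        self.fc1 = nn.Linear(n_llr, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.out = nn.Linear(hidden, n_llr)
        self.masks = {}
        for r in rates:
            m = torch.zeros(hidden)
            m[: int(hidden * min(1.5 - r, 1.0))] = 1.0  # lower rate -> more units
            self.masks[r] = m

    def forward(self, llr, rate):
        m = self.masks[rate].to(llr.device)
        h = torch.relu(self.fc1(llr)) * m   # rate-specific portion of parameters
        h = torch.relu(self.fc2(h)) * m
        return self.out(h)                  # refined LLRs

dec = RateCompatibleDecoder()
llr = torch.randn(8, 64)                     # a batch of channel LLRs
bits = torch.randint(0, 2, (8, 64)).float()  # stand-in for transmitted codeword bits
for rate in (0.8, 0.5, 0.2):                 # multi-task training across rates
    loss = nn.functional.binary_cross_entropy_with_logits(dec(llr, rate), bits)
    loss.backward()                          # (optimizer step omitted)
```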
for: The paper is written for researchers and developers working on emotional speech synthesis and related areas, as well as those interested in exploring the use of large language models for script generation.
methods: The paper proposes an automatic script generation method that uses a large language model (ChatGPT) and prompt engineering to produce emotional scripts with nonverbal vocalizations (NVs).
results: The paper demonstrates the effectiveness of the proposed method by showing that the generated scripts have better phoneme coverage and emotion recognizability than previous Japanese emotional speech corpora, and also highlights the challenges of synthesizing emotional speech with NVs.Abstract
We present the JVNV, a Japanese emotional speech corpus with verbal content and nonverbal vocalizations whose scripts are generated by a large-scale language model. Existing emotional speech corpora lack not only proper emotional scripts but also nonverbal vocalizations (NVs) that are essential expressions in spoken language to express emotions. We propose an automatic script generation method to produce emotional scripts by providing seed words with sentiment polarity and phrases of nonverbal vocalizations to ChatGPT using prompt engineering. We select 514 scripts with balanced phoneme coverage from the generated candidate scripts with the assistance of emotion confidence scores and language fluency scores. We demonstrate the effectiveness of JVNV by showing that JVNV has better phoneme coverage and emotion recognizability than previous Japanese emotional speech corpora. We then benchmark JVNV on emotional text-to-speech synthesis using discrete codes to represent NVs. We show that there still exists a gap between the performance of synthesizing read-aloud speech and emotional speech, and adding NVs in the speech makes the task even harder, which brings new challenges for this task and makes JVNV a valuable resource for relevant works in the future. To our best knowledge, JVNV is the first speech corpus that generates scripts automatically using large language models.
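A minimal sketch of the pipeline shape only; `call_chatgpt`, `emotion_confidence`, and `fluency` are hypothetical placeholders for the actual LLM API call and the two scoring models, and the prompts, seed words, NV phrases, and thresholds are invented for illustration.

```python
import random

EMOTIONS = {"joy": "positive", "anger": "negative",
            "sadness": "negative", "surprise": "positive"}
NV_PHRASES = {"joy": ["haha", "wow"], "anger": ["ugh", "tsk"],
              "sadness": ["sigh", "sob"], "surprise": ["gasp", "huh"]}

def build_prompt(emotion, seed_word):
    """Prompt engineering: seed word with sentiment polarity plus NV phrases."""
    return (f"Write one short Japanese utterance expressing {emotion} "
            f"({EMOTIONS[emotion]} polarity), containing the seed word "
            f"'{seed_word}' and at least one nonverbal vocalization from "
            f"{NV_PHRASES[emotion]}, marked in brackets.")

def call_chatgpt(prompt):          # placeholder for the real LLM API call
    return f"[{random.choice(NV_PHRASES['joy'])}] ..."

def emotion_confidence(script):    # placeholder: emotion-classifier score
    return random.random()

def fluency(script):               # placeholder: language-model fluency score
    return random.random()

candidates = [call_chatgpt(build_prompt("joy", w)) for w in ["嬉しい", "最高"]]
# Keep scripts scoring well on both axes (phoneme-coverage balancing omitted).
selected = [s for s in candidates
            if emotion_confidence(s) > 0.5 and fluency(s) > 0.5]
```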
Audio compression-assisted feature extraction for voice replay attack detection
results: Tested with extensive data augmentation and three classifiers, the method achieves the lowest EER of 22.71% on the ASVspoof 2021 physical access (PA) set.Abstract
Replay attack is one of the most effective and simplest voice spoofing attacks. Detecting replay attacks is challenging, according to the Automatic Speaker Verification Spoofing and Countermeasures Challenge 2021 (ASVspoof 2021), because they involve a loudspeaker, a microphone, and acoustic conditions (e.g., background noise). One obstacle to detecting replay attacks is finding robust feature representations that reflect the channel noise information added to the replayed speech. This study proposes a feature extraction approach that uses audio compression for assistance. Audio compression compresses audio to preserve content and speaker information for transmission. The missed information after decompression is expected to contain content- and speaker-independent information (e.g., channel noise added during the replay process). We conducted a comprehensive experiment with a few data augmentation techniques and 3 classifiers on the ASVspoof 2021 physical access (PA) set and confirmed the effectiveness of the proposed feature extraction approach. To the best of our knowledge, the proposed approach achieves the lowest EER at 22.71% on the ASVspoof 2021 PA evaluation set.
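A minimal sketch of the feature idea, assuming a generic lossy codec round-trip (here a uniform quantizer standing in for a real audio codec) and a log-magnitude STFT of the original-minus-decoded residual as the channel-noise-sensitive feature.

```python
import numpy as np

def codec_round_trip(x, levels=256):
    """Placeholder lossy codec: uniform quantization stands in for a real
    audio codec's encode-decode cycle."""
    q = np.round(x * (levels / 2)) / (levels / 2)
    return np.clip(q, -1.0, 1.0)

def log_stft_mag(x, n_fft=512, hop=128):
    frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop]
    spec = np.fft.rfft(frames * np.hanning(n_fft), axis=1)
    return np.log(np.abs(spec) + 1e-8)

rng = np.random.default_rng(3)
x = rng.uniform(-0.5, 0.5, 16000)      # 1 s of toy audio
residual = x - codec_round_trip(x)     # content/speaker info mostly removed,
features = log_stft_mag(residual)      # channel-noise cues remain
print(features.shape)                  # (frames, bins), fed to a classifier
```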
results: The findings indicate that the cochlea's transverse resonance and standing waves arise from its shape, a conical acoustic tube coiled into a spiral with non-uniformities on its internal surface, and that the scala media acts as an information collection and amplification system along the cochlear spiral.Abstract
In this work, we investigate the phenomenon of transverse resonance and transverse standing waves that occur within the cochlea of living organisms. It is demonstrated that the predisposing factor for their occurrence is the cochlear shape, which resembles a conical acoustic tube coiled into a spiral and exhibits non-uniformities on its internal surface. This cochlear structure facilitates the analysis of constituent sound signals akin to a spectrum analyzer, with a corresponding interpretation of the physical processes occurring in the auditory system. Additionally, we conclude that the cochlear duct's scala media, composed of a system of membranes and the organ of Corti, functions primarily as an information collection and amplification system along the cochlear spiral. Collectively, these findings enable the development of a novel, highly realistic wave model of the auditory system in living organisms based on a technocratic approach within the scientific context.
Super Denoise Net: Speech Super Resolution with Noise Cancellation in Low Sampling Rate Noisy Environments
results: On the DNS 2020 no-reverb test set, the SDNet model achieves higher objective and subjective scores than baseline speech denoising and super-resolution models.Abstract
Speech super-resolution (SSR) aims to predict a high resolution (HR) speech signal from its low resolution (LR) corresponding part. Most neural SSR models focus on producing the final result in a noise-free environment by recovering the spectrogram of high-frequency part of the signal and concatenating it with the original low-frequency part. Although these methods achieve high accuracy, they become less effective when facing the real-world scenario, where unavoidable noise is present. To address this problem, we propose a Super Denoise Net (SDNet), a neural network for a joint task of super-resolution and noise reduction from a low sampling rate signal. To that end, we design gated convolution and lattice convolution blocks to enhance the repair capability and capture information in the time-frequency axis, respectively. The experiments show our method outperforms baseline speech denoising and SSR models on DNS 2020 no-reverb test set with higher objective and subjective scores.
Tech. Report: Genuinization of Speech waveform PMF for speaker detection spoofing and countermeasures
results: Experiments show that genuinization applied to spoofing attacks degrades spoofing detection performance by up to a factor of 10, while integrating it into the countermeasures yields large detection improvements across different experimental settings.Abstract
In the context of spoofing attacks in speaker recognition systems, we observed that the waveform probability mass function (PMF) of genuine speech differs significantly from the PMF of speech resulting from the attacks. This is true for synthesized or converted speech as well as replayed speech. We also noticed that this observation seems to have a significant impact on spoofing detection performance. In this article, we propose an algorithm, denoted genuinization, capable of reducing the waveform distribution gap between authentic speech and spoofing speech. Our genuinization algorithm is evaluated on ASVspoof 2019 challenge datasets, using the baseline system provided by the challenge organization. We first assess the influence of genuinization on spoofing performance. Using genuinization for the spoofing attacks degrades spoofing detection performance by up to a factor of 10. Next, we integrate the genuinization algorithm in the spoofing countermeasures and we observe a huge spoofing detection improvement in different cases. The results of our experiments show clearly that waveform distribution plays an important role and must be taken into account by anti-spoofing systems.
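The abstract does not give the genuinization algorithm explicitly; one natural reading is classic histogram (CDF) matching of waveform sample values, sketched below, where spoofed samples are remapped so their PMF tracks a reference genuine-speech PMF. This is an assumption for illustration, not necessarily the paper's exact procedure.

```python
import numpy as np

def genuinize(spoof, genuine_ref, bins=4096):
    """Remap spoofed waveform samples so their PMF matches a genuine reference
    via CDF (histogram) matching. Inputs are float waveforms in [-1, 1]."""
    edges = np.linspace(-1.0, 1.0, bins + 1)
    centers = (edges[:-1] + edges[1:]) / 2
    h_s, _ = np.histogram(spoof, bins=edges)
    h_g, _ = np.histogram(genuine_ref, bins=edges)
    cdf_s = np.cumsum(h_s) / h_s.sum()
    cdf_g = np.cumsum(h_g) / h_g.sum()
    cdf_g = cdf_g + np.arange(bins) * 1e-12   # strictly increasing for interp
    idx = np.clip(np.digitize(spoof, edges) - 1, 0, bins - 1)
    # Map each spoofed sample to the genuine amplitude with the same CDF value.
    return np.interp(cdf_s[idx], cdf_g, centers)

rng = np.random.default_rng(4)
genuine = np.tanh(rng.laplace(scale=0.1, size=80000))      # toy "genuine" PMF
spoof = np.clip(rng.normal(scale=0.2, size=80000), -1, 1)  # toy "spoofed" PMF
matched = genuinize(spoof, genuine)    # PMF of `matched` now tracks `genuine`
```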
AdvSV: An Over-the-Air Adversarial Attack Dataset for Speaker Verification
results: The study shows that the dataset enables over-the-air adversarial attacks on speaker verification systems and supports simulating such attacks under diverse acoustic conditions.Abstract
It is known that deep neural networks are vulnerable to adversarial attacks. Although Automatic Speaker Verification (ASV) built on top of deep neural networks exhibits robust performance in controlled scenarios, many studies confirm that ASV is vulnerable to adversarial attacks. The lack of a standard dataset is a bottleneck for further research, especially reproducible research. In this study, we developed an open-source adversarial attack dataset for speaker verification research. As an initial step, we focused on the over-the-air attack. An over-the-air adversarial attack involves a perturbation generation algorithm, a loudspeaker, a microphone, and an acoustic environment. The variations in the recording configurations make it very challenging to reproduce previous research. The AdvSV dataset is constructed using the Voxceleb1 Verification test set as its foundation. This dataset employs representative ASV models subjected to adversarial attacks and records adversarial samples to simulate over-the-air attack settings. The scope of the dataset can be easily extended to include more types of adversarial attacks. The dataset will be released to the public under the CC-BY license. In addition, we also provide a detection baseline for reproducible research.
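A minimal sketch of the physical chain the dataset records, with simple stand-ins for each stage: loudspeaker saturation, a synthetic exponentially decaying room impulse response, and additive microphone noise. All parameters are illustrative; the dataset itself uses real loudspeakers, microphones, and rooms.

```python
import numpy as np

rng = np.random.default_rng(5)

def simulate_over_the_air(x, fs=16000, rt60=0.3, snr_db=20):
    """Toy over-the-air chain: loudspeaker nonlinearity -> room reverberation
    (synthetic exponentially decaying RIR) -> additive microphone noise."""
    y = np.tanh(1.5 * x)                                  # loudspeaker saturation
    n = int(rt60 * fs)
    rir = rng.normal(size=n) * np.exp(-6.9 * np.arange(n) / n)
    rir /= np.abs(rir).max()
    y = np.convolve(y, rir)[: len(x)]                     # room acoustics
    noise = rng.normal(size=len(y))
    noise *= np.linalg.norm(y) / (np.linalg.norm(noise) * 10 ** (snr_db / 20))
    return y + noise                                      # microphone pickup

adv = rng.uniform(-0.5, 0.5, 16000)     # stands in for an adversarial utterance
recorded = simulate_over_the_air(adv)   # what the ASV system actually receives
```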
An Initial Investigation of Neural Replay Simulator for Over-the-Air Adversarial Perturbations to Automatic Speaker Verification
paper_authors: Jiaqi Li, Li Wang, Liumeng Xue, Lei Wang, Zhizheng Wu
for: Protecting automatic speaker verification systems against physical access attacks
methods: A neural replay simulator is used to improve the robustness of over-the-air adversarial attacks
results: The neural replay simulator considerably increases the success rate of over-the-air adversarial attacks, raising security concerns for speaker verification in physical access applications.Abstract
Deep Learning has advanced Automatic Speaker Verification (ASV) in the past few years. Although it is known that deep learning-based ASV systems are vulnerable to adversarial examples in digital access, there are few studies on adversarial attacks in the context of physical access, where a replay process (i.e., over the air) is involved. An over-the-air attack involves a loudspeaker, a microphone, and a replaying environment that impacts the movement of the sound wave. Our initial experiment confirms that the replay process impacts the effectiveness of the over-the-air attack performance. This study performs an initial investigation towards utilizing a neural replay simulator to improve over-the-air adversarial attack robustness. This is achieved by using a neural waveform synthesizer to simulate the replay process when estimating the adversarial perturbations. Experiments conducted on the ASVspoof2019 dataset confirm that the neural replay simulator can considerably increase the success rates of over-the-air adversarial attacks. This raises the concern for adversarial attacks on speaker verification in physical access applications.
results: Experiments show that our architecture, combined with our transformer-based proposals, achieves better localization performance than state-of-the-art methods on the CUB, ILSVRC, OpenImages, and TelDrone datasets, performing both localization and classification from image-class labels alone.Abstract
Self-supervised vision transformers (SSTs) have shown great potential to yield rich localization maps that highlight different objects in an image. However, these maps remain class-agnostic since the model is unsupervised. They often tend to decompose the image into multiple maps containing different objects while being unable to distinguish the object of interest from background noise objects. In this paper, Discriminative Pseudo-label Sampling (DiPS) is introduced to leverage these class-agnostic maps for weakly-supervised object localization (WSOL), where only image-class labels are available. Given multiple attention maps, DiPS relies on a pre-trained classifier to identify the most discriminative regions of each attention map. This ensures that the selected ROIs cover the correct image object while discarding the background ones, and, as such, provides a rich pool of diverse and discriminative proposals to cover different parts of the object. Subsequently, these proposals are used as pseudo-labels to train our new transformer-based WSOL model designed to perform classification and localization tasks. Unlike standard WSOL methods, DiPS optimizes performance in both tasks by using a transformer encoder and a dedicated output head for each task, each trained using dedicated loss functions. To avoid overfitting a single proposal and promote better object coverage, a single proposal is randomly selected among the top ones for a training image at each training step. Experimental results on the challenging CUB, ILSVRC, OpenImages, and TelDrone datasets indicate that our architecture, in combination with our transformer-based proposals, can yield better localization performance than state-of-the-art methods.
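A minimal sketch of the sampling step, assuming ROI extraction by thresholding each attention map (a simplification of DiPS's actual region selection): score each map's box with the pre-trained classifier, keep the top-k most discriminative ones, and randomly pick one per training step as the pseudo-label.

```python
import torch

def map_to_box(attn, thresh=0.6):
    """Bounding box of the region where a (H, W) attention map exceeds a
    fraction of its max -- an illustrative stand-in for ROI extraction."""
    m = attn > thresh * attn.max()
    ys, xs = torch.nonzero(m, as_tuple=True)
    return xs.min().item(), ys.min().item(), xs.max().item(), ys.max().item()

def sample_pseudo_label(image, attn_maps, classifier, target, top_k=3):
    """Score each attention map's ROI with the classifier; randomly pick one
    of the top-k most discriminative ROIs as this step's pseudo-label."""
    scores, boxes = [], []
    for attn in attn_maps:                       # K class-agnostic maps
        x0, y0, x1, y1 = map_to_box(attn)
        crop = image[:, :, y0:y1 + 1, x0:x1 + 1]
        crop = torch.nn.functional.interpolate(crop, size=(64, 64))
        with torch.no_grad():
            s = classifier(crop).softmax(-1)[0, target]  # class confidence
        scores.append(s)
        boxes.append((x0, y0, x1, y1))
    top = torch.topk(torch.stack(scores), k=min(top_k, len(scores))).indices
    return boxes[top[torch.randint(len(top), (1,))].item()]

# Toy usage with a random stand-in "classifier".
img = torch.rand(1, 3, 128, 128)
maps = [torch.rand(128, 128) for _ in range(4)]
clf = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 10))
box = sample_pseudo_label(img, maps, clf, target=2)
```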
HydraViT: Adaptive Multi-Branch Transformer for Multi-Label Disease Classification from Chest X-ray Images
paper_authors: Şaban Öztürk, M. Yiğit Turalı, Tolga Çukur
for: Improving multi-label classification performance on chest X-ray images.
methods: The method synergistically combines a transformer backbone with a multi-branch output module, using self-attention to adaptively focus on task-critical regions, with an independent branch dedicated to each disease label and an aggregated branch across labels.
results: Experiments show that HydraViT outperforms competing attention-guided, region-guided, and semantic-guided methods in multi-label classification performance by an average of 1.2%, 1.4%, and 1.0%, respectively.Abstract
Chest X-ray is an essential diagnostic tool in the identification of chest diseases given its high sensitivity to pathological abnormalities in the lungs. However, image-driven diagnosis is still challenging due to heterogeneity in size and location of pathology, as well as visual similarities and co-occurrence of separate pathology. Since disease-related regions often occupy a relatively small portion of diagnostic images, classification models based on traditional convolutional neural networks (CNNs) are adversely affected given their locality bias. While CNNs were previously augmented with attention maps or spatial masks to guide focus on potentially critical regions, learning localization guidance under heterogeneity in the spatial distribution of pathology is challenging. To improve multi-label classification performance, here we propose a novel method, HydraViT, that synergistically combines a transformer backbone with a multi-branch output module with learned weighting. The transformer backbone enhances sensitivity to long-range context in X-ray images, while using the self-attention mechanism to adaptively focus on task-critical regions. The multi-branch output module dedicates an independent branch to each disease label to attain robust learning across separate disease classes, along with an aggregated branch across labels to maintain sensitivity to co-occurrence relationships among pathology. Experiments demonstrate that, on average, HydraViT outperforms competing attention-guided methods by 1.2%, region-guided methods by 1.4%, and semantic-guided methods by 1.0% in multi-label classification performance.
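A minimal sketch of the multi-branch output module with learned weighting, sitting on top of any backbone (a stub here): one independent branch per disease label plus an aggregated branch across labels, blended by a learned per-label weight. The sigmoid blending rule is an illustrative assumption.

```python
import torch
import torch.nn as nn

class MultiBranchHead(nn.Module):
    """Per-label branches + one aggregated branch, blended by learned weights."""
    def __init__(self, feat_dim=256, n_labels=14):
        super().__init__()
        self.label_branches = nn.ModuleList(
            [nn.Linear(feat_dim, 1) for _ in range(n_labels)])  # one per disease
        self.agg_branch = nn.Linear(feat_dim, n_labels)         # co-occurrence branch
        self.alpha = nn.Parameter(torch.zeros(n_labels))        # learned blend weight

    def forward(self, feats):                 # feats: (B, feat_dim) from the backbone
        per_label = torch.cat([b(feats) for b in self.label_branches], dim=1)
        agg = self.agg_branch(feats)
        w = torch.sigmoid(self.alpha)         # per-label blending in [0, 1]
        return w * per_label + (1 - w) * agg  # multi-label logits (B, n_labels)

backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 256))  # stub backbone
head = MultiBranchHead()
logits = head(backbone(torch.randn(2, 3, 224, 224)))
loss = nn.functional.binary_cross_entropy_with_logits(
    logits, torch.randint(0, 2, (2, 14)).float())   # multi-label BCE
```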
WinSyn: A High Resolution Testbed for Synthetic Data
results: Provides a domain-matched photorealistic procedural model that enables experimentation over a variety of parameter distributions and engineering approaches, along with a second corresponding dataset of 21,290 synthetic images; the jointly developed dataset serves as an important testbed for research on synthetic data generation.Abstract
We present WinSyn, a dataset consisting of high-resolution photographs and renderings of 3D models as a testbed for synthetic-to-real research. The dataset consists of 75,739 high-resolution photographs of building windows, including traditional and modern designs, captured globally. These include 89,318 cropped subimages of windows, of which 9,002 are semantically labeled. Further, we present our domain-matched photorealistic procedural model which enables experimentation over a variety of parameter distributions and engineering approaches. Our procedural model provides a second corresponding dataset of 21,290 synthetic images. This jointly developed dataset is designed to facilitate research in the field of synthetic-to-real learning and synthetic data generation. WinSyn allows experimentation into the factors that make it challenging for synthetic data to compete with real-world data. We perform ablations using our synthetic model to identify the salient rendering, materials, and geometric factors pertinent to accuracy within the labeling task. We chose windows as a benchmark because they exhibit a large variability of geometry and materials in their design, making them ideal to study synthetic data generation in a constrained setting. We argue that the dataset is a crucial step to enable future research in synthetic data generation for deep learning.
Factorized Tensor Networks for Multi-Task and Multi-Domain Learning
paper_authors: Yash Garg, Nebiyou Yismaw, Rakib Hyder, Ashley Prater-Bennette, M. Salman Asif
for: FTN is a multi-task and multi-domain learning method that learns multiple tasks and domains with a single unified network.
methods: FTN uses a frozen backbone network and incrementally adds task/domain-specific low-rank tensor factors to the shared frozen network.
results: FTN achieves accuracy comparable to independent single-task/domain networks on multiple target domains and tasks while requiring far fewer additional parameters; experiments on widely used multi-domain and multi-task datasets show comparable accuracy across different convolutional and transformer architectures.Abstract
Multi-task and multi-domain learning methods seek to learn multiple tasks/domains, jointly or one after another, using a single unified network. The key challenge and opportunity is to exploit shared information across tasks and domains to improve the efficiency of the unified network. The efficiency can be in terms of accuracy, storage cost, computation, or sample complexity. In this paper, we propose a factorized tensor network (FTN) that can achieve accuracy comparable to independent single-task/domain networks with a small number of additional parameters. FTN uses a frozen backbone network from a source model and incrementally adds task/domain-specific low-rank tensor factors to the shared frozen network. This approach can adapt to a large number of target domains and tasks without catastrophic forgetting. Furthermore, FTN requires a significantly smaller number of task-specific parameters compared to existing methods. We performed experiments on widely used multi-domain and multi-task datasets. We show the experiments on convolutional-based architecture with different backbones and on transformer-based architecture. We observed that FTN achieves similar accuracy as single-task/domain methods while using only a fraction of additional parameters per task.
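A minimal sketch of the core mechanism for a single linear layer: the backbone weight is frozen, and each task adds a trainable low-rank factor U_t V_t summed onto it at forward time. The rank and the zero-initialization of U (so each task starts exactly at the backbone) are illustrative choices.

```python
import torch
import torch.nn as nn

class FactorizedLinear(nn.Module):
    """Frozen shared weight + per-task trainable low-rank tensor factors."""
    def __init__(self, base: nn.Linear, n_tasks: int, rank: int = 4):
        super().__init__()
        self.weight = base.weight.detach()      # frozen backbone parameters
        self.bias = base.bias.detach()
        out_f, in_f = self.weight.shape
        self.U = nn.Parameter(torch.zeros(n_tasks, out_f, rank))
        self.V = nn.Parameter(torch.randn(n_tasks, rank, in_f) * 0.01)

    def forward(self, x, task: int):
        w = self.weight + self.U[task] @ self.V[task]  # task-specific low-rank shift
        return nn.functional.linear(x, w, self.bias)

base = nn.Linear(128, 64)                 # pretrained source-model layer (stub)
ftn = FactorizedLinear(base, n_tasks=3)
x = torch.randn(8, 128)
y_task0, y_task2 = ftn(x, task=0), ftn(x, task=2)
# Only U and V receive gradients; the shared backbone never changes,
# so adding new tasks cannot catastrophically forget old ones.
```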
QR-Tag: Angular Measurement and Tracking with a QR-Design Marker
results: Simulations show that the proposed method is computationally efficient and has high accuracy in measuring angular information.Abstract
Directional information measurement has many applications in domains such as robotics, virtual and augmented reality, and industrial computer vision. Conventional methods either require pre-calibration or necessitate controlled environments. The state-of-the-art MoireTag approach exploits the Moire effect and QR-design to continuously track the angular shift precisely. However, it is still not a fully QR code design. To overcome the above challenges, we propose a novel snapshot method for discrete angular measurement and tracking with scannable QR-design patterns that are generated by binary structures printed on both sides of a glass plate. The QR codes, resulting from the parallax effect due to the geometry alignment between two layers, can be readily measured as angular information using a phone camera. The simulation results show that the proposed non-contact object tracking framework is computationally efficient with high accuracy.
Developing and Refining a Multifunctional Facial Recognition System for Older Adults with Cognitive Impairments: A Journey Towards Enhanced Quality of Life
results: Implementation and evaluation show that the system can help older adults with daily tasks such as recognizing family and friends, recording daily activities through images and voice memos, and locating misplaced items.Abstract
In an era where the global population is aging significantly, cognitive impairments among the elderly have become a major health concern. The need for effective assistive technologies is clear, and facial recognition systems are emerging as promising tools to address this issue. This document discusses the development and evaluation of a new Multifunctional Facial Recognition System (MFRS), designed specifically to assist older adults with cognitive impairments. The MFRS leverages face_recognition [1], a powerful open-source library capable of extracting, identifying, and manipulating facial features. Our system integrates the face recognition and retrieval capabilities of face_recognition, along with additional functionalities to capture images and record voice memos. This combination of features notably enhances the system's usability and versatility, making it a more user-friendly and universally applicable tool for end-users. The source code for this project can be accessed at https://github.com/Li-8023/Multi-function-face-recognition.git.
Advancing Diagnostic Precision: Leveraging Machine Learning Techniques for Accurate Detection of Covid-19, Pneumonia, and Tuberculosis in Chest X-Ray Images
results: Rigorous testing on publicly available datasets, including the multiclass Kaggle dataset and the NIH dataset, yields AUC values of 0.95 for COVID-19, 0.99 for TB, and 0.98 for pneumonia, along with similarly high recall and precision.Abstract
Lung diseases such as COVID-19, tuberculosis (TB), and pneumonia continue to be serious global health concerns that affect millions of people worldwide. In medical practice, chest X-ray examinations have emerged as the norm for diagnosing diseases, particularly chest infections such as COVID-19. Paramedics and scientists are working intensively to create a reliable and precise approach for early-stage COVID-19 diagnosis in order to save lives. But with a variety of symptoms, medical diagnosis of these disorders poses special difficulties. It is essential to address their identification and timely diagnosis in order to successfully treat and prevent these illnesses. In this research, a multiclass classification approach using state-of-the-art methods for deep learning and image processing is proposed. This method takes into account the robustness and efficiency of the system in order to increase diagnostic precision of chest diseases. A comparison between a brand-new convolution neural network (CNN) and several transfer learning pre-trained models including VGG19, ResNet, DenseNet, EfficientNet, and InceptionNet is recommended. Publicly available and widely used research datasets like Shenzen, Montogomery, the multiclass Kaggle dataset and the NIH dataset were used to rigorously test the model. Recall, precision, F1-score, and Area Under Curve (AUC) score are used to evaluate and compare the performance of the proposed model. An AUC value of 0.95 for COVID-19, 0.99 for TB, and 0.98 for pneumonia is obtained using the proposed network. Recall and precision ratings of 0.95, 0.98, and 0.97, respectively, likewise met high standards.
FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing
paper_authors: Yuren Cong, Mengmeng Xu, Christian Simon, Shoufa Chen, Jiawei Ren, Yanping Xie, Juan-Manuel Perez-Rua, Bodo Rosenhahn, Tao Xiang, Sen He
results: Experimental results show that our method reduces inconsistency and achieves new state-of-the-art performance on existing text-to-video editing benchmarks.Abstract
Text-to-video editing aims to edit the visual appearance of a source video conditional on textual prompts. A major challenge in this task is to ensure that all frames in the edited video are visually consistent. Most recent works apply advanced text-to-image diffusion models to this task by inflating 2D spatial attention in the U-Net into spatio-temporal attention. Although temporal context can be added through spatio-temporal attention, it may introduce some irrelevant information for each patch and therefore cause inconsistency in the edited video. In this paper, for the first time, we introduce optical flow into the attention module in the diffusion model's U-Net to address the inconsistency issue for text-to-video editing. Our method, FLATTEN, enforces the patches on the same flow path across different frames to attend to each other in the attention module, thus improving the visual consistency in the edited videos. Additionally, our method is training-free and can be seamlessly integrated into any diffusion-based text-to-video editing methods and improve their visual consistency. Experiment results on existing text-to-video editing benchmarks show that our proposed method achieves the new state-of-the-art performance. In particular, our method excels in maintaining the visual consistency in the edited videos.
SimPLR: A Simple and Plain Transformer for Object Detection and Segmentation
paper_authors: Duy-Kien Nguyen, Martin R. Oswald, Cees G. M. Snoek
for: This paper aims to improve object detection in images by removing the need for feature pyramids and multi-scale feature maps, which are commonly used in modern object detectors.
methods: The paper proposes a transformer-based detector with scale-aware attention, which allows the detector to operate on single-scale features.
results: The proposed method, called SimPLR, achieves strong performance compared to other object detectors, including end-to-end detectors and plain-backbone detectors, while being faster.Abstract
The ability to detect objects in images at varying scales has played a pivotal role in the design of modern object detectors. Despite considerable progress in removing handcrafted components using transformers, multi-scale feature maps remain a key factor for their empirical success, even with a plain backbone like the Vision Transformer (ViT). In this paper, we show that this reliance on feature pyramids is unnecessary and a transformer-based detector with scale-aware attention enables the plain detector `SimPLR' whose backbone and detection head both operate on single-scale features. The plain architecture allows SimPLR to effectively take advantages of self-supervised learning and scaling approaches with ViTs, yielding strong performance compared to multi-scale counterparts. We demonstrate through our experiments that when scaling to larger backbones, SimPLR indicates better performance than end-to-end detectors (Mask2Former) and plain-backbone detectors (ViTDet), while consistently being faster. The code will be released.
Drivable Avatar Clothing: Faithful Full-Body Telepresence with Dynamic Clothing Driven by Sparse RGB-D Input
results: Experiments show that the N-ICP algorithm efficiently drives coherent and faithful clothing dynamics and appearance, and that the method generalizes to a novel testing environment.Abstract
Clothing is an important part of human appearance but challenging to model in photorealistic avatars. In this work we present avatars with dynamically moving loose clothing that can be faithfully driven by sparse RGB-D inputs as well as body and face motion. We propose a Neural Iterative Closest Point (N-ICP) algorithm that can efficiently track the coarse garment shape given sparse depth input. Given the coarse tracking results, the input RGB-D images are then remapped to texel-aligned features, which are fed into the drivable avatar models to faithfully reconstruct appearance details. We evaluate our method against recent image-driven synthesis baselines, and conduct a comprehensive analysis of the N-ICP algorithm. We demonstrate that our method can generalize to a novel testing environment, while preserving the ability to produce high-fidelity and faithful clothing dynamics and appearance.
CoBEVFusion: Cooperative Perception with LiDAR-Camera Bird’s-Eye View Fusion
results: Experiments on the OPV2V dataset for two perception tasks (BEV semantic segmentation and 3D object detection) show that our DWCA LiDAR-camera fusion model significantly outperforms single-modal models and state-of-the-art BEV fusion models, while the overall cooperative perception architecture, CoBEVFusion, achieves performance comparable to other cooperative perception models.Abstract
Autonomous Vehicles (AVs) use multiple sensors to gather information about their surroundings. By sharing sensor data between Connected Autonomous Vehicles (CAVs), the safety and reliability of these vehicles can be improved through a concept known as cooperative perception. However, recent approaches in cooperative perception only share single sensor information such as cameras or LiDAR. In this research, we explore the fusion of multiple sensor data sources and present a framework, called CoBEVFusion, that fuses LiDAR and camera data to create a Bird's-Eye View (BEV) representation. The CAVs process the multi-modal data locally and utilize a Dual Window-based Cross-Attention (DWCA) module to fuse the LiDAR and camera features into a unified BEV representation. The fused BEV feature maps are shared among the CAVs, and a 3D Convolutional Neural Network is applied to aggregate the features from the CAVs. Our CoBEVFusion framework was evaluated on the cooperative perception dataset OPV2V for two perception tasks: BEV semantic segmentation and 3D object detection. The results show that our DWCA LiDAR-camera fusion model outperforms perception models with single-modal data and state-of-the-art BEV fusion models. Our overall cooperative perception architecture, CoBEVFusion, also achieves comparable performance with other cooperative perception models.
Geom-Erasing: Geometry-Driven Removal of Implicit Concept in Diffusion Models
paper_authors: Zhili Liu, Kai Chen, Yifan Zhang, Jianhua Han, Lanqing Hong, Hang Xu, Zhenguo Li, Dit-Yan Yeung, James Kwok
for: Improving the generation quality of diffusion models fine-tuned on personalized datasets while removing unintentionally learned implicit concepts.
methods: Uses an additional accessible classifier or detector model to encode the geometric information of implicit concepts in images into the text domain.
results: Successfully removes implicit concepts, showing a significant improvement over existing methods.Abstract
Fine-tuning diffusion models through personalized datasets is an acknowledged method for improving generation quality across downstream tasks, which, however, often inadvertently generates unintended concepts such as watermarks and QR codes, attributed to the limitations in image sources and collecting methods within specific downstream tasks. Existing solutions suffer from eliminating these unintentionally learned implicit concepts, primarily due to the dependency on the model's ability to recognize concepts that it actually cannot discern. In this work, we introduce Geom-Erasing, a novel approach that successfully removes the implicit concepts with either an additional accessible classifier or detector model to encode geometric information of these concepts into text domain. Moreover, we propose Implicit Concept, a novel image-text dataset imbued with three implicit concepts (i.e., watermarks, QR codes, and text) for training and evaluation. Experimental results demonstrate that Geom-Erasing not only identifies but also proficiently eradicates implicit concepts, revealing a significant improvement over the existing methods. The integration of geometric information marks a substantial progression in the precise removal of implicit concepts in diffusion models.
Domain-wise Invariant Learning for Panoptic Scene Graph Generation
results: Experiments show that our method significantly improves the performance of benchmark models, achieving new state-of-the-art performance and demonstrating strong generalization and effectiveness on the PSG dataset.Abstract
Panoptic Scene Graph Generation (PSG) involves the detection of objects and the prediction of their corresponding relationships (predicates). However, the presence of biased predicate annotations poses a significant challenge for PSG models, as it hinders their ability to establish a clear decision boundary among different predicates. This issue substantially impedes the practical utility and real-world applicability of PSG models. To address the intrinsic bias above, we propose a novel framework to infer potentially biased annotations by measuring the predicate prediction risks within each subject-object pair (domain), and adaptively transfer the biased annotations to consistent ones by learning invariant predicate representation embeddings. Experiments show that our method significantly improves the performance of benchmark models, achieving a new state-of-the-art performance, and shows great generalization and effectiveness on PSG dataset.
A Real-time Method for Inserting Virtual Objects into Neural Radiance Fields
results: Produces lighting and shadowing effects superior to state-of-the-art techniques, supports real-time insertion of virtual objects, and shows strong potential for augmented reality applications.Abstract
We present the first real-time method for inserting a rigid virtual object into a neural radiance field, which produces realistic lighting and shadowing effects, as well as allows interactive manipulation of the object. By exploiting the rich information about lighting and geometry in a NeRF, our method overcomes several challenges of object insertion in augmented reality. For lighting estimation, we produce accurate, robust and 3D spatially-varying incident lighting that combines the near-field lighting from NeRF and an environment lighting to account for sources not covered by the NeRF. For occlusion, we blend the rendered virtual object with the background scene using an opacity map integrated from the NeRF. For shadows, with a precomputed field of spherical signed distance field, we query the visibility term for any point around the virtual object, and cast soft, detailed shadows onto 3D surfaces. Compared with state-of-the-art techniques, our approach can insert virtual object into scenes with superior fidelity, and has a great potential to be further applied to augmented reality systems.
Revisiting the Temporal Modeling in Spatio-Temporal Predictive Learning under A Unified View
paper_authors: Cheng Tan, Jue Wang, Zhangyang Gao, Siyuan Li, Lirong Wu, Jun Xia, Stan Z. Li
for: This paper revisits spatio-temporal predictive learning, which has wide-ranging applications across diverse fields.
methods: The paper re-examines the two mainstream temporal modeling approaches, recurrent-based and recurrent-free methods, under a unified view.
results: Experiments demonstrate that USTEP (Unified Spatio-TEmporal Predictive learning) achieves significant improvements across a wide range of predictive learning tasks, establishing it as a robust solution.Abstract
Spatio-temporal predictive learning plays a crucial role in self-supervised learning, with wide-ranging applications across a diverse range of fields. Previous approaches for temporal modeling fall into two categories: recurrent-based and recurrent-free methods. The former, while meticulously processing frames one by one, neglect short-term spatio-temporal information redundancies, leading to inefficiencies. The latter naively stack frames sequentially, overlooking the inherent temporal dependencies. In this paper, we re-examine the two dominant temporal modeling approaches within the realm of spatio-temporal predictive learning, offering a unified perspective. Building upon this analysis, we introduce USTEP (Unified Spatio-TEmporal Predictive learning), an innovative framework that reconciles the recurrent-based and recurrent-free methods by integrating both micro-temporal and macro-temporal scales. Extensive experiments on a wide range of spatio-temporal predictive learning demonstrate that USTEP achieves significant improvements over existing temporal modeling approaches, thereby establishing it as a robust solution for a wide range of spatio-temporal applications.
results: Our method achieves high-quality solutions on data while retaining well-posedness and convergent regularization, and our experiments show that it overcomes the numerical issues of previous adversarial methods.Abstract
An emerging new paradigm for solving inverse problems is via the use of deep learning to learn a regularizer from data. This leads to high-quality results, but often at the cost of provable guarantees. In this work, we show how well-posedness and convergent regularization arise within the convex-nonconvex (CNC) framework for inverse problems. We introduce a novel input weakly convex neural network (IWCNN) construction to adapt the method of learned adversarial regularization to the CNC framework. Empirically, we show that our method overcomes numerical issues of previous adversarial methods.
Joint object detection and re-identification for 3D obstacle multi-camera systems
results: Compared with traditional Non-Maximum Suppression (NMS) techniques, the proposed method improves performance by more than 5% for the car category in overlapping areas.Abstract
In recent years, the field of autonomous driving has witnessed remarkable advancements, driven by the integration of a multitude of sensors, including cameras and LiDAR systems, in different prototypes. However, with the proliferation of sensor data comes the pressing need for more sophisticated information processing techniques. This research paper introduces a novel modification to an object detection network that uses camera and lidar information, incorporating an additional branch designed for the task of re-identifying objects across adjacent cameras within the same vehicle while elevating the quality of the baseline 3D object detection outcomes. The proposed methodology employs a two-step detection pipeline: initially, an object detection network is employed, followed by a 3D box estimator that operates on the filtered point cloud generated from the network's detections. Extensive experimental evaluations encompassing both 2D and 3D domains validate the effectiveness of the proposed approach and the results underscore the superiority of this method over traditional Non-Maximum Suppression (NMS) techniques, with an improvement of more than 5\% in the car category in the overlapping areas.
Towards Lossless Dataset Distillation via Difficulty-Aligned Trajectory Matching
paper_authors: Ziyao Guo, Kai Wang, George Cazenavette, Hui Li, Kaipeng Zhang, Yang You
for: Achieving truly lossless dataset distillation, so that a model trained on a small synthetic set performs as well as one trained on the full real dataset.
methods: A trajectory-matching approach that optimizes the synthetic dataset to induce long-term training dynamics similar to real data, aligning the difficulty of the matched trajectory stage with the size of the synthetic dataset.
results: Achieves lossless dataset distillation for the first time, explains why existing methods fail to generate larger high-quality synthetic sets, and scales trajectory-matching methods to larger synthetic datasets.Abstract
The ultimate goal of Dataset Distillation is to synthesize a small synthetic dataset such that a model trained on this synthetic set will perform equally well as a model trained on the full, real dataset. Until now, no method of Dataset Distillation has reached this completely lossless goal, in part due to the fact that previous methods only remain effective when the total number of synthetic samples is extremely small. Since only so much information can be contained in such a small number of samples, it seems that to achieve truly lossless dataset distillation, we must develop a distillation method that remains effective as the size of the synthetic dataset grows. In this work, we present such an algorithm and elucidate why existing methods fail to generate larger, high-quality synthetic sets. Current state-of-the-art methods rely on trajectory-matching, or optimizing the synthetic data to induce similar long-term training dynamics as the real data. We empirically find that the training stage of the trajectories we choose to match (i.e., early or late) greatly affects the effectiveness of the distilled dataset. Specifically, early trajectories (where the teacher network learns easy patterns) work well for a low-cardinality synthetic set since there are fewer examples wherein to distribute the necessary information. Conversely, late trajectories (where the teacher network learns hard patterns) provide better signals for larger synthetic sets since there are now enough samples to represent the necessary complex patterns. Based on our findings, we propose to align the difficulty of the generated patterns with the size of the synthetic dataset. In doing so, we successfully scale trajectory matching-based methods to larger synthetic datasets, achieving lossless dataset distillation for the very first time. Code and distilled datasets are available at https://gzyaftermath.github.io/DATM.
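A minimal sketch of the two ingredients described above: picking the expert-trajectory stage from the synthetic set size (the specific rule below is an invented illustration of "difficulty alignment"), and the normalized parameter-matching loss standard in trajectory-matching distillation.

```python
import torch

def pick_expert_window(ipc, n_checkpoints, early_frac=0.3):
    """Difficulty alignment (illustrative rule): small synthetic sets (low
    images-per-class) match early expert epochs, where the teacher learns easy
    patterns; large ones may match later epochs with harder patterns."""
    frac = early_frac if ipc <= 10 else 0.8
    return torch.randint(0, max(1, int(frac * n_checkpoints)), (1,)).item()

def trajectory_matching_loss(theta_student, theta_start, theta_target):
    """Normalized parameter-matching objective common to trajectory matching."""
    num = (theta_student - theta_target).pow(2).sum()
    den = (theta_start - theta_target).pow(2).sum() + 1e-12
    return num / den

# Toy expert trajectory: flattened parameter checkpoints from real-data training.
expert = [torch.randn(1000) * (1.0 - 0.02 * t) for t in range(50)]
t = pick_expert_window(ipc=50, n_checkpoints=44)     # leave room for the target
theta_start, theta_target = expert[t], expert[t + 5]
# theta_student would come from N optimizer steps on the synthetic set,
# starting at theta_start; a small perturbation stands in for it here.
theta_student = theta_start + 0.1 * torch.randn(1000)
print(trajectory_matching_loss(theta_student, theta_start, theta_target))
```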
3D tomatoes’ localisation with monocular cameras using histogram filters
paper_authors: Sandro Costa Magalhães, Filipe Neves dos Santos, António Paulo Moreira, Jorge Dias
for: Tomatoes position estimation in open-field environments
methods: Histogram Filters (Bayesian Discrete Filters) with square kernel and Gaussian kernel
results: Mean absolute error lower than 10 mm in simulation and 20 mm in a laboratory testbed at an assessing distance of about 0.5 m; the results are viable for real environments but need improvement at closer distances.Abstract
Performing tasks in agriculture, such as fruit monitoring or harvesting, requires perceiving the objects' spatial position. RGB-D cameras are limited under open-field environments due to lightning interferences. Therefore, in this study, we approach the use of Histogram Filters (Bayesian Discrete Filters) to estimate the position of tomatoes in the tomato plant. Two kernel filters were studied: the square kernel and the Gaussian kernel. The implemented algorithm was essayed in simulation, with and without Gaussian noise and random noise, and in a testbed at laboratory conditions. The algorithm reported a mean absolute error lower than 10 mm in simulation and 20 mm in the testbed at laboratory conditions with an assessing distance of about 0.5 m. So, the results are viable for real environments and should be improved at closer distances.
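A minimal sketch of a 1-D histogram (Bayesian discrete) filter over tomato depth, using the two studied kernels, square and Gaussian, as measurement likelihoods; the real system estimates 3-D position, which this reduces to a single axis for clarity, and the noise parameters are illustrative.

```python
import numpy as np

cells = np.linspace(0.2, 1.0, 200)          # candidate tomato distances (m)
belief = np.ones_like(cells) / len(cells)   # uniform prior over the histogram

def gaussian_kernel_update(belief, z, sigma=0.03):
    """Bayesian measurement update with a Gaussian likelihood kernel."""
    lik = np.exp(-0.5 * ((cells - z) / sigma) ** 2)
    belief = belief * lik
    return belief / belief.sum()

def square_kernel_update(belief, z, half_width=0.05):
    """Alternative: square (boxcar) likelihood kernel."""
    lik = (np.abs(cells - z) <= half_width).astype(float) + 1e-9
    belief = belief * lik
    return belief / belief.sum()

rng = np.random.default_rng(6)
true_depth = 0.5
for _ in range(10):                          # fuse ten noisy monocular cues
    z = true_depth + rng.normal(scale=0.03)
    belief = gaussian_kernel_update(belief, z)

estimate = cells[np.argmax(belief)]
print(f"estimated depth = {estimate:.3f} m "
      f"(error {abs(estimate - true_depth) * 1000:.1f} mm)")
```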
Unleashing the power of Neural Collapse for Transferability Estimation
results: Results show that FaCe performs well across different pre-trained classification models, network architectures, source datasets, and training loss functions, achieving state-of-the-art performance on tasks including image classification, semantic segmentation, and text classification, which demonstrates its effectiveness and generalization.Abstract
Transferability estimation aims to provide heuristics for quantifying how suitable a pre-trained model is for a specific downstream task, without fine-tuning them all. Prior studies have revealed that well-trained models exhibit the phenomenon of Neural Collapse. Based on a widely used neural collapse metric in existing literature, we observe a strong correlation between the neural collapse of pre-trained models and their corresponding fine-tuned models. Inspired by this observation, we propose a novel method termed Fair Collapse (FaCe) for transferability estimation by comprehensively measuring the degree of neural collapse in the pre-trained model. Typically, FaCe comprises two different terms: the variance collapse term, which assesses the class separation and within-class compactness, and the class fairness term, which quantifies the fairness of the pre-trained model towards each class. We investigate FaCe on a variety of pre-trained classification models across different network architectures, source datasets, and training loss functions. Results show that FaCe yields state-of-the-art performance on different tasks including image classification, semantic segmentation, and text classification, which demonstrate the effectiveness and generalization of our method.
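The abstract names the two terms without formulas; the sketch below uses the common within-/between-class scatter ratio of penultimate features for the variance-collapse term, and the dispersion of that ratio across classes as a stand-in for the class-fairness term. Both formulas are assumptions for illustration, not FaCe's exact definitions.

```python
import numpy as np

def face_score(features, labels):
    """Variance-collapse term + class-fairness term over penultimate features.
    Lower within/between scatter => stronger neural collapse."""
    classes = np.unique(labels)
    g_mean = features.mean(axis=0)
    per_class_ratio = []
    for c in classes:
        fc = features[labels == c]
        within = ((fc - fc.mean(axis=0)) ** 2).sum(axis=1).mean()
        between = ((fc.mean(axis=0) - g_mean) ** 2).sum() + 1e-12
        per_class_ratio.append(within / between)
    per_class_ratio = np.array(per_class_ratio)
    variance_term = per_class_ratio.mean()   # class separation / compactness
    fairness_term = per_class_ratio.std()    # uneven collapse across classes
    return variance_term, fairness_term

rng = np.random.default_rng(7)
labels = np.repeat(np.arange(5), 100)
means = rng.normal(scale=3.0, size=(5, 64))
feats = means[labels] + rng.normal(scale=0.3, size=(500, 64))
print(face_score(feats, labels))   # lower values suggest better transferability
```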
HyperLips: Hyper Control Lips with High Resolution Decoder for Talking Face Generation
results: Experimental results show that, compared with existing audio-driven talking face generation methods, our approach produces more realistic, higher-fidelity, and better lip-synchronized talking faces.
Abstract
Talking face generation has a wide range of potential applications in the field of virtual digital humans. However, rendering high-fidelity facial video while ensuring lip synchronization is still a challenge for existing audio-driven talking face generation approaches. To address this issue, we propose HyperLips, a two-stage framework consisting of a hypernetwork for controlling lips and a high-resolution decoder for rendering high-fidelity faces. In the first stage, we construct a base face generation network that uses the hypernetwork to control the encoding latent code of the visual face information over audio. First, FaceEncoder is used to obtain a latent code by extracting features from the visual face information taken from the video source containing the face frames. Then, HyperConv, whose weighting parameters are updated by HyperNet with the audio features as input, modifies the latent code to synchronize the lip movement with the audio. Finally, FaceDecoder decodes the modified and synchronized latent code into visual face content. In the second stage, we obtain higher-quality face videos through a high-resolution decoder. To further improve the quality of face generation, we trained a high-resolution decoder, HRDecoder, using face images and detected sketches generated from the first stage as input. Extensive quantitative and qualitative experiments show that our method outperforms state-of-the-art work with more realistic, high-fidelity, and lip-synchronized results. Project page: https://semchan.github.io/HyperLips-Project/
EdVAE: Mitigating Codebook Collapse with Evidential Discrete Variational Autoencoders
results: Our experiments show that the EdVAE model mitigates codebook collapse, improves reconstruction performance, and enhances codebook usage compared to dVAE- and VQ-VAE-based models.
Abstract
Codebook collapse is a common problem in training deep generative models with discrete representation spaces like Vector Quantized Variational Autoencoders (VQ-VAEs). We observe that the same problem arises for the alternatively designed discrete variational autoencoders (dVAEs) whose encoder directly learns a distribution over the codebook embeddings to represent the data. We hypothesize that using the softmax function to obtain a probability distribution causes the codebook collapse by assigning overconfident probabilities to the best matching codebook elements. In this paper, we propose a novel way to incorporate evidential deep learning (EDL) instead of softmax to combat the codebook collapse problem of dVAE. We evidentially monitor the significance of attaining the probability distribution over the codebook embeddings, in contrast to softmax usage. Our experiments using various datasets show that our model, called EdVAE, mitigates codebook collapse while improving the reconstruction performance, and enhances the codebook usage compared to dVAE and VQ-VAE based models.
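As a hedged sketch of the generic evidential deep learning recipe the paper builds on (EdVAE's specific formulation and losses are not reproduced; the softplus evidence mapping and all names are assumptions), a Dirichlet head replaces softmax and exposes an explicit uncertainty, which is what discourages overconfident assignment to a single codebook entry.

```python
import torch
import torch.nn.functional as F

def evidential_probs(logits):
    """Dirichlet-based alternative to softmax (generic EDL recipe).

    logits: (B, K) raw outputs over K codebook entries.
    Returns expected probabilities and a per-sample uncertainty in (0, 1].
    """
    evidence = F.softplus(logits)          # non-negative evidence
    alpha = evidence + 1.0                 # Dirichlet concentration
    strength = alpha.sum(dim=-1, keepdim=True)
    probs = alpha / strength               # expected categorical distribution
    uncertainty = logits.shape[-1] / strength.squeeze(-1)  # K / sum(alpha)
    return probs, uncertainty

logits = torch.randn(4, 512)               # e.g. scores over 512 codebook entries
probs, u = evidential_probs(logits)
print(probs.sum(dim=-1), u)                # rows sum to 1; u near 1 = uncertain
```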
results: Experiments show that Uni3DETR delivers excellent performance in both indoor and outdoor scenes; unlike scene-specific detectors, it exhibits strong generalization ability.
Abstract
Existing point cloud based 3D detectors are designed for the particular scene, either indoor or outdoor ones. Because of the substantial differences in object distribution and point density within point clouds collected from various environments, coupled with the intricate nature of 3D metrics, there is still a lack of a unified network architecture that can accommodate diverse scenes. In this paper, we propose Uni3DETR, a unified 3D detector that addresses indoor and outdoor 3D detection within the same framework. Specifically, we employ the detection transformer with point-voxel interaction for object prediction, which leverages voxel features and points for cross-attention and behaves resistant to the discrepancies from data. We then propose the mixture of query points, which sufficiently exploits global information for dense small-range indoor scenes and local information for large-range sparse outdoor ones. Furthermore, our proposed decoupled IoU provides an easy-to-optimize training target for localization by disentangling the xy and z space. Extensive experiments validate that Uni3DETR exhibits excellent performance consistently on both indoor and outdoor 3D detection. In contrast to previous specialized detectors, which may perform well on some particular datasets but suffer a substantial degradation on different scenes, Uni3DETR demonstrates the strong generalization ability under heterogeneous conditions (Fig. 1). Codes are available at \href{https://github.com/zhenyuw16/Uni3DETR}{https://github.com/zhenyuw16/Uni3DETR}.
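The decoupled IoU disentangles the xy and z spaces; the sketch below illustrates the idea for axis-aligned boxes (the averaging rule, the omission of yaw, and all names are assumptions rather than the paper's exact formulation).

```python
import numpy as np

def interval_iou(a0, a1, b0, b1):
    """IoU of two 1D intervals [a0, a1] and [b0, b1]."""
    inter = max(0.0, min(a1, b1) - max(a0, b0))
    union = (a1 - a0) + (b1 - b0) - inter
    return inter / max(union, 1e-12)

def decoupled_iou(box_a, box_b):
    """Illustrative decoupled IoU for axis-aligned boxes (x0,y0,z0,x1,y1,z1).

    The xy overlap and z overlap are computed separately and averaged,
    avoiding the product coupling of a full 3D IoU. This mirrors the idea
    of disentangling xy and z, not the paper's exact formula.
    """
    ax0, ay0, az0, ax1, ay1, az1 = box_a
    bx0, by0, bz0, bx1, by1, bz1 = box_b
    inter_xy = (max(0.0, min(ax1, bx1) - max(ax0, bx0)) *
                max(0.0, min(ay1, by1) - max(ay0, by0)))
    union_xy = ((ax1 - ax0) * (ay1 - ay0) +
                (bx1 - bx0) * (by1 - by0) - inter_xy)
    iou_xy = inter_xy / max(union_xy, 1e-12)
    iou_z = interval_iou(az0, az1, bz0, bz1)
    return 0.5 * (iou_xy + iou_z)

print(decoupled_iou((0, 0, 0, 2, 2, 2), (1, 1, 0.5, 3, 3, 2.5)))
```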
Combining recurrent and residual learning for deforestation monitoring using multitemporal SAR images
results: The experimental analysis shows that multitemporal SAR data improves deforestation detection. In particular, the RRCNN-1 model achieved the highest accuracy among all tested networks while halving the processing time.
Abstract
With its vast expanse, more than twice that of Western Europe, the Amazon rainforest stands as the largest forest on Earth, holding immense importance in global climate regulation. Yet, deforestation detection from remote sensing data in this region poses a critical challenge, often hindered by the persistent cloud cover that obscures optical satellite data for much of the year. Addressing this need, this paper proposes three deep-learning models tailored for deforestation monitoring, utilizing SAR (Synthetic Aperture Radar) multitemporal data, motivated by its independence from atmospheric conditions. Specifically, the study proposes three novel recurrent fully convolutional network architectures, namely RRCNN-1, RRCNN-2, and RRCNN-3, crafted to enhance the accuracy of deforestation detection. Additionally, this research explores replacing bitemporal with multitemporal SAR sequences, motivated by the hypothesis that deforestation signs quickly fade in SAR images over time. A comprehensive assessment of the proposed approaches was conducted using a Sentinel-1 multitemporal sequence from a sample site in the Brazilian rainforest. The experimental analysis confirmed that analyzing a sequence of SAR images over an observation period can reveal deforestation spots undetectable in a pair of images. Notably, experimental results underscored the superiority of the multitemporal approach, yielding approximately a five percent enhancement in F1-Score across all tested network architectures. In particular, the RRCNN-1 achieved the highest accuracy while requiring half the processing time of its closest counterpart.
Climate-sensitive Urban Planning through Optimization of Tree Placements
paper_authors: Simon Schrodi, Ferdinand Briegel, Max Argus, Andreas Christen, Thomas Brox
for: Mitigating heat stress in urban areas through optimal placement of urban trees.
methods: Using neural networks to simulate point-wise mean radiant temperatures and an iterated local search framework with tailored adaptations to optimize tree placements.
results: Empirical efficacy of the approach across a wide spectrum of study areas and time scales, demonstrating the potential of urban trees to mitigate heat stress.
Abstract
Climate change is increasing the intensity and frequency of many extreme weather events, including heatwaves, which results in increased thermal discomfort and mortality rates. While global mitigation action is undoubtedly necessary, so is climate adaptation, e.g., through climate-sensitive urban planning. Among the most promising strategies is harnessing the benefits of urban trees in shading and cooling pedestrian-level environments. Our work investigates the challenge of optimal placement of such trees. Physical simulations can estimate the radiative and thermal impact of trees on human thermal comfort but induce high computational costs. This rules out optimization of tree placements over large areas and considering effects over longer time scales. Hence, we employ neural networks to simulate the point-wise mean radiant temperatures--a driving factor of outdoor human thermal comfort--across various time scales, spanning from daily variations to extended time scales of heatwave events and even decades. To optimize tree placements, we harness the innate local effect of trees within the iterated local search framework with tailored adaptations. We show the efficacy of our approach across a wide spectrum of study areas and time scales. We believe that our approach is a step towards empowering decision-makers, urban designers and planners to proactively and effectively assess the potential of urban trees to mitigate heat stress.
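As a hedged sketch of the optimization loop (the neural surrogate for mean radiant temperature is replaced by a made-up spread-maximizing cost, and all names are assumptions), iterated local search alternates greedy local moves of individual trees with random perturbation restarts.

```python
import random

# Toy iterated local search for placing K trees on a grid.
# `heat_cost` stands in for the neural surrogate of mean radiant temperature;
# everything here is an illustrative assumption, not the paper's code.

GRID, K = 20, 5

def heat_cost(placement):
    # Made-up cost: trees reduce heat most when spread apart.
    return -sum(abs(a[0] - b[0]) + abs(a[1] - b[1])
                for i, a in enumerate(placement) for b in placement[i + 1:])

def local_search(placement):
    improved = True
    while improved:
        improved = False
        for i in range(K):
            for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)]:
                cand = list(placement)
                x, y = cand[i]
                cand[i] = (min(max(x + dx, 0), GRID - 1),
                           min(max(y + dy, 0), GRID - 1))
                if heat_cost(cand) < heat_cost(placement):
                    placement, improved = cand, True
    return placement

def perturb(placement):
    cand = list(placement)
    i = random.randrange(K)
    cand[i] = (random.randrange(GRID), random.randrange(GRID))
    return cand

best = local_search([(random.randrange(GRID), random.randrange(GRID))
                     for _ in range(K)])
for _ in range(30):  # iterated restarts from perturbed solutions
    cand = local_search(perturb(best))
    if heat_cost(cand) < heat_cost(best):
        best = cand
print("best placement:", best, "cost:", heat_cost(best))
```

In the paper's setting, the cost would instead be the network's estimate of mean radiant temperature aggregated over the study area and time scale of interest.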
Analysis of Rainfall Variability and Water Extent of Selected Hydropower Reservoir Using Google Earth Engine (GEE): A Case Study from Two Tropical Countries, Sri Lanka and Vietnam
paper_authors: Punsisi Rajakaruna, Surajit Ghosh, Bunyod Holmatov
for: This study investigates the relationship between rainfall patterns and the water extent of selected hydropower reservoirs in two tropical monsoon countries, Vietnam and Sri Lanka.
methods: The study uses high-resolution optical imagery and Sentinel-1 Synthetic Aperture Radar (SAR) data to observe and monitor water bodies under different weather conditions, especially during the monsoon season. Interannual variations in rainfall are analyzed with the Climate Hazards Group InfraRed Precipitation with Station (CHIRPS) dataset, and water extents are derived for the selected reservoir areas.
results: Monsoon-season rainfall leads to increases in reservoir water extent, while rainfall outside the monsoon season corresponds to reduced water extent. These results illustrate how rainfall patterns affect reservoir water resources and can support the two countries' decision-making on hydropower, flood management, and irrigation.
Abstract
This study presents a comprehensive remote sensing analysis of rainfall patterns and selected hydropower reservoir water extent in two tropical monsoon countries, Vietnam and Sri Lanka. The aim is to understand the relationship between remotely sensed rainfall data and the dynamic changes (monthly) in reservoir water extent. The analysis utilizes high-resolution optical imagery and Sentinel-1 Synthetic Aperture Radar (SAR) data to observe and monitor water bodies during different weather conditions, especially during the monsoon season. The average annual rainfall for both countries is determined, and spatiotemporal variations in monthly average rainfall are examined at regional and reservoir basin levels using the Climate Hazards Group InfraRed Precipitation with Station (CHIRPS) dataset from 1981 to 2022. Water extents are derived for selected reservoirs using Sentinel-1 SAR Ground Range Detected (GRD) images in Vietnam and Sri Lanka from 2017 to 2022. The images are pre-processed and corrected using terrain correction and refined Lee filter. An automated thresholding algorithm, OTSU, distinguishes water and land, taking advantage of both VV and VH polarization data. The connected pixel count threshold is applied to enhance result accuracy. The results indicate a clear relationship between rainfall patterns and reservoir water extent, with increased precipitation during the monsoon season leading to higher water extents in the later months. This study contributes to understanding how rainfall variability impacts reservoir water resources in tropical monsoon regions. The preliminary findings can inform water resource management strategies and support these countries' decision-making processes related to hydropower generation, flood management, and irrigation.
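The OTSU step is a standard histogram-based threshold; the minimal sketch below (the synthetic backscatter values and names are assumptions) shows how a single SAR band could be split into water and land, with the connected-pixel-count filter applied afterwards as a cleanup step.

```python
import numpy as np

def otsu_threshold(values, bins=256):
    """Classic Otsu threshold: maximise between-class variance of a histogram."""
    hist, edges = np.histogram(values, bins=bins)
    hist = hist.astype(float) / hist.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])
    best_t, best_var = centers[0], -1.0
    for i in range(1, bins):
        w0, w1 = hist[:i].sum(), hist[i:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (hist[:i] * centers[:i]).sum() / w0
        mu1 = (hist[i:] * centers[i:]).sum() / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, centers[i]
    return best_t

# Synthetic stand-in for a Sentinel-1 backscatter band (dB); water is darker.
rng = np.random.default_rng(0)
band = np.concatenate([rng.normal(-20, 1.5, 5000),    # water-like pixels
                       rng.normal(-10, 2.0, 15000)])  # land-like pixels
t = otsu_threshold(band)
water_mask = band < t  # below-threshold backscatter labelled as water
print(f"threshold = {t:.2f} dB, water fraction = {water_mask.mean():.2f}")
```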
Anchor-Intermediate Detector: Decoupling and Coupling Bounding Boxes for Accurate Object Detection
results: Experimental results on the MS COCO test-dev dataset show that, without any additional configuration, the model improves over the baseline RetinaNet and GFL methods by $\sim$2.4 and $\sim$1.2 AP, respectively.
Abstract
Anchor-based detectors have been continuously developed for object detection. However, the individual anchor box makes it difficult to predict the boundary's offset accurately. Instead of taking each bounding box as a closed individual, we consider using multiple boxes together to get prediction boxes. To this end, this paper proposes the \textbf{Box Decouple-Couple(BDC) strategy} in the inference, which no longer discards the overlapping boxes, but decouples the corner points of these boxes. Then, according to each corner's score, we couple the corner points to select the most accurate corner pairs. To meet the BDC strategy, a simple but novel model is designed named the \textbf{Anchor-Intermediate Detector(AID)}, which contains two head networks, i.e., an anchor-based head and an anchor-free \textbf{Corner-aware head}. The corner-aware head is able to score the corners of each bounding box to facilitate the coupling between corner points. Extensive experiments on MS COCO show that the proposed anchor-intermediate detector respectively outperforms their baseline RetinaNet and GFL method by $\sim$2.4 and $\sim$1.2 AP on the MS COCO test-dev dataset without any bells and whistles. Code is available at: https://github.com/YilongLv/AID.
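The Box Decouple-Couple step can be illustrated with a toy sketch (corner scores are given directly here; in AID they would come from the corner-aware head, and all names are assumptions): corners of overlapping candidate boxes are decoupled, and the best-scoring top-left/bottom-right pair is coupled into the final box.

```python
import numpy as np

def box_decouple_couple(boxes, tl_scores, br_scores):
    """Toy Box Decouple-Couple: keep overlapping boxes, split their corners,
    then couple the best-scoring top-left and bottom-right corners.

    boxes:     (N, 4) overlapping candidates as (x1, y1, x2, y2)
    tl_scores: (N,) scores for each box's top-left corner
    br_scores: (N,) scores for each box's bottom-right corner
    """
    tl = boxes[np.argmax(tl_scores), :2]   # best top-left corner
    br = boxes[np.argmax(br_scores), 2:]   # best bottom-right corner
    return np.concatenate([tl, br])

boxes = np.array([[10, 12, 50, 48],
                  [ 8, 10, 52, 50],
                  [11,  9, 49, 51]], dtype=float)
tl_scores = np.array([0.7, 0.9, 0.6])
br_scores = np.array([0.5, 0.4, 0.8])
print(box_decouple_couple(boxes, tl_scores, br_scores))  # [ 8. 10. 49. 51.]
```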
Exploiting Manifold Structured Data Priors for Improved MR Fingerprinting Reconstruction
results: Experimental results show that our method reduces computation time in non-Cartesian sampling scenarios and improves reconstruction performance, with significant gains over state-of-the-art methods.
Abstract
Estimating tissue parameter maps with high accuracy and precision from highly undersampled measurements presents one of the major challenges in MR fingerprinting (MRF). Many existing works project the recovered voxel fingerprints onto the Bloch manifold to improve reconstruction performance. However, little research focuses on exploiting the latent manifold structure priors among fingerprints. To fill this gap, we propose a novel MRF reconstruction framework based on manifold structured data priors. Since it is difficult to directly estimate the fingerprint manifold structure, we model the tissue parameters as points on a low-dimensional parameter manifold. We reveal that the fingerprint manifold shares the same intrinsic topology as the parameter manifold, although being embedded in different Euclidean spaces. To exploit the non-linear and non-local redundancies in MRF data, we divide the MRF data into spatial patches, and the similarity measurement among data patches can be accurately obtained using the Euclidean distance between the corresponding patches in the parameter manifold. The measured similarity is then used to construct the graph Laplacian operator, which represents the fingerprint manifold structure. Thus, the fingerprint manifold structure is introduced in the reconstruction framework by using the low-dimensional parameter manifold. Additionally, we incorporate the locally low-rank prior in the reconstruction framework to further utilize the local correlations within each patch for improved reconstruction performance. We also adopt a GPU-accelerated NUFFT library to accelerate reconstruction in non-Cartesian sampling scenarios. Experimental results demonstrate that our method can achieve significantly improved reconstruction performance with reduced computational time over the state-of-the-art methods.
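As a hedged sketch of the manifold prior (the parameter values, neighbour count, and names are assumptions), patch similarities measured on the low-dimensional parameter manifold can be assembled into a graph Laplacian whose quadratic form penalizes reconstructions that vary across similar patches.

```python
import numpy as np

def graph_laplacian(patch_params, sigma=1.0, k=4):
    """Build a graph Laplacian from patch coordinates on a parameter manifold.

    patch_params: (P, d) per-patch tissue-parameter summaries (e.g. T1/T2).
    Edges connect each patch to its k nearest neighbours with Gaussian
    weights; L = D - W encodes the manifold structure. Illustrative only.
    """
    dists = np.linalg.norm(patch_params[:, None] - patch_params[None], axis=-1)
    W = np.zeros_like(dists)
    for i in range(dists.shape[0]):
        nn = np.argsort(dists[i])[1:k + 1]            # skip self
        W[i, nn] = np.exp(-dists[i, nn] ** 2 / (2 * sigma ** 2))
    W = np.maximum(W, W.T)                            # symmetrize
    return np.diag(W.sum(axis=1)) - W                 # L = D - W

params = np.random.default_rng(0).normal(size=(50, 2))  # toy (T1, T2) pairs
L = graph_laplacian(params)
x = np.random.default_rng(1).normal(size=50)
print("smoothness penalty x^T L x =", float(x @ L @ x))  # used as a regularizer
```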
Diagnosing Catastrophe: Large parts of accuracy loss in continual learning can be accounted for by readout misalignment
paper_authors: Daniel Anthes, Sushrut Thorat, Peter König, Tim C. Kietzmann
for: This paper investigates the representational changes that underlie the phenomenon of catastrophic forgetting in artificial neural networks (ANNs) when trained on changing data distributions.
methods: The paper uses a combination of theoretical analysis and experimental studies to identify the three distinct processes that contribute to catastrophic forgetting.
results: The study finds that the largest component of catastrophic forgetting is a misalignment between hidden representations and readout layers, which causes internal representations to shift. Additionally, the study shows that representational geometry is partially conserved under this misalignment, but a small part of the information is irrecoverably lost. The findings have implications for deep learning applications that need to be continuously updated.
Abstract
Unlike primates, training artificial neural networks on changing data distributions leads to a rapid decrease in performance on old tasks. This phenomenon is commonly referred to as catastrophic forgetting. In this paper, we investigate the representational changes that underlie this performance decrease and identify three distinct processes that together account for the phenomenon. The largest component is a misalignment between hidden representations and readout layers. Misalignment occurs due to learning on additional tasks and causes internal representations to shift. Representational geometry is partially conserved under this misalignment and only a small part of the information is irrecoverably lost. All types of representational changes scale with the dimensionality of hidden representations. These insights have implications for deep learning applications that need to be continuously updated, but may also aid aligning ANN models to the rather robust biological vision.
High Accuracy and Cost-Saving Active Learning 3D WD-UNet for Airway Segmentation
results: For medical segmentation of 3D lung airway CT scans, using an uncertainty metric (parametrized as an input of the query strategy) yields more accurate predictions than state-of-the-art supervised deep learning models such as 3D UNet and 3D CEUNet. Compared with these, WD-UNet saves both radiologists' annotation cost and computational resources, achieving better prediction metrics with a limited amount of annotated data (35% of the total).
Abstract
We propose a novel Deep Active Learning (DeepAL) model, the 3D Wasserstein Discriminative UNet (WD-UNet), for reducing the annotation effort of medical 3D Computed Tomography (CT) segmentation. The proposed WD-UNet learns in a semi-supervised way and accelerates learning convergence to meet or exceed the prediction metrics of supervised learning models. Our method can be embedded with different Active Learning (AL) strategies and different network structures. The model is evaluated on 3D lung airway CT scans for medical segmentation and shows that the use of an uncertainty metric, which is parametrized as an input of the query strategy, leads to more accurate prediction results than some state-of-the-art Deep Learning (DL) supervised models, e.g., 3D UNet and 3D CEUNet. Compared to the above supervised DL methods, our WD-UNet not only saves the cost of annotation for radiologists but also saves computational resources. WD-UNet uses a limited amount of annotated data (35% of the total) to achieve better predictive metrics with a more efficient deep learning model algorithm.
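As a hedged sketch of an uncertainty-driven query strategy (the paper's specific uncertainty metric and Wasserstein components are not reproduced; all names are assumptions), unlabeled scans can be ranked by mean voxel-wise predictive entropy and the most uncertain ones sent to the annotator.

```python
import numpy as np

def predictive_entropy(probs):
    """Mean voxel-wise entropy of softmax outputs as a scan-level uncertainty.

    probs: (N, C, V) class probabilities for N unlabeled scans over V voxels.
    """
    eps = 1e-12
    ent = -(probs * np.log(probs + eps)).sum(axis=1)  # (N, V)
    return ent.mean(axis=1)                           # (N,)

def query(probs, budget):
    """Return indices of the `budget` most uncertain scans to annotate."""
    return np.argsort(-predictive_entropy(probs))[:budget]

rng = np.random.default_rng(0)
logits = rng.normal(size=(10, 2, 1000))               # 10 scans, 2 classes
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print("scans to annotate next:", query(probs, budget=3))
```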
results: The framework significantly improves expressive power over previous generalizable INRs and validates the usefulness of the locality-aware latents for downstream tasks such as image generation.
Abstract
Generalizable implicit neural representation (INR) enables a single continuous function, i.e., a coordinate-based neural network, to represent multiple data instances by modulating its weights or intermediate features using latent codes. However, the expressive power of the state-of-the-art modulation is limited due to its inability to localize and capture fine-grained details of data entities such as specific pixels and rays. To address this issue, we propose a novel framework for generalizable INR that combines a transformer encoder with a locality-aware INR decoder. The transformer encoder predicts a set of latent tokens from a data instance to encode local information into each latent token. The locality-aware INR decoder extracts a modulation vector by selectively aggregating the latent tokens via cross-attention for a coordinate input and then predicts the output by progressively decoding with coarse-to-fine modulation through multiple frequency bandwidths. The selective token aggregation and the multi-band feature modulation enable us to learn locality-aware representation in spatial and spectral aspects, respectively. Our framework significantly outperforms previous generalizable INRs and validates the usefulness of the locality-aware latents for downstream tasks such as image generation.
ASM: Adaptive Sample Mining for In-The-Wild Facial Expression Recognition
results: Experiments show that the method effectively mines both ambiguous and noisy samples and outperforms state-of-the-art (SOTA) methods on both synthetic noisy and original datasets.
Abstract
Given the similarity between facial expression categories, the presence of compound facial expressions, and the subjectivity of annotators, facial expression recognition (FER) datasets often suffer from ambiguity and noisy labels. Ambiguous expressions are challenging to differentiate from expressions with noisy labels, which hurt the robustness of FER models. Furthermore, the difficulty of recognition varies across different expression categories, rendering a uniform approach unfair for all expressions. In this paper, we introduce a novel approach called Adaptive Sample Mining (ASM) to dynamically address ambiguity and noise within each expression category. First, the Adaptive Threshold Learning module generates two thresholds, namely the clean and noisy thresholds, for each category. These thresholds are based on the mean class probabilities at each training epoch. Next, the Sample Mining module partitions the dataset into three subsets: clean, ambiguity, and noise, by comparing the sample confidence with the clean and noisy thresholds. Finally, the Tri-Regularization module employs a mutual learning strategy for the ambiguity subset to enhance discrimination ability, and an unsupervised learning strategy for the noise subset to mitigate the impact of noisy labels. Extensive experiments prove that our method can effectively mine both ambiguity and noise, and outperform SOTA methods on both synthetic noisy and original datasets. The supplement material is available at https://github.com/zzzzzzyang/ASM.
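The partitioning step can be sketched directly from the description above; note that the paper only states the thresholds are based on mean class probabilities per epoch, so the fixed margin below is an invented stand-in and all names are assumptions.

```python
import numpy as np

def partition_samples(confidences, labels, margin=0.1):
    """Toy adaptive sample mining: per-class clean/noisy thresholds.

    confidences: (N,) probability the model assigns to each sample's label.
    labels:      (N,) expression class ids.
    Clean/noisy thresholds are set around each class's mean confidence
    (ASM's exact recipe may differ; this illustrates the partition).
    Returns a tag per sample: 'clean', 'ambiguous', or 'noisy'.
    """
    tags = np.empty(len(labels), dtype=object)
    for c in np.unique(labels):
        idx = labels == c
        mean_p = confidences[idx].mean()
        clean_t, noisy_t = mean_p + margin, mean_p - margin
        conf_c = confidences[idx]
        tags[idx] = np.where(conf_c >= clean_t, "clean",
                     np.where(conf_c <= noisy_t, "noisy", "ambiguous"))
    return tags

rng = np.random.default_rng(0)
conf = rng.uniform(0.2, 1.0, size=12)
labels = rng.integers(0, 3, size=12)
print(list(zip(labels, conf.round(2), partition_samples(conf, labels))))
```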
Care3D: An Active 3D Object Detection Dataset of Real Robotic-Care Environments
paper_authors: Michael G. Adam, Sebastian Eger, Martin Piccolrovazzi, Maged Iskandar, Joern Vogel, Alexander Dietrich, Seongjien Bien, Jon Skerlj, Abdeldjallil Naceri, Eckehard Steinbach, Alin Albu-Schaeffer, Sami Haddadin, Wolfram Burgard
for: To help address the labor shortage in the health sector, this short paper provides an annotated dataset to support the development of assistive robotics.
methods: The paper uses data captured in real environments that are already in use in robotic health care research and describes them.
results: The paper provides a reliable dataset, including ground truth within one room, for evaluating SLAM algorithms running directly on a health care robot.
Abstract
As labor shortage increases in the health sector, the demand for assistive robotics grows. However, the needed test data to develop those robots is scarce, especially for the application of active 3D object detection, where no real data exists at all. This short paper counters this by introducing such an annotated dataset of real environments. The captured environments represent areas which are already in use in the field of robotic health care research. We further provide ground truth data within one room, for assessing SLAM algorithms running directly on a health care robot.
Perceptual Artifacts Localization for Image Synthesis Tasks
paper_authors: Lingzhi Zhang, Zhengjie Xu, Connelly Barnes, Yuqian Zhou, Qing Liu, He Zhang, Sohrab Amirghodsi, Zhe Lin, Eli Shechtman, Jianbo Shi
for: This paper studies perceptual artifacts in images produced by generative models and how to automatically repair them.
methods: The paper introduces a new dataset and proposes a segmentation-based method for generating perceptual artifact maps.
results: Experiments show that the method effectively detects and inpaints perceptual artifacts in generated images and adapts to different image generation models.
Abstract
Recent advancements in deep generative models have facilitated the creation of photo-realistic images across various tasks. However, these generated images often exhibit perceptual artifacts in specific regions, necessitating manual correction. In this study, we present a comprehensive empirical examination of Perceptual Artifacts Localization (PAL) spanning diverse image synthesis endeavors. We introduce a novel dataset comprising 10,168 generated images, each annotated with per-pixel perceptual artifact labels across ten synthesis tasks. A segmentation model, trained on our proposed dataset, effectively localizes artifacts across a range of tasks. Additionally, we illustrate its proficiency in adapting to previously unseen models using minimal training samples. We further propose an innovative zoom-in inpainting pipeline that seamlessly rectifies perceptual artifacts in the generated images. Through our experimental analyses, we elucidate several practical downstream applications, such as automated artifact rectification, non-referential image quality evaluation, and abnormal region detection in images. The dataset and code are released.
A review of uncertainty quantification in medical image analysis: probabilistic and non-probabilistic methods
results: The article provides a comprehensive review of methods for quantifying the uncertainty of machine learning models across a variety of medical image tasks.
Abstract
The comprehensive integration of machine learning healthcare models within clinical practice remains suboptimal, notwithstanding the proliferation of high-performing solutions reported in the literature. A predominant factor hindering widespread adoption pertains to an insufficiency of evidence affirming the reliability of the aforementioned models. Recently, uncertainty quantification methods have been proposed as a potential solution to quantify the reliability of machine learning models and thus increase the interpretability and acceptability of the result. In this review, we offer a comprehensive overview of prevailing methods proposed to quantify uncertainty inherent in machine learning models developed for various medical image tasks. Contrary to earlier reviews that exclusively focused on probabilistic methods, this review also explores non-probabilistic approaches, thereby furnishing a more holistic survey of research pertaining to uncertainty quantification for machine learning models. Analysis of medical images with the summary and discussion on medical applications and the corresponding uncertainty evaluation protocols are presented, which focus on the specific challenges of uncertainty in medical image analysis. We also highlight some potential future research work at the end. Generally, this review aims to allow researchers from both clinical and technical backgrounds to gain a quick and yet in-depth understanding of the research in uncertainty quantification for medical image analysis machine learning models.
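Among the probabilistic methods such a review typically covers, Monte Carlo dropout is a common baseline; the sketch below is a generic illustration (the tiny network and sample count are assumptions, not tied to any model in the review) of estimating per-pixel predictive uncertainty by keeping dropout active at test time.

```python
import torch
import torch.nn as nn

# Minimal Monte Carlo dropout sketch for segmentation uncertainty.
# The tiny model is a stand-in; any network with dropout layers works.

model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
    nn.Dropout2d(p=0.5),
    nn.Conv2d(8, 2, 1),            # 2-class segmentation logits
)

def mc_dropout_predict(model, x, n_samples=20):
    """Run n stochastic forward passes with dropout enabled."""
    model.train()                  # keep dropout active at inference
    with torch.no_grad():
        probs = torch.stack([model(x).softmax(dim=1) for _ in range(n_samples)])
    mean = probs.mean(dim=0)       # predictive mean
    var = probs.var(dim=0)         # per-pixel, per-class variance = uncertainty
    return mean, var

x = torch.randn(1, 1, 64, 64)      # toy single-channel medical image
mean, var = mc_dropout_predict(model, x)
print(mean.shape, var.mean().item())
```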
A Simple and Robust Framework for Cross-Modality Medical Image Segmentation applied to Vision Transformers
paper_authors: Matteo Bastico, David Ryckelynck, Laurent Corté, Yannick Tillier, Etienne Decencière
for: This paper aims to address the challenge of cross-modality image segmentation, where a single model can perform well on multiple types of images.
methods: The proposed method uses a simple framework that adapts normalization layers based on the input type, trained with non-registered interleaved mixed data.
results: The proposed method outperforms other cross-modality segmentation methods on the Multi-Modality Whole Heart Segmentation Challenge, with an improvement of up to 6.87% in Dice accuracy. Additionally, the proposed Conditional Vision Transformer (C-ViT) encoder brings significant improvements to the resulting segmentation.
Abstract
When it comes to clinical images, automatic segmentation has a wide variety of applications and a considerable diversity of input domains, such as different types of Magnetic Resonance Images (MRIs) and Computerized Tomography (CT) scans. This heterogeneity is a challenge for cross-modality algorithms that should equally perform independently of the input image type fed to them. Often, segmentation models are trained using a single modality, preventing generalization to other types of input data without resorting to transfer learning techniques. Furthermore, the multi-modal or cross-modality architectures proposed in the literature frequently require registered images, which are not easy to collect in clinical environments, or need additional processing steps, such as synthetic image generation. In this work, we propose a simple framework to achieve fair image segmentation of multiple modalities using a single conditional model that adapts its normalization layers based on the input type, trained with non-registered interleaved mixed data. We show that our framework outperforms other cross-modality segmentation methods, when applied to the same 3D UNet baseline model, on the Multi-Modality Whole Heart Segmentation Challenge. Furthermore, we define the Conditional Vision Transformer (C-ViT) encoder, based on the proposed cross-modality framework, and we show that it brings significant improvements to the resulting segmentation, up to 6.87\% of Dice accuracy, with respect to its baseline reference. The code to reproduce our experiments and the trained model weights are available at https://github.com/matteo-bastico/MI-Seg.
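The modality-adaptive normalization can be sketched as a module that holds one set of normalization layers per input type and indexes into them at forward time; this is an assumption-laden illustration (the InstanceNorm choice and all names are not from the paper).

```python
import torch
import torch.nn as nn

class ConditionalNorm(nn.Module):
    """One normalization layer per modality, selected by an input type id.

    Illustrative sketch of modality-adaptive normalization; the paper's
    conditional model may differ in which layers are switched.
    """
    def __init__(self, num_features, num_modalities):
        super().__init__()
        self.norms = nn.ModuleList(
            nn.InstanceNorm3d(num_features, affine=True)
            for _ in range(num_modalities)
        )

    def forward(self, x, modality_id):
        return self.norms[modality_id](x)

norm = ConditionalNorm(num_features=16, num_modalities=2)  # e.g. MRI=0, CT=1
x = torch.randn(2, 16, 8, 32, 32)  # (batch, channels, D, H, W)
y_mri = norm(x, modality_id=0)     # MRI-specific statistics and affine params
y_ct = norm(x, modality_id=1)      # CT-specific statistics and affine params
print(y_mri.shape, y_ct.shape)
```

The design choice here is that all convolutional weights stay shared across modalities, so only the normalization statistics specialize per input type.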
M3FPolypSegNet: Segmentation Network with Multi-frequency Feature Fusion for Polyp Localization in Colonoscopy Images
paper_authors: Ju-Hyeon Nam, Seo-Hyeong Park, Nur Suriza Syazwany, Yerim Jung, Yu-Han Im, Sang-Chul Lee
for: To reduce the risk of colorectal cancer through automatic segmentation of polyps.
methods: Deep learning; specifically, a novel frequency-based fully convolutional neural network (M3FPolypSegNet) that decomposes the input image into low/high/full-frequency components and uses multiple independent multi-frequency encoders to map the input image into a high-dimensional feature space.
results: Average performance gains of 6.92% and 7.52% across all metrics, indicating that the proposed model outperformed various segmentation models.
Abstract
Polyp segmentation is crucial for preventing colorectal cancer, a common type of cancer. Deep learning has been used to segment polyps automatically, which reduces the risk of misdiagnosis. Localizing small polyps in colonoscopy images is challenging because of their complex characteristics, such as color, occlusion, and the various shapes of polyps. To address this challenge, a novel frequency-based fully convolutional neural network, the Multi-Frequency Feature Fusion Polyp Segmentation Network (M3FPolypSegNet), was proposed to decompose the input image into low/high/full-frequency components and exploit the characteristics of each component. We used three independent multi-frequency encoders to map multiple input images into a high-dimensional feature space. In the Frequency-ASPP Scalable Attention Module (F-ASPP SAM), ASPP was applied between each frequency component to preserve scale information. Subsequently, scalable attention was applied to emphasize polyp regions in the high-dimensional feature space. Finally, we designed three multi-task learning objectives (i.e., region, edge, and distance) in four decoder blocks to learn the structural characteristics of the region. The proposed model outperformed various segmentation models with performance gains of 6.92% and 7.52% on average for all metrics on CVC-ClinicDB and BKAI-IGH-NeoPolyp, respectively.
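The frequency decomposition can be illustrated with a simple FFT-based split (the radial cutoff and names are assumptions; the network's encoders are not reproduced): a low-pass mask in the Fourier domain yields the low-frequency component, the residual is the high-frequency component, and the original image serves as the full-frequency input.

```python
import numpy as np

def frequency_split(image, cutoff=0.1):
    """Split an image into low- and high-frequency components via the FFT.

    cutoff: radius (fraction of the spectrum) kept as 'low frequency'.
    The full-frequency component is the original image itself.
    """
    f = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    yy, xx = np.ogrid[:h, :w]
    radius = np.hypot(yy - h / 2, xx - w / 2)
    low_mask = radius <= cutoff * min(h, w)
    low = np.fft.ifft2(np.fft.ifftshift(f * low_mask)).real
    high = image - low  # residual carries edges and fine texture
    return low, high

img = np.random.default_rng(0).random((128, 128))  # stand-in colonoscopy frame
low, high = frequency_split(img)
print(np.allclose(low + high, img))  # True: the two components sum back exactly
```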
Bi-directional Deformation for Parameterization of Neural Implicit Surfaces
results: Our method renders edited textures immediately through volume rendering, without retraining the neural network. It also supports co-parameterization of multiple objects and texture transfer between them. We conduct experiments on images of human heads and man-made objects and release the source code.
Abstract
The growing capabilities of neural rendering have increased the demand for new techniques that enable the intuitive editing of 3D objects, particularly when they are represented as neural implicit surfaces. In this paper, we present a novel neural algorithm to parameterize neural implicit surfaces to simple parametric domains, such as spheres, cubes or polycubes, where 3D radiance field can be represented as a 2D field, thereby facilitating visualization and various editing tasks. Technically, our method computes a bi-directional deformation between 3D objects and their chosen parametric domains, eliminating the need for any prior information. We adopt a forward mapping of points on the zero level set of the 3D object to a parametric domain, followed by a backward mapping through inverse deformation. To ensure the map is bijective, we employ a cycle loss while optimizing the smoothness of both deformations. Additionally, we leverage a Laplacian regularizer to effectively control angle distortion and offer the flexibility to choose from a range of parametric domains for managing area distortion. Designed for compatibility, our framework integrates seamlessly with existing neural rendering pipelines, taking multi-view images as input to reconstruct 3D geometry and compute the corresponding texture map. We also introduce a simple yet effective technique for intrinsic radiance decomposition, facilitating both view-independent material editing and view-dependent shading editing. Our method allows for the immediate rendering of edited textures through volume rendering, without the need for network re-training. Moreover, our approach supports the co-parameterization of multiple objects and enables texture transfer between them. We demonstrate the effectiveness of our method on images of human heads and man-made objects. We will make the source code publicly available.
Proposal-based Temporal Action Localization with Point-level Supervision
results: Our method achieves competitive or superior performance compared to state-of-the-art methods on four benchmarks: the ActivityNet 1.3, THUMOS 14, GTEA, and BEOID datasets.
Abstract
Point-level supervised temporal action localization (PTAL) aims at recognizing and localizing actions in untrimmed videos where only a single point (frame) within every action instance is annotated in training data. Without temporal annotations, most previous works adopt the multiple instance learning (MIL) framework, where the input video is segmented into non-overlapped short snippets, and action classification is performed independently on every short snippet. We argue that the MIL framework is suboptimal for PTAL because it operates on separated short snippets that contain limited temporal information. Therefore, the classifier only focuses on several easy-to-distinguish snippets instead of discovering the whole action instance without missing any relevant snippets. To alleviate this problem, we propose a novel method that localizes actions by generating and evaluating action proposals of flexible duration that involve more comprehensive temporal information. Moreover, we introduce an efficient clustering algorithm to efficiently generate dense pseudo labels that provide stronger supervision, and a fine-grained contrastive loss to further refine the quality of pseudo labels. Experiments show that our proposed method achieves competitive or superior performance to the state-of-the-art methods and some fully-supervised methods on four benchmarks: ActivityNet 1.3, THUMOS 14, GTEA, and BEOID datasets.
Colmap-PCD: An Open-source Tool for Fine Image-to-point cloud Registration
paper_authors: Chunge Bai, Ruijie Fu, Xiang Gao
for: This study addresses the missing scale information in monocular camera reconstruction by using a pre-established LiDAR map as a fixed constraint, enabling highly accurate reconstruction.
methods: The study proposes a novel cost-effective reconstruction pipeline that registers images to a point cloud map without requiring synchronous capture of camera and LiDAR data, making it possible to manage the level of reconstruction detail across different areas of interest. Based on the Colmap algorithm, we release an open-source tool, Colmap-PCD, to facilitate further research in this field.
results: The pipeline manages reconstruction detail levels across different areas without synchronized camera and LiDAR capture, providing higher accuracy and more detail than existing reconstruction methods.
Abstract
State-of-the-art techniques for monocular camera reconstruction predominantly rely on the Structure from Motion (SfM) pipeline. However, such methods often yield reconstruction outcomes that lack crucial scale information, and over time, accumulation of images leads to inevitable drift issues. In contrast, mapping methods based on LiDAR scans are popular in large-scale urban scene reconstruction due to their precise distance measurements, a capability fundamentally absent in visual-based approaches. Researchers have made attempts to utilize concurrent LiDAR and camera measurements in pursuit of precise scaling and color details within mapping outcomes. However, the outcomes are subject to extrinsic calibration and time synchronization precision. In this paper, we propose a novel cost-effective reconstruction pipeline that utilizes a pre-established LiDAR map as a fixed constraint to effectively address the inherent scale challenges present in monocular camera reconstruction. To our knowledge, our method is the first to register images onto the point cloud map without requiring synchronous capture of camera and LiDAR data, granting us the flexibility to manage reconstruction detail levels across various areas of interest. To facilitate further research in this domain, we have released Colmap-PCD, an open-source tool leveraging the Colmap algorithm, that enables precise fine-scale registration of images to the point cloud map.
Semi-Supervised Object Detection with Uncurated Unlabeled Data for Remote Sensing Images
results: Experimental results on two widely used remote sensing object detection datasets (DIOR and DOTA) demonstrate the efficiency and accuracy of the method, which also remains robust under various changes and noise.
Abstract
Annotating remote sensing images (RSIs) presents a notable challenge due to its labor-intensive nature. Semi-supervised object detection (SSOD) methods tackle this issue by generating pseudo-labels for the unlabeled data, assuming that all classes found in the unlabeled dataset are also represented in the labeled data. However, real-world situations introduce the possibility of out-of-distribution (OOD) samples being mixed with in-distribution (ID) samples within the unlabeled dataset. In this paper, we delve into techniques for conducting SSOD directly on uncurated unlabeled data, which is termed Open-Set Semi-Supervised Object Detection (OSSOD). Our approach commences by employing labeled in-distribution data to dynamically construct a class-wise feature bank (CFB) that captures features specific to each class. Subsequently, we compare the features of predicted object bounding boxes with the corresponding entries in the CFB to calculate OOD scores. We design an adaptive threshold based on the statistical properties of the CFB, allowing us to filter out OOD samples effectively. The effectiveness of our proposed method is substantiated through extensive experiments on two widely used remote sensing object detection datasets: DIOR and DOTA. These experiments showcase the superior performance and efficacy of our approach for OSSOD on RSIs.
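The OOD scoring against the class-wise feature bank can be sketched as follows (the cosine similarity measure, quantile threshold, and names are assumptions beyond what the abstract states): each predicted box feature is compared to the bank entries of its predicted class, and low maximum similarity marks a likely OOD sample.

```python
import numpy as np

def ood_scores(box_feats, pred_classes, feature_bank):
    """Score predicted boxes against a class-wise feature bank (CFB).

    box_feats:    (N, D) features of predicted object boxes.
    pred_classes: (N,) predicted class ids.
    feature_bank: dict class_id -> (M, D) stored in-distribution features.
    Returns an OOD score per box: 1 - max cosine similarity to the bank.
    """
    def cos(a, b):
        a = a / (np.linalg.norm(a) + 1e-12)
        b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-12)
        return b @ a

    return np.array([1.0 - cos(f, feature_bank[c]).max()
                     for f, c in zip(box_feats, pred_classes)])

rng = np.random.default_rng(0)
bank = {0: rng.normal(size=(20, 8)), 1: rng.normal(size=(20, 8))}
feats = rng.normal(size=(5, 8))
classes = np.array([0, 1, 0, 1, 0])
scores = ood_scores(feats, classes, bank)
keep = scores < np.quantile(scores, 0.8)  # stand-in for the adaptive threshold
print(scores.round(3), keep)
```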
Geometry-Guided Ray Augmentation for Neural Surface Reconstruction with Sparse Views
results: Our method, named Ray Augmentation (RayAug), achieves superior results on the DTU and Blender datasets without requiring prior training, demonstrating its effectiveness for sparse-view reconstruction. Our pipeline is flexible and can be integrated into other implicit neural reconstruction methods.
Abstract
In this paper, we propose a novel method for 3D scene and object reconstruction from sparse multi-view images. Different from previous methods that leverage extra information such as depth or generalizable features across scenes, our approach leverages the scene properties embedded in the multi-view inputs to create precise pseudo-labels for optimization without any prior training. Specifically, we introduce a geometry-guided approach that improves surface reconstruction accuracy from sparse views by leveraging spherical harmonics to predict the novel radiance while holistically considering all color observations for a point in the scene. Also, our pipeline exploits proxy geometry and correctly handles the occlusion in generating the pseudo-labels of radiance, which previous image-warping methods fail to avoid. Our method, dubbed Ray Augmentation (RayAug), achieves superior results on DTU and Blender datasets without requiring prior training, demonstrating its effectiveness in addressing the problem of sparse view reconstruction. Our pipeline is flexible and can be integrated into other implicit neural reconstruction methods for sparse views.
AdaFuse: Adaptive Medical Image Fusion Based on Spatial-Frequential Cross Attention
results: Comparative experiments on multiple datasets show that the proposed method outperforms state-of-the-art methods in both visual quality and quantitative metrics. Ablation studies of the loss function and fusion strategy further confirm its effectiveness.
Abstract
Multi-modal medical image fusion is essential for precise clinical diagnosis and surgical navigation, since it can merge the complementary information of multiple modalities into a single image. The quality of the fused image depends on the extracted single-modality features as well as the fusion rules for multi-modal information. Existing deep learning-based fusion methods can fully exploit the semantic features of each modality, but they cannot distinguish the effective low- and high-frequency information of each modality and fuse them adaptively. To address this issue, we propose AdaFuse, in which multimodal image information is fused adaptively through a frequency-guided attention mechanism based on the Fourier transform. Specifically, we propose the cross-attention fusion (CAF) block, which adaptively fuses features of two modalities in the spatial and frequency domains by exchanging key and query values, and then calculates the cross-attention scores between the spatial and frequency features to further guide the spatial-frequential information fusion. The CAF block enhances the high-frequency features of the different modalities so that the details in the fused images can be retained. Moreover, we design a novel loss function composed of structure loss and content loss to preserve both low- and high-frequency information. Extensive comparison experiments on several datasets demonstrate that the proposed method outperforms state-of-the-art methods in terms of both visual quality and quantitative metrics. The ablation experiments also validate the effectiveness of the proposed loss and fusion strategy.
Memory-Assisted Sub-Prototype Mining for Universal Domain Adaptation
paper_authors: Yuxiang Lai, Xinghong Liu, Tao Zhou, Yi Zhou
for: This paper proposes a novel memory-assisted sub-prototype mining method to address the poor adaptability of universal domain adaptation models when significant concept shift exists within a category.
methods: The method mines sub-prototypes with memory assistance, learning the intra-class structure of each category to obtain a finer-grained feature space that improves the model's adaptability and accuracy.
results: The method achieves state-of-the-art performance in most cases across multiple scenarios, including UniDA, OSDA, and PDA, on four benchmarks.
Abstract
Universal domain adaptation aims to align the classes and reduce the feature gap between the same category of the source and target domains. The target private category is set as the unknown class during the adaptation process, as it is not included in the source domain. However, most existing methods overlook the intra-class structure within a category, especially in cases where there exists significant concept shift between the samples belonging to the same category. When samples with large concept shift are forced to be pushed together, it may negatively affect the adaptation performance. Moreover, from the interpretability aspect, it is unreasonable to align visual features with significant differences, such as fighter jets and civil aircraft, into the same category. Unfortunately, due to such semantic ambiguity and annotation cost, categories are not always classified in detail, making it difficult for the model to perform precise adaptation. To address these issues, we propose a novel Memory-Assisted Sub-Prototype Mining (MemSPM) method that can learn the differences between samples belonging to the same category and mine sub-classes when there exists significant concept shift between them. By doing so, our model learns a more reasonable feature space that enhances the transferability and reflects the inherent differences among samples annotated as the same category. We evaluate the effectiveness of our MemSPM method over multiple scenarios, including UniDA, OSDA, and PDA. Our method achieves state-of-the-art performance on four benchmarks in most cases.
Towards Fair and Comprehensive Comparisons for Image-Based 3D Object Detection
results: Through an in-depth analysis and evaluation of current methods, we provide a set of reproducible results and discuss open questions, such as the discrepant conclusions on the KITTI-3D and nuScenes datasets, to facilitate future research on image-based 3D object detection.
Abstract
In this work, we build a modular-designed codebase, formulate strong training recipes, design an error diagnosis toolbox, and discuss current methods for image-based 3D object detection. In particular, different from other highly mature tasks, e.g., 2D object detection, the community of image-based 3D object detection is still evolving, where methods often adopt different training recipes and tricks resulting in unfair evaluations and comparisons. What is worse, these tricks may overwhelm their proposed designs in performance, even leading to wrong conclusions. To address this issue, we build a module-designed codebase and formulate unified training standards for the community. Furthermore, we also design an error diagnosis toolbox to measure the detailed characterization of detection models. Using these tools, we analyze current methods in-depth under varying settings and provide discussions for some open questions, e.g., discrepancies in conclusions on KITTI-3D and nuScenes datasets, which have led to different dominant methods for these datasets. We hope that this work will facilitate future research in image-based 3D object detection. Our codes will be released at \url{https://github.com/OpenGVLab/3dodi}
results: RetSeg is trained and validated on two publicly available datasets, and shows promising performance across several additional public datasets; on colonoscopy images in particular, it aims to combine precise polyp segmentation with efficient resource utilization.
Abstract
Vision Transformers (ViTs) have revolutionized medical imaging analysis, showcasing superior efficacy compared to conventional Convolutional Neural Networks (CNNs) in vital tasks such as polyp classification, detection, and segmentation. Leveraging attention mechanisms to focus on specific image regions, ViTs exhibit contextual awareness in processing visual data, culminating in robust and precise predictions, even for intricate medical images. Moreover, the inherent self-attention mechanism in Transformers accommodates varying input sizes and resolutions, granting an unprecedented flexibility absent in traditional CNNs. However, Transformers grapple with challenges like excessive memory usage and limited training parallelism due to self-attention, rendering them impractical for real-time disease detection on resource-constrained devices. In this study, we address these hurdles by investigating the integration of the recently introduced retention mechanism into polyp segmentation, introducing RetSeg, an encoder-decoder network featuring multi-head retention blocks. Drawing inspiration from Retentive Networks (RetNet), RetSeg is designed to bridge the gap between precise polyp segmentation and resource utilization, particularly tailored for colonoscopy images. We train and validate RetSeg for polyp segmentation employing two publicly available datasets: Kvasir-SEG and CVC-ClinicDB. Additionally, we showcase RetSeg's promising performance across diverse public datasets, including CVC-ColonDB, ETIS-LaribPolypDB, CVC-300, and BKAI-IGH NeoPolyp. While our work represents an early-stage exploration, further in-depth studies are imperative to advance these promising findings.
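As background for the retention blocks mentioned above, here is a minimal sketch of the single-head, parallel-form retention operation that Retentive Networks build on: attention-like token mixing with a causal, exponentially decaying mask and no softmax. RetSeg wraps multi-head variants of this inside an encoder-decoder; the toy shapes and the decay value below are assumptions.

```python
import torch

def retention(q, k, v, gamma=0.9):
    # q, k, v: (B, T, D) token features; single head, parallel form
    T = q.size(1)
    n = torch.arange(T)
    exp = (n[:, None] - n[None, :]).clamp(min=0).float()
    mask = (n[:, None] >= n[None, :]).float()
    decay = (gamma ** exp) * mask          # causal, exponentially decaying mask
    scores = q @ k.transpose(-1, -2)       # no softmax, unlike attention
    return (scores * decay) @ v

q = k = v = torch.randn(2, 8, 16)
print(retention(q, k, v).shape)  # torch.Size([2, 8, 16])
```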
AngioMoCo: Learning-based Motion Correction in Cerebral Digital Subtraction Angiography
paper_authors: Ruisheng Su, Matthijs van der Sluijs, Sandra Cornelissen, Wim van Zwam, Aad van der Lugt, Wiro Niessen, Danny Ruijters, Theo van Walsum, Adrian Dalca
For: The paper aims to address the limitations of cerebral X-ray digital subtraction angiography (DSA) by developing a learning-based framework called AngioMoCo, which generates motion-compensated DSA sequences from X-ray angiography.
Methods: AngioMoCo integrates contrast extraction and motion correction, enabling differentiation between patient motion and intensity changes caused by contrast flow. The framework uses a learning-based approach that is substantially faster than iterative elastix-based methods.
Results: The paper demonstrates the effectiveness of AngioMoCo on a large national multi-center dataset (MR CLEAN Registry) of clinically acquired angiographic images through comprehensive qualitative and quantitative analyses. AngioMoCo produces high-quality motion-compensated DSA, removing motion artifacts while preserving contrast flow.
Abstract
Cerebral X-ray digital subtraction angiography (DSA) is the standard imaging technique for visualizing blood flow and guiding endovascular treatments. The quality of DSA is often negatively impacted by body motion during acquisition, leading to decreased diagnostic value. Time-consuming iterative methods address motion correction based on non-rigid registration, and employ sparse key points and non-rigidity penalties to limit vessel distortion. Recent methods alleviate subtraction artifacts by predicting the subtracted frame from the corresponding unsubtracted frame, but do not explicitly compensate for motion-induced misalignment between frames. This hinders the serial evaluation of blood flow, and often causes undesired vasculature and contrast flow alterations, leading to impeded usability in clinical practice. To address these limitations, we present AngioMoCo, a learning-based framework that generates motion-compensated DSA sequences from X-ray angiography. AngioMoCo integrates contrast extraction and motion correction, enabling differentiation between patient motion and intensity changes caused by contrast flow. This strategy improves registration quality while being substantially faster than iterative elastix-based methods. We demonstrate AngioMoCo on a large national multi-center dataset (MR CLEAN Registry) of clinically acquired angiographic images through comprehensive qualitative and quantitative analyses. AngioMoCo produces high-quality motion-compensated DSA, removing motion artifacts while preserving contrast flow. Code is publicly available at https://github.com/RuishengSu/AngioMoCo.
Semantic-aware Temporal Channel-wise Attention for Cardiac Function Assessment
results: Achieves state-of-the-art performance on the Stanford dataset, improving on the baseline by 0.22 MAE, 0.26 RMSE, and 1.9% $R^2$.
Abstract
Cardiac function assessment aims at predicting left ventricular ejection fraction (LVEF) given an echocardiogram video, which requests models to focus on the changes in the left ventricle during the cardiac cycle. How to assess cardiac function accurately and automatically from an echocardiogram video is a valuable topic in intelligent assisted healthcare. Existing video-based methods do not pay much attention to the left ventricular region, nor the left ventricular changes caused by motion. In this work, we propose a semi-supervised auxiliary learning paradigm with a left ventricular segmentation task, which contributes to the representation learning for the left ventricular region. To better model the importance of motion information, we introduce a temporal channel-wise attention (TCA) module to excite those channels used to describe motion. Furthermore, we reform the TCA module with semantic perception by taking the segmentation map of the left ventricle as input to focus on the motion patterns of the left ventricle. Finally, to reduce the difficulty of direct LVEF regression, we utilize an anchor-based classification and regression method to predict LVEF. Our approach achieves state-of-the-art performance on the Stanford dataset with an improvement of 0.22 MAE, 0.26 RMSE, and 1.9% $R^2$.
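A hedged sketch of what a temporal channel-wise attention module can look like: a motion statistic (here, frame-difference energy, an illustrative choice) is squeezed per channel and passed through a small gating network that re-weights the channels. The paper's exact TCA gating and its semantic-aware variant may differ.

```python
import torch
import torch.nn as nn

class TemporalChannelAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (B, C, T, H, W) video features
        motion = (x[:, :, 1:] - x[:, :, :-1]).abs().mean(dim=(2, 3, 4))  # (B, C)
        w = self.gate(motion)[:, :, None, None, None]
        return x * w  # excite channels that carry motion

x = torch.randn(2, 64, 16, 14, 14)
print(TemporalChannelAttention(64)(x).shape)  # torch.Size([2, 64, 16, 14, 14])
```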
GradientSurf: Gradient-Domain Neural Surface Reconstruction from RGB Video
paper_authors: Crane He Chen, Joerg Liebelt
for: This paper proposes a real-time surface reconstruction method that recovers scene surfaces from monocular RGB video.
methods: The method builds on the tight coupling between surface, volume, and oriented point cloud, and solves the reconstruction problem in the gradient domain. Unlike classical Poisson surface reconstruction, it solves the problem online, incrementally updating a neural network from partial scans.
results: For indoor scene reconstruction, visual and quantitative experiments show that the proposed method recovers more surface detail in curved regions and achieves higher fidelity on small objects than previous methods.
Abstract
This paper proposes GradientSurf, a novel algorithm for real time surface reconstruction from monocular RGB video. Inspired by Poisson Surface Reconstruction, the proposed method builds on the tight coupling between surface, volume, and oriented point cloud and solves the reconstruction problem in gradient-domain. Unlike Poisson Surface Reconstruction which finds an offline solution to the Poisson equation by solving a linear system after the scanning process is finished, our method finds online solutions from partial scans with a neural network incrementally where the Poisson layer is designed to supervise both local and global reconstruction. The main challenge that existing methods suffer from when reconstructing from RGB signal is a lack of details in the reconstructed surface. We hypothesize this is due to the spectral bias of neural networks towards learning low frequency geometric features. To address this issue, the reconstruction problem is cast onto gradient domain, where zeroth-order and first-order energies are minimized. The zeroth-order term penalizes location of the surface. The first-order term penalizes the difference between the gradient of reconstructed implicit function and the vector field formulated from oriented point clouds sampled at adaptive local densities. For the task of indoor scene reconstruction, visual and quantitative experimental results show that the proposed method reconstructs surfaces with more details in curved regions and higher fidelity for small objects than previous methods.
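The zeroth- and first-order energies described above translate naturally into a training loss for an implicit function. Below is a minimal sketch under simple assumptions (a toy MLP, uniform weighting): the zeroth-order term pins the function to zero at oriented point samples, and the first-order term matches its spatial gradient to the point normals.

```python
import torch

def gradient_domain_loss(implicit_fn, points, normals, w1=1.0):
    points = points.clone().requires_grad_(True)
    f = implicit_fn(points)                               # (N, 1) implicit values
    grad = torch.autograd.grad(f.sum(), points, create_graph=True)[0]
    zeroth = (f ** 2).mean()                              # surface location term
    first = ((grad - normals) ** 2).sum(dim=-1).mean()    # gradient alignment term
    return zeroth + w1 * first

net = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.Softplus(),
                          torch.nn.Linear(64, 1))
pts = torch.randn(256, 3)
nrm = torch.nn.functional.normalize(torch.randn(256, 3), dim=-1)
loss = gradient_domain_loss(net, pts, nrm)
loss.backward()
print(loss.item())
```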
Efficient-VQGAN: Towards High-Resolution Image Generation with Efficient Vision Transformers
methods: a local attention-based quantization model, multi-grained feature interaction, and a generation pipeline combining autoencoding training with autoregressive generation
results: improves image generation speed, fidelity, and resolution
Abstract
Vector-quantized image modeling has shown great potential in synthesizing high-quality images. However, generating high-resolution images remains a challenging task due to the quadratic computational overhead of the self-attention process. In this study, we seek to explore a more efficient two-stage framework for high-resolution image generation with improvements in the following three aspects. (1) Based on the observation that the first quantization stage has solid local property, we employ a local attention-based quantization model instead of the global attention mechanism used in previous methods, leading to better efficiency and reconstruction quality. (2) We emphasize the importance of multi-grained feature interaction during image generation and introduce an efficient attention mechanism that combines global attention (long-range semantic consistency within the whole image) and local attention (fined-grained details). This approach results in faster generation speed, higher generation fidelity, and improved resolution. (3) We propose a new generation pipeline incorporating autoencoding training and autoregressive generation strategy, demonstrating a better paradigm for image synthesis. Extensive experiments demonstrate the superiority of our approach in high-quality and high-resolution image reconstruction and generation.
results: Extensive experiments cover diverse visual tasks, including classification, object detection, instance segmentation, and semantic segmentation. HST achieves state-of-the-art performance, reaching an average Top-1 accuracy of 76.0% on VTAB-1k while fine-tuning only 0.78M parameters. On the COCO testdev benchmark, HST even surpasses full fine-tuning, obtaining 49.7 box AP and 43.2 mask AP with Cascade Mask R-CNN.
Abstract
Fine-tuning pre-trained Vision Transformers (ViT) has consistently demonstrated promising performance in the realm of visual recognition. However, adapting large pre-trained models to various tasks poses a significant challenge. This challenge arises from the need for each model to undergo an independent and comprehensive fine-tuning process, leading to substantial computational and memory demands. While recent advancements in Parameter-efficient Transfer Learning (PETL) have demonstrated their ability to achieve superior performance compared to full fine-tuning with a smaller subset of parameter updates, they tend to overlook dense prediction tasks such as object detection and segmentation. In this paper, we introduce Hierarchical Side-Tuning (HST), a novel PETL approach that enables ViT transfer to various downstream tasks effectively. Diverging from existing methods that exclusively fine-tune parameters within input spaces or certain modules connected to the backbone, we tune a lightweight and hierarchical side network (HSN) that leverages intermediate activations extracted from the backbone and generates multi-scale features to make predictions. To validate HST, we conducted extensive experiments encompassing diverse visual tasks, including classification, object detection, instance segmentation, and semantic segmentation. Notably, our method achieves state-of-the-art average Top-1 accuracy of 76.0% on VTAB-1k, all while fine-tuning a mere 0.78M parameters. When applied to object detection tasks on COCO testdev benchmark, HST even surpasses full fine-tuning and obtains better performance with 49.7 box AP and 43.2 mask AP using Cascade Mask R-CNN.
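To illustrate the side-tuning idea, here is a hedged sketch of a small trainable side network that consumes intermediate activations from a frozen backbone and fuses them into a single feature; only the side network receives gradients. HST's hierarchical design and multi-scale heads are richer; the module sizes and fusion rule below are placeholders.

```python
import torch
import torch.nn as nn

class SideNetwork(nn.Module):
    def __init__(self, dims=(96, 192, 384), out_dim=256):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, out_dim) for d in dims])
        self.mix = nn.Linear(out_dim, out_dim)

    def forward(self, backbone_feats):
        # backbone_feats: list of (B, N_i, D_i) intermediate activations from a
        # frozen ViT; detach() keeps gradients out of the backbone.
        fused = sum(p(f.detach()).mean(dim=1)
                    for p, f in zip(self.proj, backbone_feats))
        return self.mix(fused)

feats = [torch.randn(2, 196, 96), torch.randn(2, 49, 192), torch.randn(2, 16, 384)]
print(SideNetwork()(feats).shape)  # torch.Size([2, 256])
```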
results: Experiments show that LightFC achieves an optimal trade-off among performance, parameters, Flops, and FPS, and runs 2x faster than MixFormerV2-S on CPUs.
Abstract
Although single object trackers have achieved advanced performance, their large-scale models make it difficult to apply them on the platforms with limited resources. Moreover, existing lightweight trackers only achieve balance between 2-3 points in terms of parameters, performance, Flops and FPS. To achieve the optimal balance among these points, this paper propose a lightweight full-convolutional Siamese tracker called LightFC. LightFC employs a novel efficient cross-correlation module (ECM) and a novel efficient rep-center head (ERH) to enhance the nonlinear expressiveness of the convolutional tracking pipeline. The ECM employs an attention-like module design, which conducts spatial and channel linear fusion of fused features and enhances the nonlinearly of the fused features. Additionally, it references successful factors of current lightweight trackers and introduces skip-connections and reuse of search area features. The ERH reparameterizes the feature dimensional stage in the standard center head and introduces channel attention to optimize the bottleneck of key feature flows. Comprehensive experiments show that LightFC achieves the optimal balance between performance, parameters, Flops and FPS. The precision score of LightFC outperforms MixFormerV2-S by 3.7 \% and 6.5 \% on LaSOT and TNL2K, respectively, while using 5x fewer parameters and 4.6x fewer Flops. Besides, LightFC runs 2x faster than MixFormerV2-S on CPUs. Our code and raw results can be found at https://github.com/LiYunfengLYF/LightFC
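For context, the cross-correlation step at the heart of Siamese trackers can be written as a grouped convolution in which the template features act as kernels over the search-region features. The ECM described above adds attention-style spatial and channel fusion on top of this; the sketch below shows only the plain depth-wise correlation.

```python
import torch
import torch.nn.functional as F

def depthwise_xcorr(template, search):
    """template: (B, C, Ht, Wt); search: (B, C, Hs, Ws) -> (B, C, H', W')."""
    B, C, Ht, Wt = template.shape
    x = search.reshape(1, B * C, search.size(2), search.size(3))
    k = template.reshape(B * C, 1, Ht, Wt)
    out = F.conv2d(x, k, groups=B * C)  # each channel correlated independently
    return out.reshape(B, C, out.size(2), out.size(3))

t = torch.randn(2, 64, 8, 8)    # template (target) features
s = torch.randn(2, 64, 16, 16)  # search-region features
print(depthwise_xcorr(t, s).shape)  # torch.Size([2, 64, 9, 9])
```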
Neural Impostor: Editing Neural Radiance Fields with Explicit Shape Manipulation
results: Provides a practical way to edit NeRF, including deforming, compositing, and generating neural implicit fields, while maintaining a complex volumetric appearance.
Abstract
Neural Radiance Fields (NeRF) have significantly advanced the generation of highly realistic and expressive 3D scenes. However, the task of editing NeRF, particularly in terms of geometry modification, poses a significant challenge. This issue has obstructed NeRF's wider adoption across various applications. To tackle the problem of efficiently editing neural implicit fields, we introduce Neural Impostor, a hybrid representation incorporating an explicit tetrahedral mesh alongside a multigrid implicit field designated for each tetrahedron within the explicit mesh. Our framework bridges the explicit shape manipulation and the geometric editing of implicit fields by utilizing multigrid barycentric coordinate encoding, thus offering a pragmatic solution to deform, composite, and generate neural implicit fields while maintaining a complex volumetric appearance. Furthermore, we propose a comprehensive pipeline for editing neural implicit fields based on a set of explicit geometric editing operations. We show the robustness and adaptability of our system through diverse examples and experiments, including the editing of both synthetic objects and real captured data. Finally, we demonstrate the authoring process of a hybrid synthetic-captured object utilizing a variety of editing operations, underlining the transformative potential of Neural Impostor in the field of 3D content creation and manipulation.
Three-Stage Cascade Framework for Blurry Video Frame Interpolation
paper_authors: Pengcheng Lei, Zaoming Yan, Tingting Wang, Faming Fang, Guixu Zhang
for: This paper proposes a simple end-to-end three-stage framework that fully exploits the useful information in blurry videos to improve high-frame-rate clear video generation.
methods: The model consists of three stages: frame interpolation, temporal feature fusion, and deblurring. The frame interpolation stage uses a temporal deformable network to sample useful information directly from the blurry inputs and synthesize an intermediate frame at an arbitrary time; the temporal feature fusion stage mines long-term temporal information for each target frame through a bi-directional recurrent deformable alignment network; and the deblurring stage applies a transformer-empowered Taylor approximation network to recursively recover high-frequency details.
results: Experimental results show that the model performs well on four benchmarks and generalizes well to real-world blurry videos.
Abstract
Blurry video frame interpolation (BVFI) aims to generate high-frame-rate clear videos from low-frame-rate blurry videos, is a challenging but important topic in the computer vision community. Blurry videos not only provide spatial and temporal information like clear videos, but also contain additional motion information hidden in each blurry frame. However, existing BVFI methods usually fail to fully leverage all valuable information, which ultimately hinders their performance. In this paper, we propose a simple end-to-end three-stage framework to fully explore useful information from blurry videos. The frame interpolation stage designs a temporal deformable network to directly sample useful information from blurry inputs and synthesize an intermediate frame at an arbitrary time interval. The temporal feature fusion stage explores the long-term temporal information for each target frame through a bi-directional recurrent deformable alignment network. And the deblurring stage applies a transformer-empowered Taylor approximation network to recursively recover the high-frequency details. The proposed three-stage framework has clear task assignment for each module and offers good expandability, the effectiveness of which are demonstrated by various experimental results. We evaluate our model on four benchmarks, including the Adobe240 dataset, GoPro dataset, YouTube240 dataset and Sony dataset. Quantitative and qualitative results indicate that our model outperforms existing SOTA methods. Besides, experiments on real-world blurry videos also indicate the good generalization ability of our model.
IPDreamer: Appearance-Controllable 3D Object Generation with Image Prompts
results: Experimental results show that IPDreamer effectively generates high-quality 3D objects that are consistent with both the text and image prompts, achieving appearance-controllable 3D generation.
Abstract
Recent advances in text-to-3D generation have been remarkable, with methods such as DreamFusion leveraging large-scale text-to-image diffusion-based models to supervise 3D generation. These methods, including the variational score distillation proposed by ProlificDreamer, enable the synthesis of detailed and photorealistic textured meshes. However, the appearance of 3D objects generated by these methods is often random and uncontrollable, posing a challenge in achieving appearance-controllable 3D objects. To address this challenge, we introduce IPDreamer, a novel approach that incorporates image prompts to provide specific and comprehensive appearance information for 3D object generation. Our results demonstrate that IPDreamer effectively generates high-quality 3D objects that are consistent with both the provided text and image prompts, demonstrating its promising capability in appearance-controllable 3D object generation.
Enhancing Prostate Cancer Diagnosis with Deep Learning: A Study using mpMRI Segmentation and Classification
paper_authors: Anil B. Gavade, Neel Kanwal, Priyanka A. Gavade, Rajendra Nerli
for: This paper aims to improve the early and precise diagnosis of prostate cancer to enable more effective treatment.
methods: The study uses deep learning models to classify and segment mpMRI images, and compares different segmentation-classifier combinations.
results: Experiments show that the pipeline combining U-Net with an LSTM performs best on both classification and segmentation, outperforming all other combinations.
Abstract
Prostate cancer (PCa) is a severe disease among men globally. It is important to identify PCa early and make a precise diagnosis for effective treatment. For PCa diagnosis, Multi-parametric magnetic resonance imaging (mpMRI) emerged as an invaluable imaging modality that offers a precise anatomical view of the prostate gland and its tissue structure. Deep learning (DL) models can enhance existing clinical systems and improve patient care by locating regions of interest for physicians. Recently, DL techniques have been employed to develop a pipeline for segmenting and classifying different cancer types. These studies show that DL can be used to increase diagnostic precision and give objective results without variability. This work uses well-known DL models for the classification and segmentation of mpMRI images to detect PCa. Our implementation involves four pipelines; Semantic DeepSegNet with ResNet50, DeepSegNet with recurrent neural network (RNN), U-Net with RNN, and U-Net with a long short-term memory (LSTM). Each segmentation model is paired with a different classifier to evaluate the performance using different metrics. The results of our experiments show that the pipeline that uses the combination of U-Net and the LSTM model outperforms all other combinations, excelling in both segmentation and classification tasks.
SocialCircle: Learning the Angle-based Social Interaction Representation for Pedestrian Trajectory Prediction
results: Experiments show that training SocialCircle together with newly released trajectory prediction models quantitatively improves prediction performance, and qualitatively helps the models consider social interactions when forecasting pedestrian trajectories in a way that is consistent with human intuition.
Abstract
Analyzing and forecasting trajectories of agents like pedestrians and cars in complex scenes has become more and more significant in many intelligent systems and applications. The diversity and uncertainty in socially interactive behaviors among a rich variety of agents make this task more challenging than other deterministic computer vision tasks. Researchers have made a lot of efforts to quantify the effects of these interactions on future trajectories through different mathematical models and network structures, but this problem has not been well solved. Inspired by marine animals that localize the positions of their companions underwater through echoes, we build a new anglebased trainable social representation, named SocialCircle, for continuously reflecting the context of social interactions at different angular orientations relative to the target agent. We validate the effect of the proposed SocialCircle by training it along with several newly released trajectory prediction models, and experiments show that the SocialCircle not only quantitatively improves the prediction performance, but also qualitatively helps better consider social interactions when forecasting pedestrian trajectories in a way that is consistent with human intuitions.
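A minimal sketch of an angle-based social representation in the spirit of SocialCircle: neighbours are bucketed by their angular orientation relative to the target agent, and a simple per-sector statistic (inverse distance, an illustrative choice) is aggregated. The actual SocialCircle representation is learned and trained jointly with the trajectory predictor.

```python
import numpy as np

def angle_based_representation(target_xy, neighbour_xy, n_sectors=8):
    rel = neighbour_xy - target_xy                      # (N, 2) offsets
    angles = np.arctan2(rel[:, 1], rel[:, 0])           # in [-pi, pi]
    sectors = ((angles + np.pi) / (2 * np.pi) * n_sectors).astype(int) % n_sectors
    dist = np.linalg.norm(rel, axis=1)
    rep = np.zeros(n_sectors)
    for s, d in zip(sectors, dist):
        rep[s] += 1.0 / max(d, 1e-6)                    # closer neighbours weigh more
    return rep

print(angle_based_representation(np.zeros(2), np.random.randn(5, 2)))
```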
Rotation Matters: Generalized Monocular 3D Object Detection for Various Camera Systems
paper_authors: SungHo Moon, JinWoo Bae, SungHoon Im
for: This paper analyzes the causes of performance degradation in monocular 3D object detection and proposes a generalized detection method that improves performance across different camera systems.
methods: The paper conducts extensive experiments to analyze the factors behind the degradation, and proposes a compensation module that corrects the estimated 3D bounding box location and heading direction.
results: The proposed compensation module can be applied to most recent 3D object detection networks, increasing the AP3D score (KITTI moderate, IoU > 70%) about 6-to-10-times over the baselines, with clear gains in both quantitative and qualitative results.
Abstract
Research on monocular 3D object detection is being actively studied, and as a result, performance has been steadily improving. However, 3D object detection performance is significantly reduced when applied to a camera system different from the system used to capture the training datasets. For example, a 3D detector trained on datasets from a passenger car mostly fails to regress accurate 3D bounding boxes for a camera mounted on a bus. In this paper, we conduct extensive experiments to analyze the factors that cause performance degradation. We find that changing the camera pose, especially camera orientation, relative to the road plane caused performance degradation. In addition, we propose a generalized 3D object detection method that can be universally applied to various camera systems. We newly design a compensation module that corrects the estimated 3D bounding box location and heading direction. The proposed module can be applied to most of the recent 3D object detection networks. It increases AP3D score (KITTI moderate, IoU $> 70\%$) about 6-to-10-times above the baselines without additional training. Both quantitative and qualitative results show the effectiveness of the proposed method.
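To illustrate why camera orientation matters, the sketch below rotates a predicted box centre from one camera frame into another that is pitched differently relative to the road plane. The paper's compensation module learns this correction (including the heading update); the fixed pitch value and the assumption that yaw is unchanged are simplifications.

```python
import numpy as np

def compensate_pitch(center_cam, heading, delta_pitch_rad):
    """Rotate a box centre (x right, y down, z forward) about the camera
    x-axis by the pitch difference between two camera setups."""
    c, s = np.cos(delta_pitch_rad), np.sin(delta_pitch_rad)
    R = np.array([[1, 0, 0],
                  [0, c, -s],
                  [0, s,  c]])
    # simplification: yaw about the vertical axis is treated as unchanged
    return R @ center_cam, heading

center, yaw = np.array([1.0, 1.5, 20.0]), 0.3
print(compensate_pitch(center, yaw, np.deg2rad(5.0)))
```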
C^2M-DoT: Cross-modal consistent multi-view medical report generation with domain transfer network
results: Experiments show that C^2M-DoT substantially outperforms state-of-the-art baselines on two public benchmark datasets across all metrics; ablation studies also confirm the validity and necessity of each component.
Abstract
In clinical scenarios, multiple medical images with different views are usually generated simultaneously, and these images have high semantic consistency. However, most existing medical report generation methods only consider single-view data. The rich multi-view mutual information of medical images can help generate more accurate reports, however, the dependence of multi-view models on multi-view data in the inference stage severely limits their application in clinical practice. In addition, word-level optimization based on numbers ignores the semantics of reports and medical images, and the generated reports often cannot achieve good performance. Therefore, we propose a cross-modal consistent multi-view medical report generation with a domain transfer network (C^2M-DoT). Specifically, (i) a semantic-based multi-view contrastive learning medical report generation framework is adopted to utilize cross-view information to learn the semantic representation of lesions; (ii) a domain transfer network is further proposed to ensure that the multi-view report generation model can still achieve good inference performance under single-view input; (iii) meanwhile, optimization using a cross-modal consistency loss facilitates the generation of textual reports that are semantically consistent with medical images. Extensive experimental studies on two public benchmark datasets demonstrate that C^2M-DoT substantially outperforms state-of-the-art baselines in all metrics. Ablation studies also confirmed the validity and necessity of each component in C^2M-DoT.
Infrared Small Target Detection Using Double-Weighted Multi-Granularity Patch Tensor Model With Tensor-Train Decomposition
results: The proposed method shows better robustness and stability across various complex scenes; compared with eight other state-of-the-art methods under different evaluation metrics, it achieves better detection performance.
Abstract
Infrared small target detection plays an important role in the remote sensing fields. Therefore, many detection algorithms have been proposed, in which the infrared patch-tensor (IPT) model has become a mainstream tool due to its excellent performance. However, most IPT-based methods face great challenges, such as inaccurate measure of the tensor low-rankness and poor robustness to complex scenes, which will lead to poor detection performance. In order to solve these problems, this paper proposes a novel double-weighted multi-granularity infrared patch tensor (DWMGIPT) model. First, to capture different granularity information of tensor from multiple modes, a multi-granularity infrared patch tensor (MGIPT) model is constructed by collecting nonoverlapping patches and tensor augmentation based on the tensor train (TT) decomposition. Second, to explore the latent structure of tensor more efficiently, we utilize the auto-weighted mechanism to balance the importance of information at different granularity. Then, the steering kernel (SK) is employed to extract local structure prior, which suppresses background interference such as strong edges and noise. Finally, an efficient optimization algorithm based on the alternating direction method of multipliers (ADMM) is presented to solve the model. Extensive experiments in various challenging scenes show that the proposed algorithm is robust to noise and different scenes. Compared with the other eight state-of-the-art methods, different evaluation metrics demonstrate that our method achieves better detection performance in various complex scenes.
Anyview: Generalizable Indoor 3D Object Detection with Variable Frames
results: Extensive experiments on the ScanNet dataset show both high detection accuracy and strong generalizability, with a parameter count similar to the baselines.
Abstract
In this paper, we propose a novel network framework for indoor 3D object detection to handle variable input frame numbers in practical scenarios. Existing methods only consider fixed frames of input data for a single detector, such as monocular RGB-D images or point clouds reconstructed from dense multi-view RGB-D images. While in practical application scenes such as robot navigation and manipulation, the raw input to the 3D detectors is the RGB-D images with variable frame numbers instead of the reconstructed scene point cloud. However, the previous approaches can only handle fixed frame input data and have poor performance with variable frame input. In order to facilitate 3D object detection methods suitable for practical tasks, we present a novel 3D detection framework named AnyView for our practical applications, which generalizes well across different numbers of input frames with a single model. To be specific, we propose a geometric learner to mine the local geometric features of each input RGB-D image frame and implement local-global feature interaction through a designed spatial mixture module. Meanwhile, we further utilize a dynamic token strategy to adaptively adjust the number of extracted features for each frame, which ensures consistent global feature density and further enhances the generalization after fusion. Extensive experiments on the ScanNet dataset show our method achieves both great generalizability and high detection accuracy with a simple and clean architecture containing a similar amount of parameters with the baselines.
results: Training examples exhibit surprisingly diverse memorisation trajectories across model sizes: most samples experience decreased memorisation under larger models, while the rest show cap-shaped or increasing memorisation. Various proxies for the Feldman memorisation score fail to capture these fundamental trends. Finally, knowledge distillation tends to inhibit memorisation while also improving generalisation.
Abstract
The success of modern neural networks has prompted study of the connection between memorisation and generalisation: overparameterised models generalise well, despite being able to perfectly fit (memorise) completely random labels. To carefully study this issue, Feldman proposed a metric to quantify the degree of memorisation of individual training examples, and empirically computed the corresponding memorisation profile of a ResNet on image classification benchmarks. While an exciting first glimpse into what real-world models memorise, this leaves open a fundamental question: do larger neural models memorise more? We present a comprehensive empirical analysis of this question on image classification benchmarks. We find that training examples exhibit an unexpectedly diverse set of memorisation trajectories across model sizes: most samples experience decreased memorisation under larger models, while the rest exhibit cap-shaped or increasing memorisation. We show that various proxies for the Feldman memorization score fail to capture these fundamental trends. Lastly, we find that knowledge distillation, an effective and popular model compression technique, tends to inhibit memorisation, while also improving generalisation. Specifically, memorisation is mostly inhibited on examples with increasing memorisation trajectories, thus pointing at how distillation improves generalisation.
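For reference, the Feldman-style memorisation estimate underlying this analysis can be computed by training many models on random subsets and comparing each example's accuracy between models that saw it and models that did not. The sketch below assumes such training runs have already been logged; the subset rate and array names are illustrative.

```python
import numpy as np

def memorisation_scores(in_subset, correct):
    """in_subset: (M, N) bool, model m was trained on example i;
    correct: (M, N) bool, model m classifies example i correctly.
    Returns an (N,) array of memorisation estimates."""
    in_subset = in_subset.astype(float)
    correct = correct.astype(float)
    p_in = (correct * in_subset).sum(0) / np.maximum(in_subset.sum(0), 1)
    out = 1.0 - in_subset
    p_out = (correct * out).sum(0) / np.maximum(out.sum(0), 1)
    return p_in - p_out  # accuracy gain from having seen the example

rng = np.random.default_rng(0)
mask = rng.random((50, 10)) < 0.7   # 50 runs, 70% subsampling, 10 examples
acc = rng.random((50, 10)) < 0.8    # toy per-run correctness records
print(memorisation_scores(mask, acc))
```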
GReAT: A Graph Regularized Adversarial Training Method
results: More effective than state-of-the-art methods, improving both the robustness and the generalization of the model.
Abstract
This paper proposes a regularization method called GReAT, Graph Regularized Adversarial Training, to improve deep learning models' classification performance. Adversarial examples are a well-known challenge in machine learning, where small, purposeful perturbations to input data can mislead models. Adversarial training, a powerful and one of the most effective defense strategies, involves training models with both regular and adversarial examples. However, it often neglects the underlying structure of the data. In response, we propose GReAT, a method that leverages data graph structure to enhance model robustness. GReAT deploys the graph structure of the data into the adversarial training process, resulting in more robust models that better generalize its testing performance and defend against adversarial attacks. Through extensive evaluation on benchmark datasets, we demonstrate GReAT's effectiveness compared to state-of-the-art classification methods, highlighting its potential in improving deep learning models' classification performance.
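A hedged sketch of one way graph regularization can be combined with adversarial training: cross-entropy on clean and FGSM-perturbed inputs plus a Laplacian-style smoothness term that pulls the predictions of graph-adjacent samples together. The perturbation method, loss weights, and graph construction below are illustrative assumptions, not GReAT's exact recipe.

```python
import torch
import torch.nn.functional as F

def great_loss(model, x, y, adj, eps=0.03, lam_adv=1.0, lam_graph=0.1):
    x = x.clone().requires_grad_(True)
    grad = torch.autograd.grad(F.cross_entropy(model(x), y), x)[0]
    x_adv = (x + eps * grad.sign()).detach()          # FGSM perturbation

    logits = model(x)
    p = F.softmax(logits, dim=-1)
    diff = p.unsqueeze(1) - p.unsqueeze(0)            # (B, B, C) pairwise gaps
    graph_reg = (adj * diff.pow(2).sum(-1)).sum() / adj.sum().clamp(min=1)

    return (F.cross_entropy(logits, y)
            + lam_adv * F.cross_entropy(model(x_adv), y)
            + lam_graph * graph_reg)

model = torch.nn.Linear(32, 5)
x, y = torch.randn(8, 32), torch.randint(0, 5, (8,))
adj = (torch.rand(8, 8) > 0.7).float()  # toy sample-similarity graph
great_loss(model, x, y, adj).backward()
```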
A Lightweight Video Anomaly Detection Model with Weak Supervision and Adaptive Instance Selection
paper_authors: Yang Wang, Jiaogen Zhou, Jihong Guan
For: This paper focuses on weakly supervised video anomaly detection, which is useful for effective and intelligent public safety management.
Methods: The proposed model uses an adaptive instance selection strategy and a lightweight multi-level temporal correlation attention module, which reduces the number of model parameters to 0.56% of existing methods (e.g. RTFM).
Results: The proposed model achieves comparable or even superior AUC scores compared to state-of-the-art methods on two public datasets (UCF-Crime and ShanghaiTech), with a significantly reduced number of model parameters, making it suitable for resource-limited scenarios such as edge computing.
Abstract
Video anomaly detection is to determine whether there are any abnormal events, behaviors or objects in a given video, which enables effective and intelligent public safety management. As video anomaly labeling is both time-consuming and expensive, most existing works employ unsupervised or weakly supervised learning methods. This paper focuses on weakly supervised video anomaly detection, in which the training videos are labeled whether or not they contain any anomalies, but there is no information about which frames the anomalies are located. However, the uncertainty of weakly labeled data and the large model size prevent existing methods from wide deployment in real scenarios, especially the resource-limit situations such as edge-computing. In this paper, we develop a lightweight video anomaly detection model. On the one hand, we propose an adaptive instance selection strategy, which is based on the model's current status to select confident instances, thereby mitigating the uncertainty of weakly labeled data and subsequently promoting the model's performance. On the other hand, we design a lightweight multi-level temporal correlation attention module and an hourglass-shaped fully connected layer to construct the model, which can reduce the model parameters to only 0.56\% of the existing methods (e.g. RTFM). Our extensive experiments on two public datasets UCF-Crime and ShanghaiTech show that our model can achieve comparable or even superior AUC score compared to the state-of-the-art methods, with a significantly reduced number of model parameters.
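A minimal sketch of confidence-based instance selection for weakly supervised training: from each video labeled anomalous, only the snippets the current model scores most confidently are kept for the next update, while normal videos contribute all snippets. The top-k rule and threshold below are illustrative; the paper's adaptive strategy is tied to the model's current training status.

```python
import torch

def select_instances(snippet_scores, is_anomalous, k=3, tau=0.5):
    """snippet_scores: (B, T) current anomaly scores in [0, 1]."""
    selected = []
    for scores, anom in zip(snippet_scores, is_anomalous):
        if anom:
            top = scores.topk(k)
            keep = top.indices[top.values > tau]   # only confident snippets
        else:
            keep = torch.arange(scores.numel())    # normal videos: keep all
        selected.append(keep)
    return selected

scores = torch.rand(2, 32)
print(select_instances(scores, [True, False]))
```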
Edge Computing-Enabled Road Condition Monitoring: System Development and Evaluation
For: Real-time pavement condition monitoring for highway agencies, to inform pavement maintenance and rehabilitation policies.
Methods: Utilizes affordable MEMS sensors, edge computing, and deployable machine learning models to stream live pavement condition data and reduce latency.
Results: Demonstrates high accuracy in predicting the International Roughness Index and classifying pavement segments based on ride quality, achieving an average classification accuracy of 96.76% on I-70EB and 63.15% on South Providence, with the potential to provide real-time data to State Highway Agencies and Departments of Transportation.
Abstract
Real-time pavement condition monitoring provides highway agencies with timely and accurate information that could form the basis of pavement maintenance and rehabilitation policies. Existing technologies rely heavily on manual data processing, are expensive and therefore, difficult to scale for frequent, networklevel pavement condition monitoring. Additionally, these systems require sending large packets of data to the cloud which requires large storage space, are computationally expensive to process, and results in high latency. The current study proposes a solution that capitalizes on the widespread availability of affordable Micro Electro-Mechanical System (MEMS) sensors, edge computing and internet connection capabilities of microcontrollers, and deployable machine learning (ML) models to (a) design an Internet of Things (IoT)-enabled device that can be mounted on axles of vehicles to stream live pavement condition data (b) reduce latency through on-device processing and analytics of pavement condition sensor data before sending to the cloud servers. In this study, three ML models including Random Forest, LightGBM and XGBoost were trained to predict International Roughness Index (IRI) at every 0.1-mile segment. XGBoost had the highest accuracy with an RMSE and MAPE of 16.89in/mi and 20.3%, respectively. In terms of the ability to classify the IRI of pavement segments based on ride quality according to MAP-21 criteria, our proposed device achieved an average accuracy of 96.76% on I-70EB and 63.15% on South Providence. Overall, our proposed device demonstrates significant potential in providing real-time pavement condition data to State Highway Agencies (SHA) and Department of Transportation (DOTs) with a satisfactory level of accuracy.
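A hedged sketch of the regression setup: per-segment features derived from the MEMS accelerometer stream go in, IRI comes out. The synthetic data, feature construction, and hyperparameters below are placeholders, not the study's configuration.

```python
import numpy as np
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))   # e.g. statistics of vertical acceleration per segment
y = 60 + 10 * X[:, 0] + rng.normal(scale=5, size=1000)  # synthetic IRI (in/mi)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = XGBRegressor(n_estimators=300, max_depth=6, learning_rate=0.05)
model.fit(X_tr, y_tr)
rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
print(f"RMSE: {rmse:.2f} in/mi")
```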
Understanding the Feature Norm for Out-of-Distribution Detection
methods: We explain this phenomenon by scrutinizing the discriminative structures concealed in the intermediate layers of a network. We find that: (1) the feature norm acts as the confidence of a classifier hidden in the network layer, specifically its maximum logit; (2) the feature norm is class-agnostic and can therefore detect OOD samples across diverse discriminative models; (3) the conventional feature norm fails to capture the activation and deactivation tendencies of hidden-layer neurons, which may lead to ID samples being misidentified as OOD.
results: We propose a novel negative-aware norm (NAN) that captures both the activation and deactivation tendencies of hidden-layer neurons and is compatible with existing OOD detectors. Extensive experiments demonstrate the efficacy and reliability of NAN, including in label-free environments.
Abstract
A neural network trained on a classification dataset often exhibits a higher vector norm of hidden layer features for in-distribution (ID) samples, while producing relatively lower norm values on unseen instances from out-of-distribution (OOD). Despite this intriguing phenomenon being utilized in many applications, the underlying cause has not been thoroughly investigated. In this study, we demystify this very phenomenon by scrutinizing the discriminative structures concealed in the intermediate layers of a neural network. Our analysis leads to the following discoveries: (1) The feature norm is a confidence value of a classifier hidden in the network layer, specifically its maximum logit. Hence, the feature norm distinguishes OOD from ID in the same manner that a classifier confidence does. (2) The feature norm is class-agnostic, thus it can detect OOD samples across diverse discriminative models. (3) The conventional feature norm fails to capture the deactivation tendency of hidden layer neurons, which may lead to misidentification of ID samples as OOD instances. To resolve this drawback, we propose a novel negative-aware norm (NAN) that can capture both the activation and deactivation tendencies of hidden layer neurons. We conduct extensive experiments on NAN, demonstrating its efficacy and compatibility with existing OOD detectors, as well as its capability in label-free environments.
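To make the scores concrete, the sketch below contrasts a plain feature-norm OOD score (activation side only) with a negative-aware variant that also credits the deactivation side of the penultimate pre-activations. The exact NAN definition follows the paper; the simple combination shown here is an illustrative assumption.

```python
import torch

def feature_norm_score(pre_act):
    return torch.relu(pre_act).norm(dim=-1)        # activation strength only

def negative_aware_score(pre_act, alpha=1.0):
    pos = torch.relu(pre_act).norm(dim=-1)         # activation strength
    neg = torch.relu(-pre_act).norm(dim=-1)        # deactivation strength
    # illustrative combination: ID features should both activate strongly
    # and deactivate cleanly
    return pos - alpha * neg

z = torch.randn(4, 512)  # penultimate pre-activations
print(feature_norm_score(z), negative_aware_score(z))
```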
results: A carefully designed tokenization scheme improves masked number prediction performance without requiring significant changes to the language model architecture.
Abstract
Despite recent successes in language models, their ability to represent numbers is insufficient. Humans conceptualize numbers based on their magnitudes, effectively projecting them on a number line; whereas subword tokenization fails to explicitly capture magnitude by splitting numbers into arbitrary chunks. To alleviate this shortcoming, alternative approaches have been proposed that modify numbers at various stages of the language modeling pipeline. These methods change either the (1) notation in which numbers are written (\eg scientific vs decimal), the (2) vocabulary used to represent numbers or the entire (3) architecture of the underlying language model, to directly regress to a desired number. Previous work suggests that architectural change helps achieve state-of-the-art on number estimation but we find an insightful ablation: changing the model's vocabulary instead (\eg introduce a new token for numbers in range 10-100) is a far better trade-off. In the context of masked number prediction, a carefully designed tokenization scheme is both the simplest to implement and sufficient, \ie with similar performance to the state-of-the-art approach that requires making significant architectural changes. Finally, we report similar trends on the downstream task of numerical fact estimation (for Fermi Problems) and discuss reasons behind our findings.
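A minimal sketch of the vocabulary-level change discussed above: rewrite each literal number as a coarse magnitude-range token (e.g. one token covering 10-100) before subword tokenization, so magnitude is explicit to the model. The token names and bucketing rule are illustrative.

```python
import math
import re

def bucket_numbers(text):
    def repl(m):
        v = float(m.group())
        if v == 0:
            return "[NUM_0]"
        exp = int(math.floor(math.log10(abs(v))))
        return f"[NUM_1e{exp}]"  # e.g. 42 -> [NUM_1e1], i.e. the 10-100 bucket
    return re.sub(r"\d+(\.\d+)?", repl, text)

print(bucket_numbers("The tower is 324 m tall and weighs 10100 tonnes."))
# -> "The tower is [NUM_1e2] m tall and weighs [NUM_1e4] tonnes."
```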
Look-Up mAI GeMM: Increasing AI GeMMs Performance by Nearly 2.5x via msGeMM
results: The results show that the proposed msGeMM algorithm can run AI models with low-precision datatypes using roughly 2.5x fewer multiplication and add instructions, improving the efficiency of training and inference on GPUs such as those from NVIDIA and AMD.
Abstract
AI models are increasing in size and recent advancement in the community has shown that unlike HPC applications where double precision datatype are required, lower-precision datatypes such as fp8 or int4 are sufficient to bring the same model quality both for training and inference. Following these trends, GPU vendors such as NVIDIA and AMD have added hardware support for fp16, fp8 and int8 GeMM operations with an exceptional performance via Tensor Cores. However, this paper proposes a new algorithm called msGeMM which shows that AI models with low-precision datatypes can run with ~2.5x fewer multiplication and add instructions. Efficient implementation of this algorithm requires special CUDA cores with the ability to add elements from a small look-up table at the rate of Tensor Cores.
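A toy numpy illustration of the look-up principle: with int4 operands there are only 16 possible values per side, so all 256 products can be precomputed once and a GeMM reduces to table look-ups plus additions. This sketches the idea only; the paper targets a CUDA implementation with special cores that add look-up-table entries at Tensor Core rate.

```python
import numpy as np

def lut_gemm_int4(A, B):
    """A: (M, K), B: (K, N), both with values in [-8, 7] (the int4 range)."""
    lut = np.outer(np.arange(-8, 8), np.arange(-8, 8))  # 16x16 product table
    # index the table instead of multiplying, then reduce with adds only
    C = lut[A[:, :, None] + 8, B[None, :, :] + 8]       # (M, K, N) look-ups
    return C.sum(axis=1)                                # (M, N)

rng = np.random.default_rng(0)
A = rng.integers(-8, 8, size=(4, 16))
B = rng.integers(-8, 8, size=(16, 3))
assert np.array_equal(lut_gemm_int4(A, B), A @ B)  # matches the real GeMM
```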
Factual and Personalized Recommendations using Language Models and Reinforcement Learning
results: Experiments on the MovieLens 25M dataset validate that the method delivers compelling, personalized movie narratives and accurately captures user preferences.
Abstract
Recommender systems (RSs) play a central role in connecting users to content, products, and services, matching candidate items to users based on their preferences. While traditional RSs rely on implicit user feedback signals, conversational RSs interact with users in natural language. In this work, we develop a comPelling, Precise, Personalized, Preference-relevant language model (P4LM) that recommends items to users while putting emphasis on explaining item characteristics and their relevance. P4LM uses the embedding space representation of a user's preferences to generate compelling responses that are factually-grounded and relevant w.r.t. the user's preferences. Moreover, we develop a joint reward function that measures precision, appeal, and personalization, which we use as AI-based feedback in a reinforcement learning-based language model framework. Using the MovieLens 25M dataset, we demonstrate that P4LM delivers compelling, personalized movie narratives to users.
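A hedged sketch of a joint reward of the kind described: separate scorers for precision, appeal, and personalization combined into one scalar used as AI feedback for RL fine-tuning. The scorer functions and weights below are stand-ins for learned models, not P4LM's actual reward.

```python
def joint_reward(response, user_profile, item_facts,
                 precision_fn, appeal_fn, personalization_fn,
                 w=(0.4, 0.3, 0.3)):
    r_p = precision_fn(response, item_facts)          # factually grounded?
    r_a = appeal_fn(response)                         # compelling to read?
    r_u = personalization_fn(response, user_profile)  # matches preferences?
    return w[0] * r_p + w[1] * r_a + w[2] * r_u

# toy stand-in scorers
reward = joint_reward("A gripping heist thriller...",
                      {"likes": ["thriller"]}, {"genre": "thriller"},
                      precision_fn=lambda r, f: 1.0,
                      appeal_fn=lambda r: 0.8,
                      personalization_fn=lambda r, u: 0.9)
print(reward)
```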
How does prompt engineering affect ChatGPT performance on unsupervised entity resolution?
results: The results show that prompting can significantly affect LLM performance, with some metrics more sensitive to prompt changes than others; the effect of a given prompting method can also be dataset-dependent.
Abstract
Entity Resolution (ER) is the problem of semi-automatically determining when two entities refer to the same underlying entity, with applications ranging from healthcare to e-commerce. Traditional ER solutions required considerable manual expertise, including feature engineering, as well as identification and curation of training data. In many instances, such techniques are highly dependent on the domain. With recent advent in large language models (LLMs), there is an opportunity to make ER much more seamless and domain-independent. However, it is also well known that LLMs can pose risks, and that the quality of their outputs can depend on so-called prompt engineering. Unfortunately, a systematic experimental study on the effects of different prompting methods for addressing ER, using LLMs like ChatGPT, has been lacking thus far. This paper aims to address this gap by conducting such a study. Although preliminary in nature, our results show that prompting can significantly affect the quality of ER, although it affects some metrics more than others, and can also be dataset dependent.
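For concreteness, one prompting setup such a study might compare serializes two records and asks for a strict yes/no match decision. The wording below is a single illustrative variant among the many that prompt engineering would sweep over; how the prompt is sent depends on the chat API used.

```python
def build_er_prompt(record_a, record_b):
    def serialize(r):
        return "; ".join(f"{k}: {v}" for k, v in r.items())
    return (
        "Do the following two records refer to the same real-world entity?\n"
        f"Record A: {serialize(record_a)}\n"
        f"Record B: {serialize(record_b)}\n"
        "Answer with exactly one word: Yes or No."
    )

a = {"name": "J. Smith", "city": "Boston", "phone": "617-555-0101"}
b = {"name": "John Smith", "city": "Boston, MA"}
print(build_er_prompt(a, b))
```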
Memory-Consistent Neural Networks for Imitation Learning
paper_authors: Kaustubh Sridhar, Souradeep Dutta, Dinesh Jayaraman, James Weimer, Insup Lee
For: Imitation learning applications, specifically addressing the problem of compounding errors in policy synthesis; the authors aim to learn from expert demonstrations and improve the performance of imitation policies.
Methods: The proposed "memory-consistent neural network" (MCNN) is a deep neural network designed to counter the compounding-error phenomenon. MCNN outputs are hard-constrained to stay within clearly specified permissible regions anchored to prototypical "memory" training samples, and the authors provide a guaranteed upper bound on the sub-optimality gap induced by MCNN policies.
Results: The authors test MCNN on 9 imitation learning tasks, spanning dexterous robotic manipulation and driving, proprioceptive and visual inputs, and varying sizes and types of demonstration data. They find large and consistent performance gains, validating that MCNN is better suited for imitation learning than vanilla deep neural networks.
Abstract
Imitation learning considerably simplifies policy synthesis compared to alternative approaches by exploiting access to expert demonstrations. For such imitation policies, errors away from the training samples are particularly critical. Even rare slip-ups in the policy action outputs can compound quickly over time, since they lead to unfamiliar future states where the policy is still more likely to err, eventually causing task failures. We revisit simple supervised ``behavior cloning'' for conveniently training the policy from nothing more than pre-recorded demonstrations, but carefully design the model class to counter the compounding error phenomenon. Our ``memory-consistent neural network'' (MCNN) outputs are hard-constrained to stay within clearly specified permissible regions anchored to prototypical ``memory'' training samples. We provide a guaranteed upper bound for the sub-optimality gap induced by MCNN policies. Using MCNNs on 9 imitation learning tasks, with MLP, Transformer, and Diffusion backbones, spanning dexterous robotic manipulation and driving, proprioceptive inputs and visual inputs, and varying sizes and types of demonstration data, we find large and consistent gains in performance, validating that MCNNs are better-suited than vanilla deep neural networks for imitation learning applications. Website: https://sites.google.com/view/mcnn-imitation
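As an illustration of the constraint mechanism the abstract describes, the sketch below clamps a network's action into a box around the nearest memory sample's action, with a radius that grows with distance from that memory. This is a simplified reading assuming a Euclidean nearest-neighbour lookup; the paper's exact MCNN construction and its bound differ in detail.

```python
import numpy as np

def mcnn_predict(x, net, mem_obs, mem_acts, radius_scale=0.1):
    """Clamp a network's action output into a permissible region
    anchored to the nearest 'memory' training sample (illustrative only)."""
    dists = np.linalg.norm(mem_obs - x, axis=1)
    i = int(np.argmin(dists))                    # nearest memory sample
    anchor, r = mem_acts[i], radius_scale * dists[i]
    raw = net(x)                                 # unconstrained prediction
    return np.clip(raw, anchor - r, anchor + r)  # stay near the anchor
```

Near a memory sample the policy is forced to reproduce the expert action, and the permissible region only widens as the state moves away from the demonstrations, which is what limits compounding error.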
Predictable Artificial Intelligence
paper_authors: Lexin Zhou, Pablo A. Moreno-Casares, Fernando Martínez-Plumed, John Burden, Ryan Burnell, Lucy Cheke, Cèsar Ferri, Alexandru Marcoci, Behzad Mehrbakhsh, Yael Moros-Daval, Seán Ó hÉigeartaigh, Danaja Rutar, Wout Schellaert, Konstantinos Voudouris, José Hernández-Orallo
for: This paper introduces the fundamental ideas and challenges of Predictable AI, a nascent research area.
methods: The paper articulates the questions, hypotheses, and challenges relevant to Predictable AI and calls on developers to pursue paths towards AI predictability.
results: The paper argues that progress on AI predictability is key to building trustworthy, liable, controllable, aligned, and safe AI ecosystems, and should therefore be prioritised over performance.
Abstract
We introduce the fundamental ideas and challenges of Predictable AI, a nascent research area that explores the ways in which we can anticipate key indicators of present and future AI ecosystems. We argue that achieving predictability is crucial for fostering trust, liability, control, alignment and safety of AI ecosystems, and thus should be prioritised over performance. While distinctive from other areas of technical and non-technical AI research, the questions, hypotheses and challenges relevant to Predictable AI were yet to be clearly described. This paper aims to elucidate them, calls for identifying paths towards AI predictability and outlines the potential impact of this emergent field.
CAW-coref: Conjunction-Aware Word-level Coreference Resolution
methods: This paper proposes a simple yet effective fix for handling conjoined mentions in word-level coreference models, improving the F1 score on the OntoNotes test set.
results: The fix improves the F1 score on the OntoNotes test set by 0.9%, shrinking the gap between efficient word-level coreference resolution and expensive SOTA approaches by 34.6%.
Abstract
State-of-the-art coreference resolution systems depend on multiple LLM calls per document and are thus prohibitively expensive for many use cases (e.g., information extraction with large corpora). The leading word-level coreference system (WL-coref) attains 96.6% of these SOTA systems' performance while being much more efficient. In this work, we identify a routine yet important failure case of WL-coref: dealing with conjoined mentions such as 'Tom and Mary'. We offer a simple yet effective solution that improves the performance on the OntoNotes test set by 0.9% F1, shrinking the gap between efficient word-level coreference resolution and expensive SOTA approaches by 34.6%. Our Conjunction-Aware Word-level coreference model (CAW-coref) and code are available at https://github.com/KarelDO/wl-coref.
Understanding Transfer Learning and Gradient-Based Meta-Learning Techniques
results: The results show that finetuning performs better on tasks from the same data distribution as training, while MAML and Reptile underperform on tasks from different data distributions. The study also finds that MAML and Reptile specialize for fast adaptation in low-data regimes with distributions similar to training, and that the features learned by the finetuning baseline are more diverse and discriminative than those learned by MAML and Reptile.
Abstract
Deep neural networks can yield good performance on various tasks but often require large amounts of data to train them. Meta-learning received considerable attention as one approach to improve the generalization of these networks from a limited amount of data. Whilst meta-learning techniques have been observed to be successful at this in various scenarios, recent results suggest that when evaluated on tasks from a different data distribution than the one used for training, a baseline that simply finetunes a pre-trained network may be more effective than more complicated meta-learning techniques such as MAML, which is one of the most popular meta-learning techniques. This is surprising as the learning behaviour of MAML mimics that of finetuning: both rely on re-using learned features. We investigate the observed performance differences between finetuning, MAML, and another meta-learning technique called Reptile, and show that MAML and Reptile specialize for fast adaptation in low-data regimes of similar data distribution as the one used for training. Our findings show that both the output layer and the noisy training conditions induced by data scarcity play important roles in facilitating this specialization for MAML. Lastly, we show that the pre-trained features as obtained by the finetuning baseline are more diverse and discriminative than those learned by MAML and Reptile. Due to this lack of diversity and distribution specialization, MAML and Reptile may fail to generalize to out-of-distribution tasks whereas finetuning can fall back on the diversity of the learned features.
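For reference, the Reptile baseline compared above uses a simple outer update, theta <- theta + epsilon * (phi - theta), where phi is obtained by a few inner optimization steps on a sampled task. A minimal NumPy sketch of that standard formulation:

```python
import numpy as np

def reptile_step(theta, adapt_on_task, epsilon=0.1, inner_steps=5):
    """One Reptile outer update: adapt a copy of the weights on a sampled
    task, then move the initialization toward the adapted weights."""
    phi = theta.copy()
    for _ in range(inner_steps):
        phi = adapt_on_task(phi)            # one inner SGD step on the task
    return theta + epsilon * (phi - theta)  # interpolate toward adapted weights
```

Finetuning, by contrast, simply continues gradient descent from the pre-trained weights on the new task, with no bi-level structure at all.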
Reinforcement Learning in the Era of LLMs: What is Essential? What is needed? An RL Perspective on RLHF, Prompting, and Beyond
for: This paper connects conventional RL research to the RL techniques used in LLM research, explaining the advantages of RL in LLMs and where it applies.
methods: The paper analyzes RLHF, framing it as online inverse RL with offline demonstration data, and compares it with SFT.
results: The paper argues that RLHF is preferable to SFT because it alleviates the compounding-error problem of imitating demonstration data. RLHF insights can also be applied to other LLM tasks, such as prompt evaluation and optimization, where feedback is likewise expensive; however, policy learning in RLHF is more challenging due to high action dimensionality and sparse feedback.
Abstract
Recent advancements in Large Language Models (LLMs) have garnered wide attention and led to successful products such as ChatGPT and GPT-4. Their proficiency in adhering to instructions and delivering harmless, helpful, and honest (3H) responses can largely be attributed to the technique of Reinforcement Learning from Human Feedback (RLHF). In this paper, we aim to link the research in conventional RL to RL techniques used in LLM research. We demystify this technique by discussing why, when, and how RL excels. Furthermore, we explore potential future avenues that could either benefit from or contribute to RLHF research. Highlighted Takeaways: 1. RLHF is Online Inverse RL with Offline Demonstration Data. 2. RLHF $>$ SFT because Imitation Learning (and Inverse RL) $>$ Behavior Cloning (BC) by alleviating the problem of compounding error. 3. The RM step in RLHF generates a proxy of the expensive human feedback; such an insight can be generalized to other LLM tasks such as prompting evaluation and optimization where feedback is also expensive. 4. The policy learning in RLHF is more challenging than conventional problems studied in IRL due to their high action dimensionality and feedback sparsity. 5. The main superiority of PPO over off-policy value-based methods is its stability gained from (almost) on-policy data and conservative policy updates.
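Takeaway 3 concerns the reward model as a proxy for expensive human feedback. Such reward models are conventionally trained with a Bradley-Terry pairwise loss over preference data; a minimal sketch of that standard formulation (not code from this paper):

```python
import torch.nn.functional as F

def pairwise_rm_loss(rm, prompt, chosen, rejected):
    """Bradley-Terry loss: push the reward of the human-preferred response
    above the rejected one. `rm(prompt, response)` returns a scalar score."""
    r_chosen = rm(prompt, chosen)
    r_rejected = rm(prompt, rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

Once trained, the reward model replaces the human annotator in the PPO loop, which is what makes the overall pipeline affordable.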
Layout Sequence Prediction From Noisy Mobile Modality
results: Our model achieves SOTA results in randomly obstructed and extremely short input experiments, demonstrating that it can accurately predict pedestrian bounding box trajectories and that sensor data from mobile phones can be used to predict pedestrian movement in the real world.
Abstract
Trajectory prediction plays a vital role in understanding pedestrian movement for applications such as autonomous driving and robotics. Current trajectory prediction models depend on long, complete, and accurately observed sequences from visual modalities. Nevertheless, real-world situations often involve obstructed cameras, missed objects, or objects out of sight due to environmental factors, leading to incomplete or noisy trajectories. To overcome these limitations, we propose LTrajDiff, a novel approach that treats objects obstructed or out of sight as equally important as those with fully visible trajectories. LTrajDiff utilizes sensor data from mobile phones to surmount out-of-sight constraints, albeit introducing new challenges such as modality fusion, noisy data, and the absence of spatial layout and object size information. We employ a denoising diffusion model to predict precise layout sequences from noisy mobile data using a coarse-to-fine diffusion strategy, incorporating the RMS, Siamese Masked Encoding Module, and MFM. Our model predicts layout sequences by implicitly inferring object size and projection status from a single reference timestamp or significantly obstructed sequences. Achieving SOTA results in randomly obstructed experiments and extremely short input experiments, our model illustrates the effectiveness of leveraging noisy mobile data. In summary, our approach offers a promising solution to the challenges faced by layout sequence and trajectory prediction models in real-world settings, paving the way for utilizing sensor data from mobile phones to accurately predict pedestrian bounding box trajectories. To the best of our knowledge, this is the first work that addresses severely obstructed and extremely short layout sequences by combining vision with noisy mobile modality, making it the pioneering work in the field of layout sequence trajectory prediction.
Learning Layer-wise Equivariances Automatically using Gradients
paper_authors: Tycho F. A. van der Ouderaa, Alexander Immer, Mark van der Wilk
for: Improving the generalisation performance of neural networks by adapting them to the symmetries of the input data.
methods: Layer-wise equivariances and the associated weight connectivity structures are learned automatically from data using gradients.
results: On image classification tasks, automatically learned layer-wise equivariances achieve performance equivalent to or better than hard-coded symmetries.
Abstract
Convolutions encode equivariance symmetries into neural networks leading to better generalisation performance. However, symmetries provide fixed hard constraints on the functions a network can represent, need to be specified in advance, and can not be adapted. Our goal is to allow flexible symmetry constraints that can automatically be learned from data using gradients. Learning symmetry and associated weight connectivity structures from scratch is difficult for two reasons. First, it requires efficient and flexible parameterisations of layer-wise equivariances. Secondly, symmetries act as constraints and are therefore not encouraged by training losses measuring data fit. To overcome these challenges, we improve parameterisations of soft equivariance and learn the amount of equivariance in layers by optimising the marginal likelihood, estimated using differentiable Laplace approximations. The objective balances data fit and model complexity enabling layer-wise symmetry discovery in deep networks. We demonstrate the ability to automatically learn layer-wise equivariances on image classification tasks, achieving equivalent or improved performance over baselines with hard-coded symmetry.
On Time Domain Conformer Models for Monaural Speech Separation in Noisy Reverberant Acoustic Environments
results: For realistic shorter signal lengths, TD-Conformers are more efficient when controlling for feature dimension. The authors propose subsampling layers to further improve computational efficiency. The best TD-Conformer achieves 14.6 dB and 21.2 dB SISDR improvement on the WHAMR and WSJ0-2Mix benchmarks, respectively.
Abstract
Speech separation remains an important topic for multi-speaker technology researchers. Convolution augmented transformers (conformers) have performed well for many speech processing tasks but have been under-researched for speech separation. Most recent state-of-the-art (SOTA) separation models have been time-domain audio separation networks (TasNets). A number of successful models have made use of dual-path (DP) networks which sequentially process local and global information. Time domain conformers (TD-Conformers) are an analogue of the DP approach in that they also process local and global context sequentially but have a different time complexity function. It is shown that for realistic shorter signal lengths, conformers are more efficient when controlling for feature dimension. Subsampling layers are proposed to further improve computational efficiency. The best TD-Conformer achieves 14.6 dB and 21.2 dB SISDR improvement on the WHAMR and WSJ0-2Mix benchmarks, respectively.
Text-driven Prompt Generation for Vision-Language Models in Federated Learning
results: Our experimental results show that the method outperforms existing federated prompt learning approaches on nine diverse image classification datasets, generalizing better to both seen and unseen classes and to unseen datasets.
Abstract
Prompt learning for vision-language models, e.g., CoOp, has shown great success in adapting CLIP to different downstream tasks, making it a promising solution for federated learning due to computational reasons. Existing prompt learning techniques replace hand-crafted text prompts with learned vectors that offer improvements on seen classes, but struggle to generalize to unseen classes. Our work addresses this challenge by proposing Federated Text-driven Prompt Generation (FedTPG), which learns a unified prompt generation network across multiple remote clients in a scalable manner. The prompt generation network is conditioned on task-related text input, thus is context-aware, making it suitable to generalize for both seen and unseen classes. Our comprehensive empirical evaluations on nine diverse image classification datasets show that our method is superior to existing federated prompt learning methods, that achieve overall better generalization on both seen and unseen classes and is also generalizable to unseen datasets.
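A minimal sketch of a text-conditioned prompt generator in the spirit described: task-related text embeddings are mapped to a sequence of soft prompt vectors. The layer sizes and the MLP design here are assumptions for illustration, not FedTPG's actual architecture.

```python
import torch
import torch.nn as nn

class PromptGenerator(nn.Module):
    """Map a task-description embedding to n_ctx soft prompt vectors."""
    def __init__(self, text_dim=512, n_ctx=4, ctx_dim=512):
        super().__init__()
        self.n_ctx, self.ctx_dim = n_ctx, ctx_dim
        self.mlp = nn.Sequential(
            nn.Linear(text_dim, 512), nn.ReLU(),
            nn.Linear(512, n_ctx * ctx_dim),
        )

    def forward(self, task_text_emb):                   # (batch, text_dim)
        ctx = self.mlp(task_text_emb)                   # (batch, n_ctx*ctx_dim)
        return ctx.view(-1, self.n_ctx, self.ctx_dim)   # soft prompt tokens
```

Because the prompts are generated from task text rather than stored as fixed learned vectors, the same shared generator can produce sensible prompts for classes and datasets never seen during federated training.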
Exploring Progress in Multivariate Time Series Forecasting: Comprehensive Benchmarking and Heterogeneity Analysis
results: The study finds that existing MTS forecasting models perform very differently across datasets with different temporal and spatial characteristics. BasicTS helps researchers select and design suitable MTS forecasting models and provides reproducible performance and efficiency comparisons of many models.
Abstract
Multivariate Time Series (MTS) widely exist in real-world complex systems, such as traffic and energy systems, making their forecasting crucial for understanding and influencing these systems. Recently, deep learning-based approaches have gained much popularity for effectively modeling temporal and spatial dependencies in MTS, specifically in Long-term Time Series Forecasting (LTSF) and Spatial-Temporal Forecasting (STF). However, the fair benchmarking issue and the choice of technical approaches have been hotly debated in related work. Such controversies significantly hinder our understanding of progress in this field. Thus, this paper aims to address these controversies to present insights into advancements achieved. To resolve benchmarking issues, we introduce BasicTS, a benchmark designed for fair comparisons in MTS forecasting. BasicTS establishes a unified training pipeline and reasonable evaluation settings, enabling an unbiased evaluation of over 30 popular MTS forecasting models on more than 18 datasets. Furthermore, we highlight the heterogeneity among MTS datasets and classify them based on temporal and spatial characteristics. We further prove that neglecting heterogeneity is the primary reason for generating controversies in technical approaches. Moreover, based on the proposed BasicTS and rich heterogeneous MTS datasets, we conduct an exhaustive and reproducible performance and efficiency comparison of popular models, providing insights for researchers in selecting and designing MTS forecasting models.
Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models
results: Step-Back Prompting improves PaLM-2L performance on a variety of challenging reasoning-intensive tasks: by 7% and 11% on MMLU Physics and Chemistry, by 27% on TimeQA, and by 7% on the multi-hop reasoning benchmark MuSiQue.
Abstract
We present Step-Back Prompting, a simple prompting technique that enables LLMs to do abstractions to derive high-level concepts and first principles from instances containing specific details. Using the concepts and principles to guide the reasoning steps, LLMs significantly improve their abilities in following a correct reasoning path towards the solution. We conduct experiments of Step-Back Prompting with PaLM-2L models and observe substantial performance gains on a wide range of challenging reasoning-intensive tasks including STEM, Knowledge QA, and Multi-Hop Reasoning. For instance, Step-Back Prompting improves PaLM-2L performance on MMLU Physics and Chemistry by 7% and 11%, TimeQA by 27%, and MuSiQue by 7%.
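Operationally, Step-Back Prompting is a two-stage flow: elicit the abstraction first, then condition the final reasoning on it. A minimal sketch with a hypothetical `call_llm` client and illustrative prompt wording (the paper's actual prompts differ):

```python
def step_back_answer(question: str, call_llm) -> str:
    """Two-stage Step-Back Prompting: abstract first, then reason."""
    # Stage 1: derive the higher-level concept/principle behind the question.
    step_back_q = call_llm(
        f"Here is a question: {question}\n"
        "What is the more general concept or principle behind it? "
        "State it as a single step-back question."
    )
    principles = call_llm(step_back_q)
    # Stage 2: answer the original question, guided by the principles.
    return call_llm(
        f"Principles: {principles}\n"
        f"Using these principles, answer step by step: {question}"
    )
```

The abstraction step is what distinguishes this from plain chain-of-thought: the model retrieves the governing principle (e.g., a physics law) before working through the instance-specific details.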
OptiMUS: Optimization Modeling Using MIP Solvers and large language models
results: Experiments show that OptiMUS solves nearly twice as many optimization problems as a basic LLM prompting strategy.
Abstract
Optimization problems are pervasive across various sectors, from manufacturing and distribution to healthcare. However, most such problems are still solved heuristically by hand rather than optimally by state-of-the-art solvers, as the expertise required to formulate and solve these problems limits the widespread adoption of optimization tools and techniques. We introduce OptiMUS, a Large Language Model (LLM)-based agent designed to formulate and solve MILP problems from their natural language descriptions. OptiMUS is capable of developing mathematical models, writing and debugging solver code, developing tests, and checking the validity of generated solutions. To benchmark our agent, we present NLP4LP, a novel dataset of linear programming (LP) and mixed integer linear programming (MILP) problems. Our experiments demonstrate that OptiMUS solves nearly twice as many problems as a basic LLM prompting strategy. OptiMUS code and NLP4LP dataset are available at https://github.com/teshnizi/OptiMUS
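For context on the kind of solver code such an agent must produce, here is a small MILP written with the PuLP modeling library. This is a generic toy example of the target output format, not a problem from NLP4LP or code generated by OptiMUS.

```python
from pulp import LpProblem, LpMaximize, LpVariable, value

# Maximize profit 3x + 5y for two products under shared resource limits.
prob = LpProblem("production_plan", LpMaximize)
x = LpVariable("x", lowBound=0, cat="Integer")
y = LpVariable("y", lowBound=0, cat="Integer")
prob += 3 * x + 5 * y        # objective: total profit
prob += 2 * x + 4 * y <= 14  # raw material budget
prob += x + y <= 5           # machine capacity
prob.solve()
print(value(x), value(y), value(prob.objective))  # optimal: x=3, y=2, profit=19
```

The agent's harder job is everything around this block: extracting the variables, objective, and constraints from prose, and then testing and debugging the generated model.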
Learning Interactive Real-World Simulators
results: Experiments show that high-level vision-language planners and low-level reinforcement learning policies trained purely in UniSim exhibit zero-shot transfer to the real world. Video captioning models also benefit from training with simulated experience in UniSim, broadening its range of applications.
Abstract
Generative models trained on internet data have revolutionized how text, image, and video content can be created. Perhaps the next milestone for generative models is to simulate realistic experience in response to actions taken by humans, robots, and other interactive agents. Applications of a real-world simulator range from controllable content creation in games and movies, to training embodied agents purely in simulation that can be directly deployed in the real world. We explore the possibility of learning a universal simulator (UniSim) of real-world interaction through generative modeling. We first make the important observation that natural datasets available for learning a real-world simulator are often rich along different axes (e.g., abundant objects in image data, densely sampled actions in robotics data, and diverse movements in navigation data). With careful orchestration of diverse datasets, each providing a different aspect of the overall experience, UniSim can emulate how humans and agents interact with the world by simulating the visual outcome of both high-level instructions such as "open the drawer" and low-level controls such as "move by x, y" from otherwise static scenes and objects. There are numerous use cases for such a real-world simulator. As an example, we use UniSim to train both high-level vision-language planners and low-level reinforcement learning policies, each of which exhibit zero-shot real-world transfer after training purely in a learned real-world simulator. We also show that other types of intelligence such as video captioning models can benefit from training with simulated experience in UniSim, opening up even wider applications. Video demos can be found at https://universal-simulator.github.io.
When is Agnostic Reinforcement Learning Statistically Tractable?
results: The paper shows that there exists a policy class $\Pi$ with bounded spanning capacity that nevertheless requires a superpolynomial number of samples to learn online. It also proposes a new algorithm, POPLER, that achieves statistically efficient online RL under an additional sunflower structure.
Abstract
We study the problem of agnostic PAC reinforcement learning (RL): given a policy class $\Pi$, how many rounds of interaction with an unknown MDP (with a potentially large state and action space) are required to learn an $\epsilon$-suboptimal policy with respect to $\Pi$? Towards that end, we introduce a new complexity measure, called the \emph{spanning capacity}, that depends solely on the set $\Pi$ and is independent of the MDP dynamics. With a generative model, we show that for any policy class $\Pi$, bounded spanning capacity characterizes PAC learnability. However, for online RL, the situation is more subtle. We show there exists a policy class $\Pi$ with a bounded spanning capacity that requires a superpolynomial number of samples to learn. This reveals a surprising separation for agnostic learnability between generative access and online access models (as well as between deterministic/stochastic MDPs under online access). On the positive side, we identify an additional \emph{sunflower} structure, which in conjunction with bounded spanning capacity enables statistically efficient online RL via a new algorithm called POPLER, which takes inspiration from classical importance sampling methods as well as techniques for reachable-state identification and policy evaluation in reward-free exploration.
High Dimensional Causal Inference with Variational Backdoor Adjustment
results: Experimental results show that the method can estimate interventional likelihoods in high-dimensional settings and succeeds across a variety of high-dimensional applications.
Abstract
Backdoor adjustment is a technique in causal inference for estimating interventional quantities from purely observational data. For example, in medical settings, backdoor adjustment can be used to control for confounding and estimate the effectiveness of a treatment. However, high dimensional treatments and confounders pose a series of potential pitfalls: tractability, identifiability, optimization. In this work, we take a generative modeling approach to backdoor adjustment for high dimensional treatments and confounders. We cast backdoor adjustment as an optimization problem in variational inference without reliance on proxy variables and hidden confounders. Empirically, our method is able to estimate interventional likelihood in a variety of high dimensional settings, including semi-synthetic X-ray medical data. To the best of our knowledge, this is the first application of backdoor adjustment in which all the relevant variables are high dimensional.
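For reference, the classical backdoor adjustment that the paper scales to high dimensions marginalizes the confounder $z$ and can be approximated by Monte Carlo sampling:

$$p\big(y \mid \mathrm{do}(t)\big) = \int p(y \mid t, z)\, p(z)\, \mathrm{d}z \;\approx\; \frac{1}{N} \sum_{i=1}^{N} p\big(y \mid t, z^{(i)}\big), \qquad z^{(i)} \sim p(z)$$

The paper's contribution is making this computation tractable when $t$, $z$, and $y$ are all high dimensional, by casting the adjustment as an optimization problem in variational inference.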
Predictive auxiliary objectives in deep RL mimic learning in the brain
paper_authors: Ching Fang, Kimberly L Stachenfeld
for: This paper explores the use of predictive auxiliary objectives in deep reinforcement learning (RL) to support representation learning and improve task performance.
methods: The paper uses a deep RL system with self-supervised auxiliary objectives to study the effects of predictive learning on representation learning across different modules of the system.
results: The paper finds that predictive objectives improve and stabilize learning, particularly in resource-limited architectures, and identifies settings where longer predictive horizons better support representational transfer. Additionally, representational changes in the RL system bear a striking resemblance to changes in neural activity observed in the brain.
Abstract
The ability to predict upcoming events has been hypothesized to comprise a key aspect of natural and machine cognition. This is supported by trends in deep reinforcement learning (RL), where self-supervised auxiliary objectives such as prediction are widely used to support representation learning and improve task performance. Here, we study the effects predictive auxiliary objectives have on representation learning across different modules of an RL system and how these mimic representational changes observed in the brain. We find that predictive objectives improve and stabilize learning particularly in resource-limited architectures, and we identify settings where longer predictive horizons better support representational transfer. Furthermore, we find that representational changes in this RL system bear a striking resemblance to changes in neural activity observed in the brain across various experiments. Specifically, we draw a connection between the auxiliary predictive model of the RL system and hippocampus, an area thought to learn a predictive model to support memory-guided behavior. We also connect the encoder network and the value learning network of the RL system to visual cortex and striatum in the brain, respectively. This work demonstrates how representation learning in deep RL systems can provide an interpretable framework for modeling multi-region interactions in the brain. The deep RL perspective taken here also suggests an additional role of the hippocampus in the brain -- that of an auxiliary learning system that benefits representation learning in other regions.
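The general recipe studied here can be summarized as adding a self-supervised next-state prediction loss on a shared encoder to the usual RL objective. A minimal sketch, where the stop-gradient target, the module split, and the weighting coefficient are illustrative assumptions rather than the paper's exact setup:

```python
import torch
import torch.nn.functional as F

def total_loss(encoder, predictor, q_net, obs, action, next_obs,
               rl_loss_fn, aux_weight=1.0):
    """RL loss plus a predictive auxiliary objective on the shared encoder."""
    z = encoder(obs)
    rl_loss = rl_loss_fn(q_net(z), action)    # e.g., TD or policy loss
    pred_next = predictor(z, action)          # predict the next latent state
    with torch.no_grad():
        target = encoder(next_obs)            # stop-gradient prediction target
    aux_loss = F.mse_loss(pred_next, target)  # self-supervised prediction loss
    return rl_loss + aux_weight * aux_loss
```

The predictor module plays the role the paper assigns to the hippocampus: it is trained only on the auxiliary objective, yet its gradients shape the shared encoder's representations.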
Performative Time-Series Forecasting
results: Experimental results show that the FPS method effectively handles the challenges introduced by feedback loops and outperforms conventional time-series forecasting methods on COVID-19 and traffic forecasting tasks.
Abstract
Time-series forecasting is a critical challenge in various domains and has witnessed substantial progress in recent years. Many real-life scenarios, such as public health, economics, and social applications, involve feedback loops where predictions can influence the predicted outcome, subsequently altering the target variable's distribution. This phenomenon, known as performativity, introduces the potential for 'self-negating' or 'self-fulfilling' predictions. Despite extensive studies in classification problems across domains, performativity remains largely unexplored in the context of time-series forecasting from a machine-learning perspective. In this paper, we formalize performative time-series forecasting (PeTS), addressing the challenge of accurate predictions when performativity-induced distribution shifts are possible. We propose a novel approach, Feature Performative-Shifting (FPS), which leverages the concept of delayed response to anticipate distribution shifts and subsequently predicts targets accordingly. We provide theoretical insights suggesting that FPS can potentially lead to reduced generalization error. We conduct comprehensive experiments using multiple time-series models on COVID-19 and traffic forecasting tasks. The results demonstrate that FPS consistently outperforms conventional time-series forecasting methods, highlighting its efficacy in handling performativity-induced challenges.
Pain Forecasting using Self-supervised Learning and Patient Phenotyping: An attempt to prevent Opioid Addiction
results: Experimental results on five years of real-world data show that our models outperform state-of-the-art benchmarks and identify meaningful patient clusters that can be translated into actionable information for clinical decision-making.
Abstract
Sickle Cell Disease (SCD) is a chronic genetic disorder characterized by recurrent acute painful episodes. Opioids are often used to manage these painful episodes; the extent of their use in managing pain in this disorder is an issue of debate. The risk of addiction and side effects of these opioid treatments can often lead to more pain episodes in the future. Hence, it is crucial to forecast future patient pain trajectories to help patients manage their SCD to improve their quality of life without compromising their treatment. It is challenging to obtain many pain records to design forecasting models since it is mainly recorded by patients' self-report. Therefore, it is expensive and painful (due to the need for patient compliance) to solve pain forecasting problems in a purely supervised manner. In light of this challenge, we propose to solve the pain forecasting problem using self-supervised learning methods. Also, clustering such time-series data is crucial for patient phenotyping, anticipating patients' prognoses by identifying "similar" patients, and designing treatment guidelines tailored to homogeneous patient subgroups. Hence, we propose a self-supervised learning approach for clustering time-series data, where each cluster comprises patients who share similar future pain profiles. Experiments on five years of real-world datasets show that our models achieve superior performance over state-of-the-art benchmarks and identify meaningful clusters that can be translated into actionable information for clinical decision-making.
Augmenting Vision-Based Human Pose Estimation with Rotation Matrix
results: Our experiments show that an SVM with SGD optimization, combined with the rotation-matrix data augmentation method, achieves 96% accuracy in recognizing five physical activities, whereas the baseline without data augmentation reaches only 64%.
Abstract
Fitness applications are commonly used to monitor activities within the gym, but they often fail to automatically track indoor activities inside the gym. This study proposes a model that utilizes pose estimation combined with a novel data augmentation method, i.e., rotation matrix. We aim to enhance the classification accuracy of activity recognition based on pose estimation data. Through our experiments, we experiment with different classification algorithms along with image augmentation approaches. Our findings demonstrate that the SVM with SGD optimization, using data augmentation with the Rotation Matrix, yields the most accurate results, achieving a 96% accuracy rate in classifying five physical activities. Conversely, without implementing the data augmentation techniques, the baseline accuracy remains at a modest 64%.
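The augmentation itself is a plain 2D rotation of the estimated pose keypoints about their centroid; a minimal NumPy sketch (the sampled angle range is an assumption, not a value from the paper):

```python
import numpy as np

def rotate_keypoints(keypoints, max_deg=30.0):
    """Rotate (N, 2) pose keypoints by a random angle about their centroid."""
    theta = np.deg2rad(np.random.uniform(-max_deg, max_deg))
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])  # 2D rotation matrix
    center = keypoints.mean(axis=0)
    return (keypoints - center) @ rot.T + center
```

Applying such rotations during training exposes the classifier to body orientations that the original gym footage never contains, which is what drives the reported accuracy gain.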
results: Through practical case studies and comprehensive experiments, the paper demonstrates the benefits of employing GPTs for SoC security verification, including improved verification efficiency, broader coverage, and greater adaptability.
Abstract
As the ubiquity and complexity of system-on-chip (SoC) designs increase across electronic devices, the task of incorporating security into an SoC design flow poses significant challenges. Existing security solutions are inadequate to provide effective verification of modern SoC designs due to their limitations in scalability, comprehensiveness, and adaptability. On the other hand, Large Language Models (LLMs) are celebrated for their remarkable success in natural language understanding, advanced reasoning, and program synthesis tasks. Recognizing an opportunity, our research delves into leveraging the emergent capabilities of Generative Pre-trained Transformers (GPTs) to address the existing gaps in SoC security, aiming for a more efficient, scalable, and adaptable methodology. By integrating LLMs into the SoC security verification paradigm, we open a new frontier of possibilities and challenges to ensure the security of increasingly complex SoCs. This paper offers an in-depth analysis of existing works, showcases practical case studies, demonstrates comprehensive experiments, and provides useful promoting guidelines. We also present the achievements, prospects, and challenges of employing LLM in different SoC security verification tasks.
Generative ensemble deep learning severe weather prediction from a deterministic convection-allowing model
paper_authors: Yingkai Sha, Ryan A. Sobash, David John Gagne II
For: Developing an ensemble post-processing method for probabilistic prediction of severe weather (tornadoes, hail, and wind gusts) over the conterminous United States (CONUS).
Methods: The method combines conditional generative adversarial networks (CGANs) and a convolutional neural network (CNN) to post-process convection-allowing model (CAM) forecasts: the CGANs create synthetic ensemble members from deterministic CAM forecasts, and the CNN processes the outputs to estimate the probability of severe weather.
Results: Using a testing dataset of HRRR forecasts in 2021, the method produced skillful predictions with up to 20% Brier Skill Score (BSS) increases over other neural-network-based reference methods, and meaningful ensemble spreads that can distinguish good and bad forecasts despite being overconfident. The CGAN outputs behave similarly to a numerical ensemble, preserving inter-variable correlations and the contributions of influential predictors.
Abstract
An ensemble post-processing method is developed for the probabilistic prediction of severe weather (tornadoes, hail, and wind gusts) over the conterminous United States (CONUS). The method combines conditional generative adversarial networks (CGANs), a type of deep generative model, with a convolutional neural network (CNN) to post-process convection-allowing model (CAM) forecasts. The CGANs are designed to create synthetic ensemble members from deterministic CAM forecasts, and their outputs are processed by the CNN to estimate the probability of severe weather. The method is tested using High-Resolution Rapid Refresh (HRRR) 1--24 hr forecasts as inputs and Storm Prediction Center (SPC) severe weather reports as targets. The method produced skillful predictions with up to 20% Brier Skill Score (BSS) increases compared to other neural-network-based reference methods using a testing dataset of HRRR forecasts in 2021. For the evaluation of uncertainty quantification, the method is overconfident but produces meaningful ensemble spreads that can distinguish good and bad forecasts. The quality of CGAN outputs is also evaluated. Results show that the CGAN outputs behave similarly to a numerical ensemble; they preserved the inter-variable correlations and the contribution of influential predictors as in the original HRRR forecasts. This work provides a novel approach to post-process CAM output using neural networks that can be applied to severe weather prediction.
DyST: Towards Dynamic Neural Scene Representations on Real-World Videos
results: The model learns tangible latent representations of dynamic scenes that enable view generation with separate control over the camera and the scene content.
Abstract
Visual understanding of the world goes beyond the semantics and flat structure of individual images. In this work, we aim to capture both the 3D structure and dynamics of real-world scenes from monocular real-world videos. Our Dynamic Scene Transformer (DyST) model leverages recent work in neural scene representation to learn a latent decomposition of monocular real-world videos into scene content, per-view scene dynamics, and camera pose. This separation is achieved through a novel co-training scheme on monocular videos and our new synthetic dataset DySO. DyST learns tangible latent representations for dynamic scenes that enable view generation with separate control over the camera and the content of the scene.
Divide-and-Conquer Dynamics in AI-Driven Disempowerment
results: The model yields predictions about the causes and consequences of disunity among the current and future victims of AI-driven disempowerment, such as present-day artists, actors, and writers, and shows how division and myopia among stakeholders can enable a divide-and-conquer dynamic that leaves more people disempowered by AI.
Abstract
AI companies are attempting to create AI systems that outperform humans at most economically valuable work. Current AI models are already automating away the livelihoods of some artists, actors, and writers. But there is infighting between those who prioritize current harms and future harms. We construct a game-theoretic model of conflict to study the causes and consequences of this disunity. Our model also helps explain why throughout history, stakeholders sharing a common threat have found it advantageous to unite against it, and why the common threat has in turn found it advantageous to divide and conquer. Under realistic parameter assumptions, our model makes several predictions that find preliminary corroboration in the historical-empirical record. First, current victims of AI-driven disempowerment need the future victims to realize that their interests are also under serious and imminent threat, so that future victims are incentivized to support current victims in solidarity. Second, the movement against AI-driven disempowerment can become more united, and thereby more likely to prevail, if members believe that their efforts will be successful as opposed to futile. Finally, the movement can better unite and prevail if its members are less myopic. Myopic members prioritize their future well-being less than their present well-being, and are thus disinclined to solidarily support current victims today at personal cost, even if this is necessary to counter the shared threat of AI-driven disempowerment.
Grokking as Compression: A Nonlinear Complexity Perspective
results: The study finds that LMN cleanly characterizes neural network compression before generalization, relating linearly to test loss, whereas the $L_2$ norm relates to test loss in a complicated nonlinear way. The paper also uses LMN to explain grokking, the phenomenon in which generalization is much delayed after memorization.
Abstract
We attribute grokking, the phenomenon where generalization is much delayed after memorization, to compression. To do so, we define linear mapping number (LMN) to measure network complexity, which is a generalized version of linear region number for ReLU networks. LMN can nicely characterize neural network compression before generalization. Although the $L_2$ norm has been a popular choice for characterizing model complexity, we argue in favor of LMN for a number of reasons: (1) LMN can be naturally interpreted as information/computation, while $L_2$ cannot. (2) In the compression phase, LMN has linear relations with test losses, while $L_2$ is correlated with test losses in a complicated nonlinear way. (3) LMN also reveals an intriguing phenomenon of the XOR network switching between two generalization solutions, while $L_2$ does not. Besides explaining grokking, we argue that LMN is a promising candidate as the neural network version of the Kolmogorov complexity since it explicitly considers local or conditioned linear computations aligned with the nature of modern artificial neural networks.
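LMN generalizes the linear region number of ReLU networks. For intuition, that classical quantity can be estimated by counting distinct ReLU on/off activation patterns over a dataset, as sketched below; this estimates region usage only and is not the paper's full LMN definition.

```python
import numpy as np

def count_activation_patterns(inputs, weights, biases):
    """Estimate ReLU linear-region usage: each distinct ON/OFF pattern of
    hidden units corresponds to one linear piece of the network."""
    patterns = set()
    for x in inputs:
        h, bits = x, []
        for W, b in zip(weights, biases):
            pre = W @ h + b
            bits.extend((pre > 0).astype(int).tolist())  # ReLU on/off mask
            h = np.maximum(pre, 0)
        patterns.add(tuple(bits))
    return len(patterns)
```

Under this lens, compression shows up as the network reusing fewer distinct linear pieces across the training set, which is the behavior the paper tracks through the grokking transition.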
Interpreting CLIP’s Image Representation via Text-Based Decomposition
results: We find that attention heads in CLIP play property-specific roles (e.g., location or shape) and that image patches exhibit an emergent spatial localization. We use these insights to remove spurious features from CLIP and to build a strong zero-shot image segmenter.
Abstract
We investigate the CLIP image encoder by analyzing how individual model components affect the final representation. We decompose the image representation as a sum across individual image patches, model layers, and attention heads, and use CLIP's text representation to interpret the summands. Interpreting the attention heads, we characterize each head's role by automatically finding text representations that span its output space, which reveals property-specific roles for many heads (e.g. location or shape). Next, interpreting the image patches, we uncover an emergent spatial localization within CLIP. Finally, we use this understanding to remove spurious features from CLIP and to create a strong zero-shot image segmenter. Our results indicate that a scalable understanding of transformer models is attainable and can be used to repair and improve models.
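Because attention outputs enter the representation additively, each layer/head contribution can be scored against CLIP text directions. A minimal sketch, assuming the per-head contributions have already been extracted from the model (the extraction itself, and the handling of residual terms, are omitted here):

```python
import numpy as np

def head_text_scores(head_contribs, text_embs):
    """head_contribs: (n_layers, n_heads, d) per-head contributions whose sum,
    together with residual terms, gives the CLIP image representation.
    text_embs: (n_texts, d) CLIP text embeddings used to interpret each head."""
    L, H, d = head_contribs.shape
    flat = head_contribs.reshape(L * H, d)
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    texts = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return (flat @ texts.T).reshape(L, H, -1)  # cosine score per head, per text
```

Heads whose contributions consistently align with, say, location-describing texts are the ones the paper characterizes as having a location-specific role.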
FireAct: Toward Language Agent Fine-tuning
results: For example, fine-tuning Llama2-7B with 500 agent trajectories generated by GPT-4 yields a 77% HotpotQA performance increase. The paper also proposes FireAct, a novel approach to fine-tuning LMs with trajectories from multiple tasks and prompting methods, and shows that more diverse fine-tuning data further improves agents.
Abstract
Recent efforts have augmented language models (LMs) with external tools or environments, leading to the development of language agents that can reason and act. However, most of these agents rely on few-shot prompting techniques with off-the-shelf LMs. In this paper, we investigate and argue for the overlooked direction of fine-tuning LMs to obtain language agents. Using a setup of question answering (QA) with a Google search API, we explore a variety of base LMs, prompting methods, fine-tuning data, and QA tasks, and find language agents are consistently improved after fine-tuning their backbone LMs. For example, fine-tuning Llama2-7B with 500 agent trajectories generated by GPT-4 leads to a 77% HotpotQA performance increase. Furthermore, we propose FireAct, a novel approach to fine-tuning LMs with trajectories from multiple tasks and prompting methods, and show having more diverse fine-tuning data can further improve agents. Along with other findings regarding scaling effects, robustness, generalization, efficiency and cost, our work establishes comprehensive benefits of fine-tuning LMs for agents, and provides an initial set of experimental designs, insights, as well as open questions toward language agent fine-tuning.
SALMON: Self-Alignment with Principle-Following Reward Models
results: In experiments, the SALMON method was used to train an AI assistant named Dromedary-2, which outperforms state-of-the-art AI systems, including LLaMA-2-Chat-70b, on several benchmark datasets. Moreover, Dromedary-2 requires only 6 in-context learning exemplars and 31 human-defined principles rather than extensive human supervision.
Abstract
Supervised Fine-Tuning (SFT) on response demonstrations combined with Reinforcement Learning from Human Feedback (RLHF) constitutes a powerful paradigm for aligning LLM-based AI agents. However, a significant limitation of such an approach is its dependency on high-quality human annotations, making its application to intricate tasks challenging due to difficulties in obtaining consistent response demonstrations and in-distribution response preferences. This paper presents a novel approach, namely SALMON (Self-ALignMent with principle-fOllowiNg reward models), to align base language models with minimal human supervision, using only a small set of human-defined principles, yet achieving superior performance. Central to our approach is a principle-following reward model. Trained on synthetic preference data, this model can generate reward scores based on arbitrary human-defined principles. By merely adjusting these principles during the RL training phase, we gain full control over the preferences with the reward model, subsequently influencing the behavior of the RL-trained policies, and eliminating the reliance on the collection of online human preferences. Applying our method to the LLaMA-2-70b base language model, we developed an AI assistant named Dromedary-2. With only 6 exemplars for in-context learning and 31 human-defined principles, Dromedary-2 significantly surpasses the performance of several state-of-the-art AI systems, including LLaMA-2-Chat-70b, on various benchmark datasets. We have open-sourced the code and model weights to encourage further research into aligning LLM-based AI agents with enhanced supervision efficiency, improved controllability, and scalable oversight.
TAIL: Task-specific Adapters for Imitation Learning with Large Pretrained Models
results: Experimental results show that TAIL with LoRA achieves the best post-adaptation performance on large-scale language-conditioned manipulation tasks compared with other parameter-efficient fine-tuning techniques, using only 1% of the trainable parameters of full fine-tuning while avoiding catastrophic forgetting and preserving adaptation plasticity in continual learning settings.
Abstract
The full potential of large pretrained models remains largely untapped in control domains like robotics. This is mainly because of the scarcity of data and the computational challenges associated with training or fine-tuning these large models for such applications. Prior work mainly emphasizes effective pretraining of large models for decision-making, with little exploration into how to perform data-efficient continual adaptation of these models for new tasks. Recognizing these constraints, we introduce TAIL (Task-specific Adapters for Imitation Learning), a framework for efficient adaptation to new control tasks. Inspired by recent advancements in parameter-efficient fine-tuning in language domains, we explore efficient fine-tuning techniques -- e.g., Bottleneck Adapters, P-Tuning, and Low-Rank Adaptation (LoRA) -- in TAIL to adapt large pretrained models for new tasks with limited demonstration data. Our extensive experiments in large-scale language-conditioned manipulation tasks comparing prevalent parameter-efficient fine-tuning techniques and adaptation baselines suggest that TAIL with LoRA can achieve the best post-adaptation performance with only 1\% of the trainable parameters of full fine-tuning, while avoiding catastrophic forgetting and preserving adaptation plasticity in continual learning settings.
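Among the techniques compared, LoRA reparameterizes each adapted weight as a frozen base plus a low-rank update, W0 + (alpha/r) * BA, with only A and B trained. A minimal PyTorch sketch of that standard formulation (not TAIL-specific code):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W0 x + (alpha/r) * B A x, with W0 frozen and only A, B trained."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False           # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

Zero-initializing B means the adapter starts as an exact no-op on the pretrained model; training a separate small (A, B) pair per task is what lets TAIL keep earlier tasks intact in the continual setting.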
Lion Secretly Solves Constrained Optimization: As Lyapunov Predicts
paper_authors: Lizhang Chen, Bo Liu, Kaizhao Liang, Qiang Liu
for: This paper aims to explain the theoretical basis of the Lion optimizer.
methods: The paper explains the dynamics of the Lion optimizer via decoupled weight decay and a close analysis of its sign-momentum gradient updates.
results: The study finds that the Lion optimizer performs well when training large AI models and is more memory-efficient than AdamW. However, since Lion was not backed by any established theory, its potential for improvement and extension was limited. Through continuous-time and discrete-time analyses, this paper establishes the theoretical basis of Lion as a constrained optimizer.Abstract
Lion (Evolved Sign Momentum), a new optimizer discovered through program search, has shown promising results in training large AI models. It performs comparably or favorably to AdamW but with greater memory efficiency. As we can expect from the results of a random search program, Lion incorporates elements from several existing algorithms, including signed momentum, decoupled weight decay, Polak, and Nesterov momentum, but does not fit into any existing category of theoretically grounded optimizers. Thus, even though Lion appears to perform well as a general-purpose optimizer for a wide range of tasks, its theoretical basis remains uncertain. This lack of theoretical clarity limits opportunities to further enhance and expand Lion's efficacy. This work aims to demystify Lion. Based on both continuous-time and discrete-time analysis, we demonstrate that Lion is a theoretically novel and principled approach for minimizing a general loss function $f(x)$ while enforcing a bound constraint $\|x\|_\infty \leq 1/\lambda$. Lion achieves this through the incorporation of decoupled weight decay, where $\lambda$ represents the weight decay coefficient. Our analysis is made possible by the development of a new Lyapunov function for the Lion updates. It applies to a broader family of Lion-$\kappa$ algorithms, where the $\text{sign}(\cdot)$ operator in Lion is replaced by the subgradient of a convex function $\kappa$, leading to the solution of a general composite optimization problem of $\min_x f(x) + \kappa^*(x)$. Our findings provide valuable insights into the dynamics of Lion and pave the way for further improvements and extensions of Lion-related algorithms.
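For reference, the published Lion update that the analysis above studies fits in a few lines; a minimal sketch (the hyperparameter values shown are illustrative defaults):

    import numpy as np

    def lion_step(x, m, grad, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.1):
        """One Lion update: sign of an interpolated momentum, plus decoupled weight decay.
        Per the paper's analysis, the decay term enforces the bound ||x||_inf <= 1/weight_decay."""
        update = np.sign(beta1 * m + (1 - beta1) * grad)  # c_t = sign(beta1*m + (1-beta1)*g)
        x = x - lr * (update + weight_decay * x)          # decoupled weight decay
        m = beta2 * m + (1 - beta2) * grad                # momentum update
        return x, m

Note the memory advantage over AdamW: Lion keeps a single momentum buffer m rather than two moment estimates.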
Streaming Anchor Loss: Augmenting Supervision with Temporal Significance
results: Our experiments show that training streaming neural network models with SAL improves prediction accuracy and latency without any additional data or model parameters, achieving better results on three different speech-based detection tasks.Abstract
Streaming neural network models for fast frame-wise responses to various speech and sensory signals are widely adopted on resource-constrained platforms. Hence, increasing the learning capacity of such streaming models (i.e., by adding more parameters) to improve the predictive power may not be viable for real-world tasks. In this work, we propose a new loss, Streaming Anchor Loss (SAL), to better utilize the given learning capacity by encouraging the model to learn more from essential frames. More specifically, our SAL and its focal variations dynamically modulate the frame-wise cross entropy loss based on the importance of the corresponding frames so that a higher loss penalty is assigned for frames within the temporal proximity of semantically critical events. Therefore, our loss ensures that the model training focuses on predicting the relatively rare but task-relevant frames. Experimental results with standard lightweight convolutional and recurrent streaming networks on three different speech based detection tasks demonstrate that SAL enables the model to learn the overall task more effectively with improved accuracy and latency, without any additional data, model parameters, or architectural changes.
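The abstract does not spell out the exact weighting schedule, but the core idea, frame-wise cross entropy scaled up near semantically critical frames, can be sketched as follows (the Gaussian anchor-distance weighting here is an assumption for illustration):

    import numpy as np

    def streaming_anchor_loss(log_probs, labels, anchor_frames, sigma=5.0):
        """Frame-wise cross entropy, up-weighted near anchor (critical-event) frames.
        log_probs: (T, C) log class probabilities; labels: (T,) integer targets;
        anchor_frames: indices of semantically critical frames (assumed given)."""
        T = len(labels)
        t = np.arange(T)
        # weight each frame by proximity to its nearest anchor (hypothetical kernel choice)
        dist = np.min(np.abs(t[:, None] - np.asarray(anchor_frames)[None, :]), axis=1)
        weights = 1.0 + np.exp(-(dist ** 2) / (2 * sigma ** 2))
        ce = -log_probs[t, labels]            # per-frame cross entropy
        return float(np.sum(weights * ce) / np.sum(weights))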
A Meta-Learning Perspective on Transformers for Causal Language Modeling
results: Experiments show that the Transformer architecture delivers strong results on real-world data, and that the identified special characteristic can be found in the norms of token representations within the Transformer.Abstract
The Transformer architecture has become prominent in developing large causal language models. However, mechanisms to explain its capabilities are not well understood. Focused on the training process, here we establish a meta-learning view of the Transformer architecture when trained for the causal language modeling task, by explicating an inner optimization process that may happen within the Transformer. Further, from within the inner optimization, we discover and theoretically analyze a special characteristic of the norms of learned token representations within Transformer-based causal language models. Our analysis is supported by experiments conducted on pre-trained large language models and real-world data.
results: The results show that the WSINDy algorithm can efficiently capture the dynamics of the relevant degrees of freedom in Hamiltonian systems with approximate symmetries, and can deliver more accurate predictions in a range of practical applications. Physically relevant examples include coupled oscillator dynamics, the Hénon-Heiles system, and charged-particle dynamics.Abstract
The Weak-form Sparse Identification of Nonlinear Dynamics algorithm (WSINDy) has been demonstrated to offer coarse-graining capabilities in the context of interacting particle systems ( https://doi.org/10.1016/j.physd.2022.133406 ). In this work we extend this capability to the problem of coarse-graining Hamiltonian dynamics which possess approximate symmetries. Such approximate symmetries often lead to the existence of a Hamiltonian system of reduced dimension that may be used to efficiently capture the dynamics of the relevant degrees of freedom. Deriving such reduced systems, or approximating them numerically, is an ongoing challenge. We demonstrate that WSINDy can successfully identify this reduced Hamiltonian system in the presence of large perturbations imparted from both the inexact nature of the symmetry and extrinsic noise. This is significant in part due to the nontrivial means by which such systems are derived analytically. WSINDy naturally preserves the Hamiltonian structure by restricting to a trial basis of Hamiltonian vector fields, and the methodology is computational efficient, often requiring only a single trajectory to learn the full reduced Hamiltonian, and avoiding forward solves in the learning process. In this way, we argue that weak-form equation learning is particularly well-suited for Hamiltonian coarse-graining. Using nearly-periodic Hamiltonian systems as a prototypical class of systems with approximate symmetries, we show that WSINDy robustly identifies the correct leading-order reduced system of dimension $2(N-1)$ or $N$ from the original $(2N)$-dimensional system, upon observation of the relevant degrees of freedom. We provide physically relevant examples, namely coupled oscillator dynamics, the H\'enon-Heiles system for stellar motion within a galaxy, and the dynamics of charged particles.
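For orientation, the weak form that gives WSINDy its name converts a candidate dynamics $\dot{x} = f(x)$ into integral equations against compactly supported test functions $\phi$, so no derivative of the (possibly noisy) data is ever taken:

$$\int \phi(t)\,\dot{x}(t)\,dt \;=\; -\int \dot{\phi}(t)\,x(t)\,dt \;=\; \int \phi(t)\,f(x(t))\,dt,$$

where the first equality is integration by parts (boundary terms vanish by compact support). The coefficients of a library expansion of $f$ can then be regressed directly against the data $x(t)$.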
摘要
“弱形式简润识别非线性动力学算法(WSINDy)已经在互动粒子系统上显示出简润功能。在这个工作中,我们将这个功能扩展到具有约束的哈密顿动力学问题。这些约束通常导致一个简润的哈密顿系统,可以高效地捕捉相关的动力学度复。 derive这个简润系统或 numerically Approximate它是一个ongoing挑战。我们示出WSINDy可以成功地识别这个简润的哈密顿系统,甚至在大规模的干扰和随机变动下。这是由于WSINDy的方法自然地保留哈密顿结构,通过仅对实验基底的哈密顿 вектор场进行限制。此外,WSINDy的方法具有计算效率高,通常只需要一条轨道来学习全部简润哈密顿,而不需要前向 solves在学习过程中。因此,我们认为弱形式方程式学习特别适合哈密顿简润。使用nearly periodic哈密顿系统作为一个具有约束的系统,我们显示WSINDy可以坚定地识别原始(2N)-维系统中的(2N-1)维或N维简润系统,通过观察相关的动力学度复。我们提供了物理相关的例子,包括相互作用的振荡器动力学、Hénon-Heiles系统 для星系动力学和带电粒子的动力学。”
results: The paper argues that "Property X" characteristics make advanced AI systems dangerous and hard to control, and proposes indicators and governance interventions to assess and limit the development of systems with risky "Property X" characteristics.Abstract
Concerns around future dangers from advanced AI often centre on systems hypothesised to have intrinsic characteristics such as agent-like behaviour, strategic awareness, and long-range planning. We label this cluster of characteristics as "Property X". Most present AI systems are low in "Property X"; however, in the absence of deliberate steering, current research directions may rapidly lead to the emergence of highly capable AI systems that are also high in "Property X". We argue that "Property X" characteristics are intrinsically dangerous, and when combined with greater capabilities will result in AI systems for which safety and control is difficult to guarantee. Drawing on several scholars' alternative frameworks for possible AI research trajectories, we argue that most of the proposed benefits of advanced AI can be obtained by systems designed to minimise this property. We then propose indicators and governance interventions to identify and limit the development of systems with risky "Property X" characteristics.
ViCor: Bridging Visual Understanding and Commonsense Reasoning with Large Language Models
results: The approach is evaluated on two VCR benchmark datasets and outperforms other methods that do not require in-domain supervised fine-tuning.Abstract
In our work, we explore the synergistic capabilities of pre-trained vision-and-language models (VLMs) and large language models (LLMs) for visual commonsense reasoning (VCR). We categorize the problem of VCR into visual commonsense understanding (VCU) and visual commonsense inference (VCI). For VCU, which involves perceiving the literal visual content, pre-trained VLMs exhibit strong cross-dataset generalization. On the other hand, in VCI, where the goal is to infer conclusions beyond image content, VLMs face difficulties. We find that a baseline where VLMs provide perception results (image captions) to LLMs leads to improved performance on VCI. However, we identify a challenge with VLMs' passive perception, which often misses crucial context information, leading to incorrect or uncertain reasoning by LLMs. To mitigate this issue, we suggest a collaborative approach where LLMs, when uncertain about their reasoning, actively direct VLMs to concentrate on and gather relevant visual elements to support potential commonsense inferences. In our method, named ViCor, pre-trained LLMs serve as problem classifiers to analyze the problem category, VLM commanders to leverage VLMs differently based on the problem classification, and visual commonsense reasoners to answer the question. VLMs will perform visual recognition and understanding. We evaluate our framework on two VCR benchmark datasets and outperform all other methods that do not require in-domain supervised fine-tuning.
Dynamic value alignment through preference aggregation of multiple objectives
paper_authors: Marcin Korecki, Damian Dailisan, Cesare Carissimo
for: This work aims to develop ethical AI systems whose values align with human objectives.
methods: The approach uses a multiple-objective method to align with dynamically changing values, so that the RL algorithm can satisfy several objectives at once.
results: Applied to a simplified two-leg intersection control system, the method achieves overall performance improvements across three metrics (speeds, stops, and waits) while effectively integrating objectives with competing or conflicting actions.Abstract
The development of ethical AI systems is currently geared toward setting objective functions that align with human objectives. However, finding such functions remains a research challenge, while in RL, setting rewards by hand is a fairly standard approach. We present a methodology for dynamic value alignment, where the values that are to be aligned with are dynamically changing, using a multiple-objective approach. We apply this approach to extend Deep $Q$-Learning to accommodate multiple objectives and evaluate this method on a simplified two-leg intersection controlled by a switching agent.Our approach dynamically accommodates the preferences of drivers on the system and achieves better overall performance across three metrics (speeds, stops, and waits) while integrating objectives that have competing or conflicting actions.
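One standard way to realize this kind of multi-objective control is to keep one Q-estimate per objective and aggregate them with adjustable preference weights at action-selection time; a minimal sketch (linear scalarization is a generic choice, not necessarily the paper's aggregation rule):

    import numpy as np

    def select_action(q_values, preference_weights):
        """q_values: (num_objectives, num_actions) per-objective Q estimates.
        preference_weights: (num_objectives,) dynamically adjustable weights.
        Returns the action maximizing the weighted sum of objective values."""
        scalarized = preference_weights @ q_values   # (num_actions,)
        return int(np.argmax(scalarized))

    q = np.array([[1.0, 0.2], [0.1, 0.9], [0.5, 0.5]])   # e.g. speed, stops, waits (illustrative)
    print(select_action(q, np.array([0.5, 0.3, 0.2])))

Because the weights can change between episodes, the controller can be re-aligned to shifting driver preferences without retraining the per-objective estimates from scratch.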
HyperAttention: Long-context Attention in Near-Linear Time
paper_authors: Insu Han, Rajesh Jayaram, Amin Karbasi, Vahab Mirrokni, David P. Woodruff, Amir Zandieh
For: Improving the computational efficiency of long contexts in Large Language Models (LLMs).* Methods: Introduces two fine-grained parameters that measure (1) the max column norm of the normalized attention matrix and (2) the ratio of row norms of the unnormalized attention matrix after detecting and removing large entries; these parameters capture the hardness of the problem.* Results: HyperAttention is faster than existing methods, provides a linear-time sampling algorithm, and adapts to different context lengths. Empirically, HyperAttention performs well across context lengths; for example, at a 32k context length it makes ChatGLM2 inference 50% faster while perplexity rises only from 5.6 to 6.3, and at larger context lengths (e.g., 131k with causal masking) it offers a 5-fold speedup on a single attention layer.Abstract
We present an approximate attention mechanism named HyperAttention to address the computational challenges posed by the growing complexity of long contexts used in Large Language Models (LLMs). Recent work suggests that in the worst-case scenario, quadratic time is necessary unless the entries of the attention matrix are bounded or the matrix has low stable rank. We introduce two parameters which measure: (1) the max column norm in the normalized attention matrix, and (2) the ratio of row norms in the unnormalized attention matrix after detecting and removing large entries. We use these fine-grained parameters to capture the hardness of the problem. Despite previous lower bounds, we are able to achieve a linear time sampling algorithm even when the matrix has unbounded entries or a large stable rank, provided the above parameters are small. HyperAttention features a modular design that easily accommodates integration of other fast low-level implementations, particularly FlashAttention. Empirically, employing Locality Sensitive Hashing (LSH) to identify large entries, HyperAttention outperforms existing methods, giving significant speed improvements compared to state-of-the-art solutions like FlashAttention. We validate the empirical performance of HyperAttention on a variety of different long-context length datasets. For example, HyperAttention makes the inference time of ChatGLM2 50\% faster on 32k context length while perplexity increases from 5.6 to 6.3. On larger context length, e.g., 131k, with causal masking, HyperAttention offers 5-fold speedup on a single attention layer.
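To make the two hardness parameters concrete, here is a small NumPy sketch that computes them for a dense attention matrix (the "large entry" threshold and the exact reading of "ratio of row norms" are assumptions for illustration; the paper's definitions may differ in detail):

    import numpy as np

    def hardness_parameters(Q, K, threshold=0.9):
        """(1) max column norm of the row-normalized attention matrix D;
        (2) a row-norm ratio of the unnormalized matrix A after zeroing large entries."""
        A = np.exp(Q @ K.T)                       # unnormalized attention (fine for small sketches)
        D = A / A.sum(axis=1, keepdims=True)      # row-softmax-normalized attention
        max_col_norm = np.linalg.norm(D, axis=0).max()      # parameter (1)
        A_small = np.where(D > threshold, 0.0, A)           # detect and remove large entries
        r = np.linalg.norm(A_small, axis=1)
        row_norm_ratio = r.max() / max(r.min(), 1e-12)      # parameter (2), one plausible reading
        return max_col_norm, row_norm_ratio

When both quantities are small, the matrix is "easy" in the paper's sense, and the linear-time sampling scheme applies even without bounded entries or low stable rank.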
Generative quantum machine learning via denoising diffusion probabilistic models
results: QuDDPM can effectively learn correlated quantum noise models and the topological structure of nontrivial distributions of quantum data.Abstract
Deep generative models are key-enabling technology to computer vision, text generation and large language models. Denoising diffusion probabilistic models (DDPMs) have recently gained much attention due to their ability to generate diverse and high-quality samples in many computer vision tasks, as well as to incorporate flexible model architectures and relatively simple training scheme. Quantum generative models, empowered by entanglement and superposition, have brought new insight to learning classical and quantum data. Inspired by the classical counterpart, we propose the quantum denoising diffusion probabilistic models (QuDDPM) to enable efficiently trainable generative learning of quantum data. QuDDPM adopts sufficient layers of circuits to guarantee expressivity, while introduces multiple intermediate training tasks as interpolation between the target distribution and noise to avoid barren plateau and guarantee efficient training. We demonstrate QuDDPM's capability in learning correlated quantum noise model and learning topological structure of nontrivial distribution of quantum data.
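For orientation, the classical DDPM forward (noising) process that QuDDPM mirrors on quantum data is, in standard notation,

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big), \qquad x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I),$$

with $\bar\alpha_t = \prod_{s=1}^{t}(1-\beta_s)$; the reverse model is trained to undo the noise step by step, which is the role the intermediate interpolating training tasks play in QuDDPM.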
Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models
results: Under the AVEB benchmark, FAVOR achieves competitive single-modal performance and over 20% accuracy improvement on the video question-answering task. FAVOR also demonstrates video comprehension and reasoning abilities absent from other multimodal LLMs.Abstract
Audio-visual large language models (LLM) have drawn significant attention, yet the fine-grained combination of both input streams is rather under-explored, which is challenging but necessary for LLMs to understand general video inputs. To this end, a fine-grained audio-visual joint representation (FAVOR) learning framework for multimodal LLMs is proposed in this paper, which extends a text-based LLM to simultaneously perceive speech and audio events in the audio input stream and images or videos in the visual input stream, at the frame level. To fuse the audio and visual feature streams into joint representations and to align the joint space with the LLM input embedding space, we propose a causal Q-Former structure with a causal attention module to enhance the capture of causal relations of the audio-visual frames across time. An audio-visual evaluation benchmark (AVEB) is also proposed which comprises six representative single-modal tasks with five cross-modal tasks reflecting audio-visual co-reasoning abilities. While achieving competitive single-modal performance on audio, speech and image tasks in AVEB, FAVOR achieved over 20% accuracy improvements on the video question-answering task when fine-grained information or temporal causal reasoning is required. FAVOR, in addition, demonstrated remarkable video comprehension and reasoning abilities on tasks that are unprecedented by other multimodal LLMs. An interactive demo of FAVOR is available at https://github.com/BriansIDP/AudioVisualLLM.git, and the training code and model checkpoints will be released soon.
Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models
paper_authors: Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal
for: This work aims to improve zero- and few-shot performance on vision-language tasks by marrying large language models (LLMs) with vision encoders to obtain large vision-language models (LVLMs).
methods: The study uses a gradient-free framework named RepARe that extracts salient information from the image and, using the LLM as a captioner and reasoner, proposes modifications to the original question.
results: The study finds that RepARe improves zero-shot performance on vision-language tasks, with gains of 3.85% on VQAv2 and 6.41% on A-OKVQA. Moreover, using gold answers for oracle question candidate selection yields even larger gains.Abstract
An increasing number of vision-language tasks can be handled with little to no training, i.e., in a zero and few-shot manner, by marrying large language models (LLMs) to vision encoders, resulting in large vision-language models (LVLMs). While this has huge upsides, such as not requiring training data or custom architectures, how an input is presented to a LVLM can have a major impact on zero-shot model performance. In particular, inputs phrased in an underspecified way can result in incorrect answers due to factors like missing visual information, complex implicit reasoning, or linguistic ambiguity. Therefore, adding visually grounded information to the input as a preemptive clarification should improve model performance by reducing underspecification, e.g., by localizing objects and disambiguating references. Similarly, in the VQA setting, changing the way questions are framed can make them easier for models to answer. To this end, we present Rephrase, Augment and Reason (RepARe), a gradient-free framework that extracts salient details about the image using the underlying LVLM as a captioner and reasoner, in order to propose modifications to the original question. We then use the LVLM's confidence over a generated answer as an unsupervised scoring function to select the rephrased question most likely to improve zero-shot performance. Focusing on two visual question answering tasks, we show that RepARe can result in a 3.85% (absolute) increase in zero-shot performance on VQAv2 and a 6.41% point increase on A-OKVQA. Additionally, we find that using gold answers for oracle question candidate selection achieves a substantial gain in VQA accuracy by up to 14.41%. Through extensive analysis, we demonstrate that outputs from RepARe increase syntactic complexity, and effectively utilize vision-language interaction and the frozen language model in LVLMs.
results: Experiments show that SALT improves summary quality, performing especially well on medical-domain summarization. The paper also compares SALT with a conventional RLHF method (DPO) and finds that SALT performs better when applied to human-edit data.Abstract
Recent work has shown the promise of learning with human feedback paradigms to produce human-determined high-quality text. Existing works use human feedback to train large language models (LLMs) in general domain abstractive summarization and have obtained summary quality exceeding traditional likelihood training. In this paper, we focus on a less explored form of human feedback -- Human Edits. We propose Sequence Alignment (un)Likelihood Training (SALT), a novel technique to use both the human-edited and model-generated data together in the training loop. In addition, we demonstrate simulating Human Edits with ground truth summaries coming from existing training data -- Imitation edits, along with the model-generated summaries obtained after the training, to reduce the need for expensive human-edit data. In our experiments, we extend human feedback exploration from general domain summarization to medical domain summarization. Our results demonstrate the effectiveness of SALT in improving the summary quality with Human and Imitation Edits. Through additional experiments, we show that SALT outperforms the conventional RLHF method (designed for human preferences) -- DPO, when applied to human-edit data. We hope the evidence in our paper prompts researchers to explore, collect, and better use different human feedback approaches scalably.
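The likelihood/unlikelihood split at the heart of SALT can be sketched at the token level as follows (which tokens count as desirable comes from the sequence alignment between the human edit and the model output; the keep_mask here is treated as a given input):

    import numpy as np

    def salt_loss(token_log_probs, keep_mask, w_like=1.0, w_unlike=1.0):
        """token_log_probs: (T,) log p(y_t) under the model for a sequence.
        keep_mask: (T,) 1 for tokens the alignment marks as desirable (likelihood term),
        0 for tokens marked undesirable (unlikelihood term). Weights are illustrative."""
        p = np.exp(token_log_probs)
        like = -token_log_probs                       # -log p(y_t) pushes kept tokens up
        unlike = -np.log(np.clip(1.0 - p, 1e-8, 1))   # -log(1 - p(y_t)) pushes rejected tokens down
        return float(np.mean(keep_mask * w_like * like + (1 - keep_mask) * w_unlike * unlike))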
GraphLLM: Boosting Graph Reasoning Ability of Large Language Model
results: Experiments show that GraphLLM improves the accuracy of graph understanding and reasoning and reduces the required context by 96.45%.Abstract
The advancement of Large Language Models (LLMs) has remarkably pushed the boundaries towards artificial general intelligence (AGI), with their exceptional ability on understanding diverse types of information, including but not limited to images and audio. Despite this progress, a critical gap remains in empowering LLMs to proficiently understand and reason on graph data. Recent studies underscore LLMs' underwhelming performance on fundamental graph reasoning tasks. In this paper, we endeavor to unearth the obstacles that impede LLMs in graph reasoning, pinpointing the common practice of converting graphs into natural language descriptions (Graph2Text) as a fundamental bottleneck. To overcome this impediment, we introduce GraphLLM, a pioneering end-to-end approach that synergistically integrates graph learning models with LLMs. This synergy equips LLMs with the ability to proficiently interpret and reason on graph data, harnessing the superior expressive power of graph learning models. Our empirical evaluations across four fundamental graph reasoning tasks validate the effectiveness of GraphLLM. The results exhibit a substantial average accuracy enhancement of 54.44%, alongside a noteworthy context reduction of 96.45% across various graph reasoning tasks.
Predicting Accident Severity: An Analysis Of Factors Affecting Accident Severity Using Random Forest Model
results: The Random Forest model achieves accuracy above 80%; the six most important variables are wind speed, pressure, humidity, visibility, clear conditions, and cloud cover.Abstract
Road accidents have significant economic and societal costs, with a small number of severe accidents accounting for a large portion of these costs. Predicting accident severity can help in the proactive approach to road safety by identifying potential unsafe road conditions and taking well-informed actions to reduce the number of severe accidents. This study investigates the effectiveness of the Random Forest machine learning algorithm for predicting the severity of an accident. The model is trained on a dataset of accident records from a large metropolitan area and evaluated using various metrics. Hyperparameters and feature selection are optimized to improve the model's performance. The results show that the Random Forest model is an effective tool for predicting accident severity with an accuracy of over 80%. The study also identifies the top six most important variables in the model, which include wind speed, pressure, humidity, visibility, clear conditions, and cloud cover. The fitted model has an Area Under the Curve of 80%, a recall of 79.2%, a precision of 97.1%, and an F1 score of 87.3%. These results suggest that the proposed model has higher performance in explaining the target variable, which is the accident severity class. Overall, the study provides evidence that the Random Forest model is a viable and reliable tool for predicting accident severity and can be used to help reduce the number of fatalities and injuries due to road accidents in the United States
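A minimal version of the modeling pipeline described above might look like the following (the file path and column names are hypothetical placeholders, and the hyperparameters are illustrative, not the paper's tuned values):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    df = pd.read_csv("accidents.csv")  # hypothetical path to the accident records
    features = ["Wind_Speed", "Pressure", "Humidity", "Visibility", "Clear", "Cloud_Cover"]  # assumed names
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df["Severity"], test_size=0.2, random_state=42)

    model = RandomForestClassifier(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)
    print(classification_report(y_test, model.predict(X_test)))
    # rank variables by importance, as in the paper's top-six list
    print(sorted(zip(model.feature_importances_, features), reverse=True))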
Learning Language-guided Adaptive Hyper-modality Representation for Multimodal Sentiment Analysis
results: Achieves state-of-the-art performance on several popular datasets (e.g., MOSI, MOSEI, and CH-SIMS), and extensive ablations demonstrate the validity and necessity of our irrelevance/conflict suppression mechanism.Abstract
Though Multimodal Sentiment Analysis (MSA) proves effective by utilizing rich information from multiple sources (e.g., language, video, and audio), the potential sentiment-irrelevant and conflicting information across modalities may hinder the performance from being further improved. To alleviate this, we present Adaptive Language-guided Multimodal Transformer (ALMT), which incorporates an Adaptive Hyper-modality Learning (AHL) module to learn an irrelevance/conflict-suppressing representation from visual and audio features under the guidance of language features at different scales. With the obtained hyper-modality representation, the model can obtain a complementary and joint representation through multimodal fusion for effective MSA. In practice, ALMT achieves state-of-the-art performance on several popular datasets (e.g., MOSI, MOSEI and CH-SIMS) and an abundance of ablation demonstrates the validity and necessity of our irrelevance/conflict suppression mechanism.
paper_authors: Nicholas Kroeger, Dan Ley, Satyapriya Krishna, Chirag Agarwal, Himabindu Lakkaraju
for: This paper aims to study the effectiveness of large language models (LLMs) in explaining other predictive models.
methods: The paper proposes a novel framework that utilizes multiple prompting strategies, including perturbation-based ICL, prediction-based ICL, instruction-based ICL, and explanation-based ICL, to generate explanations for other models.
results: The paper demonstrates that LLM-generated explanations perform on par with state-of-the-art post hoc explainers, with an average accuracy of 72.19% in identifying the most important feature.Abstract
Large Language Models (LLMs) are increasingly used as powerful tools for a plethora of natural language processing (NLP) applications. A recent innovation, in-context learning (ICL), enables LLMs to learn new tasks by supplying a few examples in the prompt during inference time, thereby eliminating the need for model fine-tuning. While LLMs have been utilized in several applications, their applicability in explaining the behavior of other models remains relatively unexplored. Despite the growing number of new explanation techniques, many require white-box access to the model and/or are computationally expensive, highlighting a need for next-generation post hoc explainers. In this work, we present the first framework to study the effectiveness of LLMs in explaining other predictive models. More specifically, we propose a novel framework encompassing multiple prompting strategies: i) Perturbation-based ICL, ii) Prediction-based ICL, iii) Instruction-based ICL, and iv) Explanation-based ICL, with varying levels of information about the underlying ML model and the local neighborhood of the test sample. We conduct extensive experiments with real-world benchmark datasets to demonstrate that LLM-generated explanations perform on par with state-of-the-art post hoc explainers using their ability to leverage ICL examples and their internal knowledge in generating model explanations. On average, across four datasets and two ML models, we observe that LLMs identify the most important feature with 72.19% accuracy, opening up new frontiers in explainable artificial intelligence (XAI) to explore LLM-based explanation frameworks.
Rethinking Memory and Communication Cost for Efficient Large Language Model Training
methods: This work uses a fine-grained sharding strategy and a Hierarchical Overlapping Ring (HO-Ring) communication topology to reduce memory redundancy and communication cost and to improve training efficiency.
results: Experiments show that PaRO improves training throughput by 1.19x-2.50x over the SOTA method and achieves near-linear scalability. The HO-Ring algorithm improves communication efficiency by 36.5% over the traditional Ring algorithm.Abstract
Recently, various distributed strategies for large language model training have been proposed. However, these methods provided limited solutions for the trade-off between memory consumption and communication cost. In this paper, we rethink the impact of memory consumption and communication costs on the training speed of large language models, and propose a memory-communication balanced strategy set Partial Redundancy Optimizer (PaRO). PaRO provides comprehensive options which reduces the amount and frequency of inter-group communication with minor memory redundancy by fine-grained sharding strategy, thereby improving the training efficiency in various training scenarios. Additionally, we propose a Hierarchical Overlapping Ring (HO-Ring) communication topology to enhance communication efficiency between nodes or across switches in large language model training. Our experiments demonstrate that PaRO significantly improves training throughput by 1.19x-2.50x compared to the SOTA method and achieves a near-linear scalability. The HO-Ring algorithm improves communication efficiency by 36.5% compared to the traditional Ring algorithm.
DANet: Enhancing Small Object Detection through an Efficient Deformable Attention Network
results: The model's robustness and generalization capabilities are demonstrated on the NEU-DET and Pascal VOC datasets, with particularly strong performance in identifying various types of steel defects.Abstract
Efficient and accurate detection of small objects in manufacturing settings, such as defects and cracks, is crucial for ensuring product quality and safety. To address this issue, we proposed a comprehensive strategy by synergizing Faster R-CNN with cutting-edge methods. By combining Faster R-CNN with Feature Pyramid Network, we enable the model to efficiently handle multi-scale features intrinsic to manufacturing environments. Additionally, Deformable Net is used that contorts and conforms to the geometric variations of defects, bringing precision in detecting even the minuscule and complex features. Then, we incorporated an attention mechanism called Convolutional Block Attention Module in each block of our base ResNet50 network to selectively emphasize informative features and suppress less useful ones. After that we incorporated RoI Align, replacing RoI Pooling for finer region-of-interest alignment and finally the integration of Focal Loss effectively handles class imbalance, crucial for rare defect occurrences. The rigorous evaluation of our model on both the NEU-DET and Pascal VOC datasets underscores its robust performance and generalization capabilities. On the NEU-DET dataset, our model exhibited a profound understanding of steel defects, achieving state-of-the-art accuracy in identifying various defects. Simultaneously, when evaluated on the Pascal VOC dataset, our model showcases its ability to detect objects across a wide spectrum of categories within complex and small scenes.
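Of the components above, Focal Loss is the easiest to show compactly; its standard binary form, with the usual alpha/gamma hyperparameters (values shown are common defaults, not necessarily the paper's):

    import numpy as np

    def focal_loss(p, y, alpha=0.25, gamma=2.0):
        """Standard binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t).
        p: predicted probability of the positive class; y: 0/1 label.
        The (1 - p_t)^gamma factor down-weights easy examples, so rare defect
        classes dominate the gradient, addressing class imbalance."""
        p = np.clip(p, 1e-8, 1 - 1e-8)
        p_t = np.where(y == 1, p, 1 - p)
        alpha_t = np.where(y == 1, alpha, 1 - alpha)
        return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))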
Harmonic Self-Conditioned Flow Matching for Multi-Ligand Docking and Binding Site Design
results: The paper shows that HarmonicFlow surpasses existing generative docking processes in simplicity, generality, and performance, and can generate accurate binding structures. This structure modeling enables FlowSite to design binding sites substantially better than baseline approaches, providing the first general solution for binding site design.Abstract
A significant amount of protein function requires binding small molecules, including enzymatic catalysis. As such, designing binding pockets for small molecules has several impactful applications ranging from drug synthesis to energy storage. Towards this goal, we first develop HarmonicFlow, an improved generative process over 3D protein-ligand binding structures based on our self-conditioned flow matching objective. FlowSite extends this flow model to jointly generate a protein pocket's discrete residue types and the molecule's binding 3D structure. We show that HarmonicFlow improves upon the state-of-the-art generative processes for docking in simplicity, generality, and performance. Enabled by this structure modeling, FlowSite designs binding sites substantially better than baseline approaches and provides the first general solution for binding site design.
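For orientation, the generic flow-matching objective this family of models builds on (shown here with linear interpolation paths; this is the generic form, not necessarily HarmonicFlow's exact loss) regresses a velocity field onto straight-line interpolants between noise $x_0$ and data $x_1$:

$$x_t = (1-t)\,x_0 + t\,x_1, \qquad \mathcal{L}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1}\,\big\| v_\theta(x_t, t) - (x_1 - x_0) \big\|^2,$$

after which samples are generated by integrating $\dot{x} = v_\theta(x, t)$ from $t=0$ to $t=1$.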
Large-Scale OD Matrix Estimation with A Deep Learning Method
methods: Combines deep learning with numerical optimization algorithms, using a neural network to infer the OD matrix structure from probe traffic flows and guide the numerical optimization.
results: Provides a reliable, real-time traffic flow estimation method that does not depend on prior information and reduces engineering cost.Abstract
The estimation of origin-destination (OD) matrices is a crucial aspect of Intelligent Transport Systems (ITS). It involves adjusting an initial OD matrix by regressing the current observations like traffic counts of road sections (e.g., using least squares). However, the OD estimation problem lacks sufficient constraints and is mathematically underdetermined. To alleviate this problem, some researchers incorporate a prior OD matrix as a target in the regression to provide more structural constraints. However, this approach is highly dependent on the existing prior matrix, which may be outdated. Others add structural constraints through sensor data, such as vehicle trajectory and speed, which can reflect more current structural constraints in real-time. Our proposed method integrates deep learning and numerical optimization algorithms to infer matrix structure and guide numerical optimization. This approach combines the advantages of both deep learning and numerical optimization algorithms. The neural network(NN) learns to infer structural constraints from probe traffic flows, eliminating dependence on prior information and providing real-time performance. Additionally, due to the generalization capability of NN, this method is economical in engineering. We conducted tests to demonstrate the good generalization performance of our method on a large-scale synthetic dataset. Subsequently, we verified the stability of our method on real traffic data. Our experiments provided confirmation of the benefits of combining NN and numerical optimization.
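The underdetermined regression described above can be written generically (the notation here is assumed for illustration, not taken from the paper) as

$$\min_{x \ge 0}\ \|Ax - c\|_2^2 + \gamma\,\|x - x_{\text{prior}}\|_2^2,$$

where $x$ is the vectorized OD matrix, $c$ the observed link traffic counts, $A$ the assignment matrix mapping OD flows to counts, and $\gamma$ weights the (possibly outdated) prior matrix. The proposed method in effect replaces the prior term with structural constraints inferred by the neural network from probe traffic flows, which is what removes the dependence on prior information.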
A Review of the Ethics of Artificial Intelligence and its Applications in the United States
paper_authors: Esther Taiwo, Ahmed Akinsola, Edward Tella, Kolade Makinde, Mayowa Akinwande
For: The paper focuses on the ethical considerations of Artificial Intelligence (AI) in the United States, highlighting its impact on various sectors and entities, and the need for responsible and ethical AI practices.* Methods: The paper explores eleven fundamental ethical principles, including Transparency, Justice, Fairness, Equity, Non-Maleficence, Responsibility, Accountability, Privacy, Beneficence, Freedom, Autonomy, Trust, Dignity, Sustainability, and Solidarity, as a guiding framework for ethical AI development and deployment.* Results: The paper discusses the revolutionary impact of AI applications, such as Machine Learning, and explores various approaches used to implement AI ethics, addressing the growing concerns surrounding the inherent risks associated with the widespread use of AI.Abstract
This study is focused on the ethics of Artificial Intelligence and its application in the United States, the paper highlights the impact AI has in every sector of the US economy and multiple facets of the technological space and the resultant effect on entities spanning businesses, government, academia, and civil society. There is a need for ethical considerations as these entities are beginning to depend on AI for delivering various crucial tasks, which immensely influence their operations, decision-making, and interactions with each other. The adoption of ethical principles, guidelines, and standards of work is therefore required throughout the entire process of AI development, deployment, and usage to ensure responsible and ethical AI practices. Our discussion explores eleven fundamental 'ethical principles' structured as overarching themes. These encompass Transparency, Justice, Fairness, Equity, Non- Maleficence, Responsibility, Accountability, Privacy, Beneficence, Freedom, Autonomy, Trust, Dignity, Sustainability, and Solidarity. These principles collectively serve as a guiding framework, directing the ethical path for the responsible development, deployment, and utilization of artificial intelligence (AI) technologies across diverse sectors and entities within the United States. The paper also discusses the revolutionary impact of AI applications, such as Machine Learning, and explores various approaches used to implement AI ethics. This examination is crucial to address the growing concerns surrounding the inherent risks associated with the widespread use of artificial intelligence.
Put Your Money Where Your Mouth Is: Evaluating Strategic Planning and Execution of LLM Agents in an Auction Arena
paper_authors: Jiangjie Chen, Siyu Yuan, Rong Ye, Bodhisattwa Prasad Majumder, Kyle Richardson
for: evaluating the ability of Large Language Models (LLMs) to simulate human behavior in complex environments, specifically in auctions.
methods: using a novel simulation environment called AucArena to test the ability of state-of-the-art LLMs as bidding agents in controlled simulations.
results: LLMs demonstrate advanced reasoning skills and can manage budgets and adhere to long-term goals and priorities, but show considerable variability in capability and are occasionally surpassed by heuristic baselines and human agents, highlighting room for improvement in agent design and the value of simulation environments for testing and refining agent architectures.Abstract
Can Large Language Models (LLMs) simulate human behavior in complex environments? LLMs have recently been shown to exhibit advanced reasoning skills but much of NLP evaluation still relies on static benchmarks. Answering this requires evaluation environments that probe strategic reasoning in competitive, dynamic scenarios that involve long-term planning. We introduce AucArena, a novel simulation environment for evaluating LLMs within auctions, a setting chosen for being highly unpredictable and involving many skills related to resource and risk management, while also being easy to evaluate. We conduct several controlled simulations using state-of-the-art LLMs as bidding agents. We find that through simple prompting, LLMs do indeed demonstrate many of the skills needed for effectively engaging in auctions (e.g., managing budget, adhering to long-term goals and priorities), skills that we find can be sharpened by explicitly encouraging models to be adaptive and observe strategies in past auctions. These results are significant as they show the potential of using LLM agents to model intricate social dynamics, especially in competitive settings. However, we also observe considerable variability in the capabilities of individual LLMs. Notably, even our most advanced models (GPT-4) are occasionally surpassed by heuristic baselines and human agents, highlighting the potential for further improvements in the design of LLM agents and the important role that our simulation environment can play in further testing and refining agent architectures.
Language Model Beats Diffusion – Tokenizer is Key to Visual Generation
paper_authors: Lijun Yu, José Lezama, Nitesh B. Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Agrim Gupta, Xiuye Gu, Alexander G. Hauptmann, Boqing Gong, Ming-Hsuan Yang, Irfan Essa, David A. Ross, Lu Jiang
results: With the MAGVIT-v2 tokenizer, LLMs surpass diffusion models on image and video generation tasks, and the tokenizer also performs strongly on video compression and action recognition.Abstract
While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation. To effectively use LLMs for visual generation, one crucial component is the visual tokenizer that maps pixel-space inputs to discrete tokens appropriate for LLM learning. In this paper, we introduce MAGVIT-v2, a video tokenizer designed to generate concise and expressive tokens for both videos and images using a common token vocabulary. Equipped with this new tokenizer, we show that LLMs outperform diffusion models on standard image and video generation benchmarks including ImageNet and Kinetics. In addition, we demonstrate that our tokenizer surpasses the previously top-performing video tokenizer on two more tasks: (1) video compression comparable to the next-generation video codec (VCC) according to human evaluations, and (2) learning effective representations for action recognition tasks.
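At the core of visual tokenization is mapping continuous encoder features to discrete ids; a minimal sketch of the standard nearest-codebook-entry lookup (a generic baseline operation, not MAGVIT-v2's specific design; the codebook size and dimensions are illustrative):

    import numpy as np

    def quantize(z, codebook):
        """Map each continuous feature vector to the index of its nearest codebook entry.
        z: (N, D) encoder outputs; codebook: (K, D) learned code vectors.
        Returns discrete token ids (for the LLM) and their quantized embeddings."""
        d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K) squared distances
        tokens = d.argmin(axis=1)
        return tokens, codebook[tokens]

    z = np.random.randn(16, 8)
    tokens, zq = quantize(z, np.random.randn(1024, 8))
    print(tokens[:5])

The resulting token sequence is what lets a language model treat images and videos with the same next-token machinery it uses for text.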
The Program Testing Ability of Large Language Models for Code
methods: The paper performs thorough analyses of recent code LLMs in program testing, evaluating on benchmark datasets including HumanEval and MBPP.
results: The paper reveals several intriguing properties of LLMs in program testing and leverages generated test cases to improve the quality of synthesized programs, raising code pass rates.Abstract
Recent development of large language models (LLMs) for code like CodeX and CodeT5+ demonstrates tremendous promise in achieving code intelligence. Their ability of synthesizing code that completes a program for performing a pre-defined task has been intensively tested and verified on benchmark datasets including HumanEval and MBPP. Yet, evaluation of these LLMs from more perspectives (than just program synthesis) is also anticipated, considering their broad scope of applications in software engineering. In this paper, we explore the ability of LLMs for testing programs/code. By performing thorough analyses of recent LLMs for code in program testing, we show a series of intriguing properties of these models and demonstrate how program testing ability of LLMs can be improved. Following recent work which utilizes generated test cases to enhance program synthesis, we further leverage our findings in improving the quality of the synthesized programs and show +11.77% and +4.22% higher code pass rates on HumanEval+ comparing with the GPT-3.5-turbo baseline and the recent state-of-the-art, respectively.
STOPNet: Multiview-based 6-DoF Suction Detection for Transparent Objects on Production Lines
results: Compared with existing methods, our method shows better generalizability and higher performance in both simulation and real-world experiments, meeting practical industrial needs.Abstract
In this work, we present STOPNet, a framework for 6-DoF object suction detection on production lines, with a focus on but not limited to transparent objects, which is an important and challenging problem in robotic systems and modern industry. Current methods requiring depth input fail on transparent objects due to depth cameras' deficiency in sensing their geometry, while we proposed a novel framework to reconstruct the scene on the production line depending only on RGB input, based on multiview stereo. Compared to existing works, our method not only reconstructs the whole 3D scene in order to obtain high-quality 6-DoF suction poses in real time but also generalizes to novel environments, novel arrangements and novel objects, including challenging transparent objects, both in simulation and the real world. Extensive experiments in simulation and the real world show that our method significantly surpasses the baselines and has better generalizability, which caters to practical industrial needs.
Guiding Language Model Reasoning with Planning Tokens
results: Demonstrates notable accuracy improvements over plain chain-of-thought fine-tuning baselines on three math word problem datasets.Abstract
Large language models (LLMs) have recently attracted considerable interest for their ability to perform complex reasoning tasks, such as chain-of-thought reasoning. However, most of the existing approaches to enhance this ability rely heavily on data-driven methods, while neglecting the structural aspects of the model's reasoning capacity. We find that while LLMs can manage individual reasoning steps well, they struggle with maintaining consistency across an entire reasoning chain. To solve this, we introduce 'planning tokens' at the start of each reasoning step, serving as a guide for the model. These token embeddings are then fine-tuned along with the rest of the model parameters. Our approach requires a negligible increase in trainable parameters (just 0.001%) and can be applied through either full fine-tuning or a more parameter-efficient scheme. We demonstrate our method's effectiveness by applying it to three different LLMs, showing notable accuracy improvements across three math word problem datasets w.r.t. plain chain-of-thought fine-tuning baselines.
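In practice, adding such tokens to an off-the-shelf model is straightforward; a sketch with the Hugging Face API (the token names, their count, and the example formatting are placeholders; the paper's actual planning-token scheme may differ):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # hypothetical planning tokens, one per reasoning-step "type"
    planning_tokens = ["<plan_add>", "<plan_sub>", "<plan_mul>"]
    tokenizer.add_special_tokens({"additional_special_tokens": planning_tokens})
    model.resize_token_embeddings(len(tokenizer))  # new embeddings trained with the other parameters

    # training data then prefixes each reasoning step with its planning token, e.g.:
    text = "Q: 3+4*2? <plan_mul> 4*2=8. <plan_add> 3+8=11. A: 11"

Because only a handful of new embedding rows are introduced, the 0.001% figure for added trainable parameters is plausible even for large models.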
results: The study finds that a few token-pairs can often explain large fractions of an ST's predictions, but accurate prediction requires attending to the majority of tokens and parts of speech.Abstract
Despite the success of Siamese encoder models such as sentence transformers (ST), little is known about the aspects of inputs they pay attention to. A barrier is that their predictions cannot be attributed to individual features, as they compare two inputs rather than processing a single one. This paper derives a local attribution method for Siamese encoders by generalizing the principle of integrated gradients to models with multiple inputs. The solution takes the form of feature-pair attributions, and can be reduced to a token-token matrix for STs. Our method involves the introduction of integrated Jacobians and inherits the advantageous formal properties of integrated gradients: it accounts for the model's full computation graph and is guaranteed to converge to the actual prediction. A pilot study shows that in an ST few token-pairs can often explain large fractions of predictions, and it focuses on nouns and verbs. For accurate predictions, it however needs to attend to the majority of tokens and parts of speech.
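For orientation, the single-input integrated-gradients principle that the paper generalizes to input pairs is

$$\mathrm{IG}_i(x) = (x_i - \bar{x}_i)\int_0^1 \frac{\partial f\big(\bar{x} + \alpha\,(x - \bar{x})\big)}{\partial x_i}\,d\alpha,$$

computed against a baseline input $\bar{x}$; the pairwise generalization replaces the gradient with integrated Jacobians of the two encoder branches, yielding the feature-pair (for STs, token-token) attribution matrix described above.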
Based on What We Can Control Artificial Neural Networks
results: This work proposes a new analysis method that helps ensure the stability and efficiency of the learning process of ANNs. The method may also benefit the development of new optimisers and learning systems, especially when discerning which components adversely affect ANNs. Code: \url{https://github.com/RandomUserName2023/Control-ANNs}.Abstract
How can the stability and efficiency of Artificial Neural Networks (ANNs) be ensured through a systematic analysis method? This paper seeks to address that query. While numerous factors can influence the learning process of ANNs, utilizing knowledge from control systems allows us to analyze its system function and simulate system responses. Although the complexity of most ANNs is extremely high, we still can analyze each factor (e.g., optimiser, hyperparameters) by simulating their system response. This new method also can potentially benefit the development of new optimiser and learning system, especially when discerning which components adversely affect ANNs. Controlling ANNs can benefit from the design of optimiser and learning system, as (1) all optimisers act as controllers, (2) all learning systems operate as control systems with inputs and outputs, and (3) the optimiser should match the learning system. Please find codes: \url{https://github.com/RandomUserName2023/Control-ANNs}.
Abstractive Summarization of Large Document Collections Using GPT
results: Comparing ROUGE summary scores against the existing state-of-the-art systems BART, BRIO, PEGASUS, and MoCa, this work performs statistically equivalently to BART and PEGASUS on the CNN/Daily Mail test set and to BART on the Gigaword test set.Abstract
This paper proposes a method of abstractive summarization designed to scale to document collections instead of individual documents. Our approach applies a combination of semantic clustering, document size reduction within topic clusters, semantic chunking of a cluster's documents, GPT-based summarization and concatenation, and a combined sentiment and text visualization of each topic to support exploratory data analysis. Statistical comparison of our results to existing state-of-the-art systems BART, BRIO, PEGASUS, and MoCa using ROUGE summary scores showed statistically equivalent performance with BART and PEGASUS on the CNN/Daily Mail test dataset, and with BART on the Gigaword test dataset. This finding is promising since we view document collection summarization as more challenging than individual document summarization. We conclude with a discussion of how issues of scale are addressed and with directions for future work.
The potential of large language models for improving probability learning: A study on ChatGPT3.5 and first-year computer engineering students
paper_authors: Angel Udias, Antonio Alonso-Ayuso, Ignacio Sanchez, Sonia Hernandez, Maria Eugenia Castellanos, Raquel Montes Diez, Emilio Lopez Cano
For: The paper assesses the efficacy of ChatGPT in solving probability problems typically presented in introductory computer engineering exams.
Methods: The study uses a set of 23 probability exercises administered to students at Rey Juan Carlos University (URJC) in Madrid, and evaluates the responses produced by ChatGPT qualitatively, assigning grades based on the same criteria used for students.
Results: The results indicate that ChatGPT surpasses the average student in terms of phrasing, organization, and logical reasoning, and the model's performance remained consistent for both the Spanish and English versions of the exercises. However, ChatGPT encountered difficulties in executing basic numerical operations, which were overcome by requesting the solution in the form of an R script.
Abstract
In this paper, we assess the efficacy of ChatGPT (version Feb 2023), a large-scale language model, in solving probability problems typically presented in introductory computer engineering exams. Our study comprised a set of 23 probability exercises administered to students at Rey Juan Carlos University (URJC) in Madrid. The responses produced by ChatGPT were evaluated by a group of five statistics professors, who assessed them qualitatively and assigned grades based on the same criteria used for students. Our results indicate that ChatGPT surpasses the average student in terms of phrasing, organization, and logical reasoning. The model's performance remained consistent for both the Spanish and English versions of the exercises. However, ChatGPT encountered difficulties in executing basic numerical operations. Our experiments demonstrate that requesting ChatGPT to provide the solution in the form of an R script proved to be an effective approach for overcoming these limitations. In summary, our results indicate that ChatGPT surpasses the average student in solving probability problems commonly presented in introductory computer engineering exams. Nonetheless, the model exhibits limitations in reasoning around certain probability concepts. The model's ability to deliver high-quality explanations and illustrate solutions in any programming language, coupled with its performance in solving probability exercises, suggests that large language models have the potential to serve as learning assistants.
For: The paper aims to enhance the efficiency and speed of legal procedures by utilizing AI technology to help legal professionals analyze legal cases.
Methods: The paper uses open-sourced large language models to create arguments derived from the facts present in legal cases.
Results: The generated arguments from the best performing method have on average 63% overlap with the benchmark set gold standard annotations.
Abstract
The count of pending cases has shown an exponential rise across nations (e.g., with more than 10 million pending cases in India alone). The main issue lies in the fact that the number of cases submitted to the law system is far greater than the available number of legal professionals present in a country. Given this worldwide context, the utilization of AI technology has gained paramount importance to enhance the efficiency and speed of legal procedures. In this study we particularly focus on helping legal professionals in the process of analyzing a legal case. Our specific investigation delves into harnessing the generative capabilities of open-sourced large language models to create arguments derived from the facts present in legal cases. Experimental results show that the generated arguments from the best performing method have on average 63% overlap with the benchmark set gold standard annotations.
results: Tested on multiple large-scale meta-learning benchmarks, SAMA shows up to 1.7x/4.8x higher throughput and 2.0x/3.8x lower memory consumption on single-/multi-GPU setups compared to baseline meta-learning algorithms. SAMA-based data optimization also yields consistent improvements in text classification accuracy with the BERT and RoBERTa large language models, and achieves state-of-the-art results in both small- and large-scale data pruning on image classification tasks.
Abstract
Despite its flexibility to learn diverse inductive biases in machine learning programs, meta learning (i.e., learning to learn) has long been recognized to suffer from poor scalability due to its tremendous compute/memory costs, training instability, and a lack of efficient distributed training support. In this work, we focus on making scalable meta learning practical by introducing SAMA, which combines advances in both implicit differentiation algorithms and systems. Specifically, SAMA is designed to flexibly support a broad range of adaptive optimizers in the base level of meta learning programs, while reducing computational burden by avoiding explicit computation of second-order gradient information, and exploiting efficient distributed training techniques implemented for first-order gradients. Evaluated on multiple large-scale meta learning benchmarks, SAMA showcases up to 1.7/4.8x increase in throughput and 2.0/3.8x decrease in memory consumption respectively on single-/multi-GPU setups compared to other baseline meta learning algorithms. Furthermore, we show that SAMA-based data optimization leads to consistent improvements in text classification accuracy with BERT and RoBERTa large language models, and achieves state-of-the-art results in both small- and large-scale data pruning on image classification tasks, demonstrating the practical applicability of scalable meta learning across language and vision domains.
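For context, the implicit-differentiation identity this family of methods builds on can be stated compactly (a standard formulation, not SAMA's exact algorithm): if $\theta^*(\lambda)$ solves the inner (base-level) problem, the hypergradient of the outer validation loss is

    \[
    \frac{d\mathcal{L}_{\mathrm{val}}}{d\lambda}
    = \frac{\partial\mathcal{L}_{\mathrm{val}}}{\partial\lambda}
    - \frac{\partial^2\mathcal{L}_{\mathrm{train}}}{\partial\lambda\,\partial\theta}
      \left(\frac{\partial^2\mathcal{L}_{\mathrm{train}}}{\partial\theta\,\partial\theta^\top}\right)^{-1}
      \frac{\partial\mathcal{L}_{\mathrm{val}}}{\partial\theta}
      \;\Bigg|_{\theta=\theta^*(\lambda)}.
    \]

Avoiding explicit second-order gradient information, as the abstract describes, amounts to approximating the inverse-Hessian-vector product above with first-order operations (e.g., finite differences of gradients).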
paper_authors: Muhan Li, David Matthews, Sam Kriegman
for: explores a method for designing freeform robots using policy gradients
methods: uses actions that deposit or remove bundles of atomic building blocks to form higher-level nonparametric macrostructures such as appendages, organs, and cavities
results: demonstrates open-loop control; the method could be adapted for closed-loop control and sim2real transfer to physical machines in future work.
Abstract
Inspired by the necessity of morphological adaptation in animals, a growing body of work has attempted to expand robot training to encompass physical aspects of a robot's design. However, reinforcement learning methods capable of optimizing the 3D morphology of a robot have been restricted to reorienting or resizing the limbs of a predetermined and static topological genus. Here we show policy gradients for designing freeform robots with arbitrary external and internal structure. This is achieved through actions that deposit or remove bundles of atomic building blocks to form higher-level nonparametric macrostructures such as appendages, organs and cavities. Although results are provided for open loop control only, we discuss how this method could be adapted for closed loop control and sim2real transfer to physical machines in future.
ViTs are Everywhere: A Comprehensive Study Showcasing Vision Transformers in Different Domain
results: The study finds that ViTs outperform Convolutional Neural Networks (CNNs) across a range of CV applications, including image classification, object identification, image segmentation, video transformers, image denoising, and NAS. It also outlines numerous open research difficulties and prospective research opportunities.
Abstract
Transformer design is the de facto standard for natural language processing tasks. The success of the transformer design in natural language processing has lately piqued the interest of researchers in the domain of computer vision. When compared to Convolutional Neural Networks (CNNs), Vision Transformers (ViTs) are becoming more popular and dominant solutions for many vision problems. Transformer-based models outperform other types of networks, such as convolutional and recurrent neural networks, in a range of visual benchmarks. We evaluate various vision transformer models in this work by dividing them into distinct jobs and examining their benefits and drawbacks. ViTs can overcome several possible difficulties with convolutional neural networks (CNNs). The goal of this survey is to show the first use of ViTs in CV. In the first phase, we categorize various CV applications where ViTs are appropriate. Image classification, object identification, image segmentation, video transformer, image denoising, and NAS are all CV applications. Our next step will be to analyze the state-of-the-art in each area and identify the models that are currently available. In addition, we outline numerous open research difficulties as well as prospective research possibilities.
Causal structure learning with momentum: Sampling distributions over Markov Equivalence Classes of DAGs
for: inferring Bayesian network structure (directed acyclic graph, DAG for short)
methods: a non-reversible continuous-time Markov chain (the Causal Zig-Zag sampler) targeting a probability distribution over classes of observationally equivalent (Markov equivalent) DAGs
results: improved mixing compared to state-of-the-art implementations by using Greedy Equivalence Search (GES) operators with a momentum variable, and efficient implementations of listing, counting, uniformly sampling, and applying possible moves of the GES operators.
Abstract
In the context of inferring a Bayesian network structure (directed acyclic graph, DAG for short), we devise a non-reversible continuous time Markov chain, the "Causal Zig-Zag sampler", that targets a probability distribution over classes of observationally equivalent (Markov equivalent) DAGs. The classes are represented as completed partially directed acyclic graphs (CPDAGs). The non-reversible Markov chain relies on the operators used in Chickering's Greedy Equivalence Search (GES) and is endowed with a momentum variable, which improves mixing significantly as we show empirically. The possible target distributions include posterior distributions based on a prior over DAGs and a Markov equivalent likelihood. We offer an efficient implementation wherein we develop new algorithms for listing, counting, uniformly sampling, and applying possible moves of the GES operators, all of which significantly improve upon the state-of-the-art.
No Token Left Behind: Efficient Vision Transformer via Dynamic Token Idling
paper_authors: Xuwei Xu, Changlin Li, Yudong Chen, Xiaojun Chang, Jiajun Liu, Sen Wang
For: The paper proposes IdleViT, a dynamic token-idle method that achieves a better trade-off between computational efficiency and performance.
Methods: In each layer, the method selects a subset of the image tokens to participate in computations and passes the rest directly to the layer's output, so that tokens pruned improvidently in early stages can be re-selected later; a token cut loss based on the normalized graph cut further improves the token selection ability.
Results: Experiments show that IdleViT can reduce the complexity of pretrained ViTs by up to 33% after fine-tuning for only 30 epochs; at a keep ratio of 0.5, IdleViT outperforms EViT on DeiT-S with 0.5% higher accuracy and even faster inference speed.
Abstract
Vision Transformers (ViTs) have demonstrated outstanding performance in computer vision tasks, yet their high computational complexity prevents their deployment in computing resource-constrained environments. Various token pruning techniques have been introduced to alleviate the high computational burden of ViTs by dynamically dropping image tokens. However, some undesirable pruning at early stages may result in permanent loss of image information in subsequent layers, consequently hindering model performance. To address this problem, we propose IdleViT, a dynamic token-idle-based method that achieves an excellent trade-off between performance and efficiency. Specifically, in each layer, IdleViT selects a subset of the image tokens to participate in computations while keeping the rest of the tokens idle and directly passing them to this layer's output. By allowing the idle tokens to be re-selected in the following layers, IdleViT mitigates the negative impact of improper pruning in the early stages. Furthermore, inspired by the normalized graph cut, we devise a token cut loss on the attention map as regularization to improve IdleViT's token selection ability. Our method is simple yet effective and can be extended to pyramid ViTs since no token is completely dropped. Extensive experimental results on various ViT architectures have shown that IdleViT can diminish the complexity of pretrained ViTs by up to 33\% with no more than 0.2\% accuracy decrease on ImageNet, after finetuning for only 30 epochs. Notably, when the keep ratio is 0.5, IdleViT outperforms the state-of-the-art EViT on DeiT-S by 0.5\% higher accuracy and even faster inference speed. The source code is available in the supplementary material.
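A minimal PyTorch sketch of the per-layer select/idle mechanism is given below; it reflects our reading of the abstract, assumes the importance scores (e.g., CLS attention) are supplied, and omits the token cut loss.

    import torch

    def idle_layer(block, x, scores, keep_ratio=0.5):
        # x: (B, N, D) token features; scores: (B, N) token importance.
        # Run `block` only on the top-k tokens; idle tokens pass straight
        # through and stay available for re-selection in later layers.
        B, N, D = x.shape
        k = max(1, int(N * keep_ratio))
        idx = scores.topk(k, dim=1).indices                  # (B, k)
        gather = idx.unsqueeze(-1).expand(B, k, D)
        attended = torch.gather(x, 1, gather)                # selected tokens
        out = x.clone()                                      # idle tokens copied
        out.scatter_(1, gather, block(attended))             # write back updates
        return out

    block = torch.nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
    x, scores = torch.randn(2, 16, 64), torch.rand(2, 16)
    y = idle_layer(block, x, scores)   # same shape as x, half the tokens updated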
FENCE: Fairplay Ensuring Network Chain Entity for Real-Time Multiple ID Detection at Scale In Fantasy Sports
results: The paper describes a distributed machine learning system deployed to serve and support inference from the detection models. The system performs detection in real time so that corrective actions can be taken, and includes human-in-the-loop components for validation, feedback, and ground-truth labeling.
Abstract
Dream11 takes pride in being a unique platform that enables over 190 million fantasy sports users to demonstrate their skills and connect deeper with their favorite sports. While managing such a scale, one issue we are faced with is duplicate/multiple account creation in the system. This is done by some users with the intent of abusing the platform, typically for bonus offers. The challenge is to detect these multiple accounts before it is too late. We propose a graph-based solution to solve this problem in which we first predict edges/associations between users. Using the edge information we highlight clusters of colluding multiple accounts. In this paper, we talk about our distributed ML system which is deployed to serve and support the inferences from our detection models. The challenge is to do this in real-time in order to take corrective actions. A core part of this setup also involves human-in-the-loop components for validation, feedback, and ground-truth labeling.
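The second stage described above, turning predicted user-user associations into clusters of colluding accounts, can be illustrated with connected components over thresholded edges (a simplification; the edge scores come from the association model, and the threshold is an assumption).

    import networkx as nx

    def collusion_clusters(scored_edges, threshold=0.9, min_size=2):
        # scored_edges: iterable of (user_a, user_b, p), where p is the
        # predicted probability that the two accounts share one owner.
        g = nx.Graph()
        g.add_edges_from((a, b) for a, b, p in scored_edges if p >= threshold)
        return [c for c in nx.connected_components(g) if len(c) >= min_size]

    edges = [("u1", "u2", 0.97), ("u2", "u3", 0.95), ("u4", "u5", 0.30)]
    print(collusion_clusters(edges))   # [{'u1', 'u2', 'u3'}]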
results: On the ImageNet-1K dataset, incorporating our module into tiny ViT models improves top-1 accuracy while changing computational complexity by less than 0.03 GMACs. Specifically, our proposed channel shuffle module consistently improves the top-1 accuracy by up to 2.8%.
Abstract
Vision Transformers (ViTs) have demonstrated remarkable performance in various computer vision tasks. However, the high computational complexity hinders ViTs' applicability on devices with limited memory and computing resources. Although certain investigations have delved into the fusion of convolutional layers with self-attention mechanisms to enhance the efficiency of ViTs, there remains a knowledge gap in constructing tiny yet effective ViTs solely based on the self-attention mechanism. Furthermore, the straightforward strategy of reducing the feature channels in a large but outperforming ViT often results in significant performance degradation despite improved efficiency. To address these challenges, we propose a novel channel shuffle module to improve tiny-size ViTs, showing the potential of pure self-attention models in environments with constrained computing resources. Inspired by the channel shuffle design in ShuffleNetV2 \cite{ma2018shufflenet}, our module expands the feature channels of a tiny ViT and partitions the channels into two groups: the \textit{Attended} and \textit{Idle} groups. Self-attention computations are exclusively employed on the designated \textit{Attended} group, followed by a channel shuffle operation that facilitates information exchange between the two groups. By incorporating our module into a tiny ViT, we can achieve superior performance while maintaining a comparable computational complexity to the vanilla model. Specifically, our proposed channel shuffle module consistently improves the top-1 accuracy on the ImageNet-1K dataset for various tiny ViT models by up to 2.8\%, with the changes in model complexity being less than 0.03 GMACs.
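The shuffle operation itself follows the standard ShuffleNetV2-style reshape/transpose; here is a minimal sketch for token features whose channels are split into the Attended and Idle groups (two groups, as in the abstract):

    import torch

    def channel_shuffle(x: torch.Tensor, groups: int = 2) -> torch.Tensor:
        # x: (B, N, C) token features; interleave channels across groups so
        # that information flows between the Attended and Idle halves.
        B, N, C = x.shape
        return x.view(B, N, groups, C // groups).transpose(2, 3).reshape(B, N, C)

    x = torch.arange(8.0).view(1, 1, 8)
    print(channel_shuffle(x))   # channels reordered as 0, 4, 1, 5, 2, 6, 3, 7

In the module described above, self-attention is computed only on the Attended group of channels, and the shuffle then mixes the updated information back into the Idle group.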
Domain Watermark: Effective and Harmless Dataset Copyright Protection is Closed at Hand
paper_authors: Junfeng Guo, Yiming Li, Lixu Wang, Shu-Tao Xia, Heng Huang, Cong Liu, Bo Li
for: protecting the copyright of open-source datasets while preventing malicious model attacks
methods: dataset ownership verification based on a domain watermark, which verifies models through generated hard samples
results: proposes a domain-watermark-based dataset ownership verification method that avoids introducing malicious misclassification behaviors and shows strong robustness and resistance to potential adaptive methods.
Abstract
The prosperity of deep neural networks (DNNs) is largely benefited from open-source datasets, based on which users can evaluate and improve their methods. In this paper, we revisit backdoor-based dataset ownership verification (DOV), which is currently the only feasible approach to protect the copyright of open-source datasets. We reveal that these methods are fundamentally harmful given that they could introduce malicious misclassification behaviors to watermarked DNNs by the adversaries. In this paper, we design DOV from another perspective by making watermarked models (trained on the protected dataset) correctly classify some `hard' samples that will be misclassified by the benign model. Our method is inspired by the generalization property of DNNs, where we find a \emph{hardly-generalized domain} for the original dataset (as its \emph{domain watermark}). It can be easily learned with the protected dataset containing modified samples. Specifically, we formulate the domain generation as a bi-level optimization and propose to optimize a set of visually-indistinguishable clean-label modified data with similar effects to domain-watermarked samples from the hardly-generalized domain to ensure watermark stealthiness. We also design a hypothesis-test-guided ownership verification via our domain watermark and provide the theoretical analyses of our method. Extensive experiments on three benchmark datasets are conducted, which verify the effectiveness of our method and its resistance to potential adaptive methods. The code for reproducing main experiments is available at \url{https://github.com/JunfengGo/Domain-Watermark}.
Dynamic Top-k Estimation Consolidates Disagreement between Feature Attribution Methods
paper_authors: Jonathan Kamp, Lisa Beinborn, Antske Fokkens
for: a method for explaining the predictions of text classifiers to users.
methods: the paper computes feature attribution scores with several different methods and evaluates their performance.
results: the study finds high agreement with both fixed and dynamic k, with dynamic k mainly improving Integrated Gradient and GradientXInput; this is the first evidence that the sequential properties of attribution scores are informative for human interpretation.
Abstract
Feature attribution scores are used for explaining the prediction of a text classifier to users by highlighting a k number of tokens. In this work, we propose a way to determine the number of optimal k tokens that should be displayed from sequential properties of the attribution scores. Our approach is dynamic across sentences, method-agnostic, and deals with sentence length bias. We compare agreement between multiple methods and humans on an NLI task, using fixed k and dynamic k. We find that perturbation-based methods and Vanilla Gradient exhibit highest agreement on most method--method and method--human agreement metrics with a static k. Their advantage over other methods disappears with dynamic ks which mainly improve Integrated Gradient and GradientXInput. To our knowledge, this is the first evidence that sequential properties of attribution scores are informative for consolidating attribution signals for human interpretation.
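The abstract does not spell out which sequential property determines k, so the sketch below uses one plausible choice (cutting at the largest drop in the sorted attribution profile), purely as a labelled assumption.

    import numpy as np

    def dynamic_k(attributions):
        # Choose k at the largest gap in the descending attribution profile;
        # the paper's actual criterion may differ.
        s = np.sort(np.abs(attributions))[::-1]
        gaps = s[:-1] - s[1:]
        return int(np.argmax(gaps)) + 1

    scores = np.array([0.41, 0.39, 0.05, 0.04, 0.03])
    print(dynamic_k(scores))   # 2: a large drop after the second token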
paper_authors: Lei Wang, Piotr Koniusz, Tom Gedeon, Liang Zheng
for: The paper is written to address the issue of inconsistent similarity measurements in contrastive learning, specifically when using multiple augmentation strategies.
methods: The paper proposes using multiple projection heads, each producing a separate set of features, to improve the consistency of similarity measurements in contrastive learning. The loss function for pre-training is based on a solution to the maximum likelihood estimation over head-wise posterior distributions of positive samples given observations.
results: The proposed adaptive multi-head contrastive learning (AMCL) method improves the performance of several popular contrastive learning methods, including SimCLR, MoCo, and Barlow Twins, under various backbones and linear probing epochs. The improvement is more significant when multiple augmentation methods are used.
Abstract
In contrastive learning, two views of an original image generated by different augmentations are considered as a positive pair whose similarity is required to be high. Moreover, two views of two different images are considered as a negative pair, and their similarity is encouraged to be low. Normally, a single similarity measure given by a single projection head is used to evaluate positive and negative sample pairs, respectively. However, due to the various augmentation strategies and varying intra-sample similarity, augmented views from the same image are often not similar. Moreover, due to inter-sample similarity, augmented views of two different images may be more similar than augmented views from the same image. As such, enforcing a high similarity for positive pairs and a low similarity for negative pairs may not always be achievable, and in the case of some pairs, forcing so may be detrimental to the performance. To address this issue, we propose to use multiple projection heads, each producing a separate set of features. Our loss function for pre-training emerges from a solution to the maximum likelihood estimation over head-wise posterior distributions of positive samples given observations. The loss contains the similarity measure over positive and negative pairs, each re-weighted by an individual adaptive temperature that is regularized to prevent ill solutions. Our adaptive multi-head contrastive learning (AMCL) can be applied to and experimentally improves several popular contrastive learning methods such as SimCLR, MoCo and Barlow Twins. Such improvement is consistent under various backbones and linear probing epoches and is more significant when multiple augmentation methods are used.
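A minimal sketch of the multi-head idea follows: several projection heads, each with its own learnable (regularised) temperature, whose per-head InfoNCE losses are averaged. This reduces the paper's posterior-derived objective to a plain cross-entropy form for illustration.

    import torch
    import torch.nn.functional as F

    class MultiHeadContrast(torch.nn.Module):
        def __init__(self, dim, proj_dim=128, n_heads=4):
            super().__init__()
            self.heads = torch.nn.ModuleList(
                [torch.nn.Linear(dim, proj_dim) for _ in range(n_heads)])
            # One adaptive temperature per head, stored on a log scale.
            self.log_temp = torch.nn.Parameter(torch.zeros(n_heads))

        def forward(self, z1, z2):
            # z1, z2: (B, dim) features of two augmented views; matching rows
            # are positives, all other rows in the batch are negatives.
            losses = []
            for head, lt in zip(self.heads, self.log_temp):
                p1 = F.normalize(head(z1), dim=1)
                p2 = F.normalize(head(z2), dim=1)
                logits = p1 @ p2.T / lt.exp().clamp(min=1e-2)  # bounded temperature
                target = torch.arange(z1.size(0), device=z1.device)
                losses.append(F.cross_entropy(logits, target))
            return torch.stack(losses).mean()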
InterroLang: Exploring NLP Models and Datasets through Dialogue-based Explanations
paper_authors: Nils Feldhus, Qianli Wang, Tatiana Anikina, Sahil Chopra, Cennet Oguz, Sebastian Möller
for: developing an interactive dialogue system that helps users obtain explanations of models and datasets through a natural language interface.
methods: adapts the conversational explanation framework TalkToModel (Slack et al., 2022) to the NLP domain and adds new NLP-specific operations such as free-text rationalization.
results: dialogue-based explanations were perceived as correct and helpful in understanding model behavior, and users could more reliably predict the model's outcome from an explanation dialogue than from one-off explanations.
Abstract
While recently developed NLP explainability methods let us open the black box in various ways (Madsen et al., 2022), a missing ingredient in this endeavor is an interactive tool offering a conversational interface. Such a dialogue system can help users explore datasets and models with explanations in a contextualized manner, e.g. via clarification or follow-up questions, and through a natural language interface. We adapt the conversational explanation framework TalkToModel (Slack et al., 2022) to the NLP domain, add new NLP-specific operations such as free-text rationalization, and illustrate its generalizability on three NLP tasks (dialogue act classification, question answering, hate speech detection). To recognize user queries for explanations, we evaluate fine-tuned and few-shot prompting models and implement a novel Adapter-based approach. We then conduct two user studies on (1) the perceived correctness and helpfulness of the dialogues, and (2) the simulatability, i.e. how objectively helpful dialogical explanations are for humans in figuring out the model's predicted label when it's not shown. We found rationalization and feature attribution were helpful in explaining the model behavior. Moreover, users could more reliably predict the model outcome based on an explanation dialogue rather than one-off explanations.
Aggregated f-average Neural Network for Interpretable Ensembling
results: the network models and combines different types of averages to optimally aggregate the weak learners' predictions; with its interpretable architecture and simple training strategy, it performs well on the problem of few-shot class incremental learning.
Abstract
Ensemble learning leverages multiple models (i.e., weak learners) on a common machine learning task to enhance prediction performance. Basic ensembling approaches average the weak learners' outputs, while more sophisticated ones stack a machine learning model in between the weak learners' outputs and the final prediction. This work fuses both aforementioned frameworks. We introduce an aggregated f-average (AFA) shallow neural network which models and combines different types of averages to perform an optimal aggregation of the weak learners predictions. We emphasise its interpretable architecture and simple training strategy, and illustrate its good performance on the problem of few-shot class incremental learning.
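The f-average at the core of the architecture is a quasi-arithmetic mean: f^{-1} applied to a (weighted) mean of f applied to the weak learners' outputs. A numpy sketch with hand-picked choices of f follows; the network instead learns and combines such averages.

    import numpy as np

    def f_average(preds, f, f_inv, weights=None):
        # Quasi-arithmetic mean: f_inv(weighted mean of f(predictions)).
        preds = np.asarray(preds, dtype=float)
        w = np.full(len(preds), 1 / len(preds)) if weights is None else np.asarray(weights)
        return f_inv(np.sum(w[:, None] * f(preds), axis=0))

    weak = [np.array([0.9, 0.1]), np.array([0.6, 0.4]), np.array([0.8, 0.2])]
    arith = f_average(weak, lambda p: p, lambda p: p)         # arithmetic mean
    geom = f_average(weak, np.log, np.exp)                    # geometric mean
    harm = f_average(weak, lambda p: 1 / p, lambda p: 1 / p)  # harmonic mean
    print(arith, geom, harm)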
STREAM: Social data and knowledge collective intelligence platform for TRaining Ethical AI Models
results: STREAM already provides a comprehensive collection of ethical scenarios and has amassed substantial moral judgment data annotated by volunteers and various popular Large Language Models (LLMs), collectively portraying the moral preferences and performances of both humans and AIs across a range of moral contexts.
Abstract
This paper presents Social data and knowledge collective intelligence platform for TRaining Ethical AI Models (STREAM) to address the challenge of aligning AI models with human moral values, and to provide ethics datasets and knowledge bases to help promote AI models "follow good advice as naturally as a stream follows its course". By creating a comprehensive and representative platform that accurately mirrors the moral judgments of diverse groups including humans and AIs, we hope to effectively portray cultural and group variations, and capture the dynamic evolution of moral judgments over time, which in turn will facilitate the Establishment, Evaluation, Embedding, Embodiment, Ensemble, and Evolvement (6Es) of the moral capabilities of AI models. Currently, STREAM has already furnished a comprehensive collection of ethical scenarios, and amassed substantial moral judgment data annotated by volunteers and various popular Large Language Models (LLMs), collectively portraying the moral preferences and performances of both humans and AIs across a range of moral contexts. This paper will outline the current structure and construction of STREAM, explore its potential applications, and discuss its future prospects.
WeatherDepth: Curriculum Contrastive Learning for Self-Supervised Depth Estimation under Adverse Weather Conditions
results: in experiments, the proposed solution is easily incorporated into various architectures and achieves state-of-the-art performance on both synthetic and real weather datasets.
Abstract
Depth estimation models have shown promising performance on clear scenes but fail to generalize to adverse weather conditions due to illumination variations, weather particles, etc. In this paper, we propose WeatherDepth, a self-supervised robust depth estimation model with curriculum contrastive learning, to tackle performance degradation in complex weather conditions. Concretely, we first present a progressive curriculum learning scheme with three simple-to-complex curricula to gradually adapt the model from clear to relative adverse, and then to adverse weather scenes. It encourages the model to gradually grasp beneficial depth cues against the weather effect, yielding smoother and better domain adaptation. Meanwhile, to prevent the model from forgetting previous curricula, we integrate contrastive learning into different curricula. Drawing on reference knowledge from the previous course, our strategy establishes a depth consistency constraint between different courses towards robust depth estimation in diverse weather. Besides, to reduce manual intervention and better adapt to different models, we designed an adaptive curriculum scheduler to automatically search for the best timing for course switching. In the experiment, the proposed solution is proven to be easily incorporated into various architectures and demonstrates state-of-the-art (SoTA) performance on both synthetic and real weather datasets.
Logic-guided Deep Reinforcement Learning for Stock Trading
results: on the 30 Dow Jones stocks, SYENS achieves much higher cumulative return and lower maximum drawdown, significantly outperforming the baselines under both the cash trading and the margin trading settings.
Abstract
Deep reinforcement learning (DRL) has revolutionized quantitative finance by achieving excellent performance without significant manual effort. Whereas we observe that the DRL models behave unstably in a dynamic stock market due to the low signal-to-noise ratio nature of the financial data. In this paper, we propose a novel logic-guided trading framework, termed as SYENS (Program Synthesis-based Ensemble Strategy). Different from the previous state-of-the-art ensemble reinforcement learning strategy which arbitrarily selects the best-performing agent for testing based on a single measurement, our framework proposes regularizing the model's behavior in a hierarchical manner using the program synthesis by sketching paradigm. First, we propose a high-level, domain-specific language (DSL) that is used for the depiction of the market environment and action. Then based on the DSL, a novel program sketch is introduced, which embeds human expert knowledge in a logical manner. Finally, based on the program sketch, we adopt the program-synthesis-by-sketching paradigm and synthesize a logical, hierarchical trading strategy. We evaluate SYENS on the 30 Dow Jones stocks under the cash trading and the margin trading settings. Experimental results demonstrate that our proposed framework can significantly outperform the baselines with much higher cumulative return and lower maximum drawdown under both settings.
ParFam – Symbolic Regression Based on Continuous Global Optimization
results: extensive numerical experiments show that ParFam achieves state-of-the-art results on symbolic regression benchmarks, and it can easily be extended to more advanced algorithms, e.g., by adding a deep neural network to find suitable parametric families.
Abstract
The problem of symbolic regression (SR) arises in many different applications, such as identifying physical laws or deriving mathematical equations describing the behavior of financial markets from given data. Various methods exist to address the problem of SR, often based on genetic programming. However, these methods are usually quite complicated and require a lot of hyperparameter tuning and computational resources. In this paper, we present our new method ParFam that utilizes parametric families of suitable symbolic functions to translate the discrete symbolic regression problem into a continuous one, resulting in a more straightforward setup compared to current state-of-the-art methods. In combination with a powerful global optimizer, this approach results in an effective method to tackle the problem of SR. Furthermore, it can be easily extended to more advanced algorithms, e.g., by adding a deep neural network to find good-fitting parametric families. We prove the performance of ParFam with extensive numerical experiments based on the common SR benchmark suit SRBench, showing that we achieve state-of-the-art results. Our code and results can be found at https://github.com/Philipp238/parfam .
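The core recipe, fixing a parametric family of symbolic functions and fitting its continuous parameters with a global optimiser, can be sketched as follows. The toy family and the use of scipy's dual_annealing are illustrative assumptions; ParFam's actual families and optimiser differ.

    import numpy as np
    from scipy.optimize import dual_annealing

    x = np.linspace(0.1, 4.0, 200)
    y = 2.0 * np.sin(x) + 0.5 * x            # 'unknown' law to recover

    def family(theta, x):
        # A small parametric family: a linear mix of base functions.
        a, b, c, d = theta
        return a * np.sin(b * x) + c * x + d

    def loss(theta):
        return np.mean((family(theta, x) - y) ** 2)

    res = dual_annealing(loss, bounds=[(-5, 5)] * 4, seed=1)
    print(res.x)   # should approach (2, 1, 0.5, 0) up to sign symmetries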
On Double Descent in Reinforcement Learning with LSTD and Random Features
results: the authors identify a double descent phenomenon, i.e., a sudden drop in performance around a parameter/state ratio of one. The correction terms responsible for it vanish when the $l_2$-regularization is increased or the number of unvisited states goes to zero; numerical experiments with synthetic and small real-world environments closely match the theoretical predictions.
Abstract
Temporal Difference (TD) algorithms are widely used in Deep Reinforcement Learning (RL). Their performance is heavily influenced by the size of the neural network. While in supervised learning, the regime of over-parameterization and its benefits are well understood, the situation in RL is much less clear. In this paper, we present a theoretical analysis of the influence of network size and $l_2$-regularization on performance. We identify the ratio between the number of parameters and the number of visited states as a crucial factor and define over-parameterization as the regime when it is larger than one. Furthermore, we observe a double descent phenomenon, i.e., a sudden drop in performance around the parameter/state ratio of one. Leveraging random features and the lazy training regime, we study the regularized Least-Square Temporal Difference (LSTD) algorithm in an asymptotic regime, as both the number of parameters and states go to infinity, maintaining a constant ratio. We derive deterministic limits of both the empirical and the true Mean-Square Bellman Error (MSBE) that feature correction terms responsible for the double-descent. Correction terms vanish when the $l_2$-regularization is increased or the number of unvisited states goes to zero. Numerical experiments with synthetic and small real-world environments closely match the theoretical predictions.
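For reference, the regularised LSTD estimator studied here has a standard closed form (up to normalisation conventions), with $\Phi$ the feature matrix of visited states, $\Phi'$ that of successor states, $r$ the rewards, and $\gamma$ the discount factor:

    \[
    \hat{\theta} = \left( \Phi^{\top} (\Phi - \gamma \Phi') + \lambda I \right)^{-1} \Phi^{\top} r,
    \qquad
    \widehat{\mathrm{MSBE}}(\hat{\theta}) = \frac{1}{n} \left\lVert r + \gamma \Phi' \hat{\theta} - \Phi \hat{\theta} \right\rVert_2^2.
    \]

The double descent then appears as the parameter/state ratio (the number of columns of $\Phi$ over its number of rows) crosses one.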
UAVs and Neural Networks for search and rescue missions
paper_authors: Hartmut Surmann, Artur Leinweber, Gerhard Senkowski, Julien Meine, Dominik Slomma
for: detection of objects of interest (cars, humans, fire) in aerial images captured by UAVs during vegetation fires
methods: use of artificial neural networks, creation of a dataset for supervised learning, implementation of an object detection pipeline combining classic image processing techniques with pretrained neural networks, development of a data augmentation pipeline to augment the dataset with automatically labeled images
results: evaluation of the performance of different neural networks.
Abstract
In this paper, we present a method for detecting objects of interest, including cars, humans, and fire, in aerial images captured by unmanned aerial vehicles (UAVs) usually during vegetation fires. To achieve this, we use artificial neural networks and create a dataset for supervised learning. We accomplish the assisted labeling of the dataset through the implementation of an object detection pipeline that combines classic image processing techniques with pretrained neural networks. In addition, we develop a data augmentation pipeline to augment the dataset with automatically labeled images. Finally, we evaluate the performance of different neural networks.
Query and Response Augmentation Cannot Help Out-of-domain Math Reasoning Generalization
results: the authors find that data augmentation improves the mathematical reasoning performance of LLMs, with a log-linear relationship between performance and the amount of augmented data; however, generalization across different mathematical reasoning tasks still needs further improvement.
Abstract
In math reasoning with large language models (LLMs), fine-tuning data augmentation by query evolution and diverse reasoning paths is empirically verified effective, profoundly narrowing the gap between open-sourced LLMs and cutting-edge proprietary LLMs. In this paper, we conduct an investigation for such data augmentation in math reasoning and are intended to answer: (1) What strategies of data augmentation are more effective; (2) What is the scaling relationship between the amount of augmented data and model performance; and (3) Can data augmentation incentivize generalization to out-of-domain mathematical reasoning tasks? To this end, we create a new dataset, AugGSM8K, by complicating and diversifying the queries from GSM8K and sampling multiple reasoning paths. We obtained a series of LLMs called MuggleMath by fine-tuning on subsets of AugGSM8K. MuggleMath substantially achieves new state-of-the-art on GSM8K (from 54% to 68.4% at the scale of 7B, and from 63.9% to 74.0% at the scale of 13B). A log-linear relationship is presented between MuggleMath's performance and the amount of augmented data. We also find that MuggleMath is weak in out-of-domain math reasoning generalization to MATH. This is attributed to the differences in query distribution between AugGSM8K and MATH which suggest that augmentation on a single benchmark could not help with overall math reasoning performance. Codes and AugGSM8K will be uploaded to https://github.com/OFA-Sys/gsm8k-ScRel.
Integrating Graphs with Large Language Models: Methods and Prospects
results: the survey shows that integrating graph structures with LLMs can boost LLM performance and broaden their range of applications, and it poses open questions for future research.
Abstract
Large language models (LLMs) such as GPT-4 have emerged as frontrunners, showcasing unparalleled prowess in diverse applications, including answering queries, code generation, and more. Parallelly, graph-structured data, an intrinsic data type, is pervasive in real-world scenarios. Merging the capabilities of LLMs with graph-structured data has been a topic of keen interest. This paper bifurcates such integrations into two predominant categories. The first leverages LLMs for graph learning, where LLMs can not only augment existing graph algorithms but also stand as prediction models for various graph tasks. Conversely, the second category underscores the pivotal role of graphs in advancing LLMs. Mirroring human cognition, we solve complex tasks by adopting graphs in either reasoning or collaboration. Integrating with such structures can significantly boost the performance of LLMs in various complicated tasks. We also discuss and propose open questions for integrating LLMs with graph-structured data for the future direction of the field.
How Abilities in Large Language Models are Affected by Supervised Fine-tuning Data Composition
paper_authors: Guanting Dong, Hongyi Yuan, Keming Lu, Chengpeng Li, Mingfeng Xue, Dayiheng Liu, Wei Wang, Zheng Yuan, Chang Zhou, Jingren Zhou
for: investigating how multiple abilities of LLMs are affected by supervised fine-tuning (SFT), and how the data composition across abilities influences performance.
methods: various SFT strategies, including sequential learning of multiple abilities and a dual-stage mixed fine-tuning (DMT) strategy, together with controlled variation of data amounts and composition ratios.
results: different abilities exhibit different scaling patterns, and larger models generally show superior performance with the same amount of data; mathematical reasoning and code generation improve consistently as data amounts increase, while the general ability improves with about a thousand samples and then improves slowly. Data composition enhances the various abilities under low data amounts but causes ability conflicts under high data amounts; the DMT strategy avoids the catastrophic forgetting of sequential learning, offering a promising solution for learning multiple abilities.
Abstract
Large language models (LLMs) with enormous pre-training tokens and parameter amounts exhibit emergent abilities, including math reasoning, code generation, and instruction following. These abilities are further enhanced by supervised fine-tuning (SFT). The open-source community has studied ad-hoc SFT for each ability, while proprietary LLMs are versatile for all abilities. It is important to investigate how to unlock multiple abilities via SFT. In this study, we specifically focus on the data composition between mathematical reasoning, code generation, and general human-aligning abilities during SFT. From a scaling perspective, we investigate the relationship between model abilities and various factors including data amounts, data composition ratio, model parameters, and SFT strategies. Our experiments reveal that different abilities exhibit different scaling patterns, and larger models generally show superior performance with the same amount of data. Mathematical reasoning and code generation improve consistently as data amounts increase, while the general ability is enhanced with about a thousand samples and then improves slowly. We find that data composition improves various abilities with low data amounts, whereas abilities conflict with high data amounts. Our experiments further show that the amount of composition data impacts performance, while the influence of the composition ratio is insignificant. Regarding SFT strategies, we find that sequentially learning multiple abilities is prone to catastrophic forgetting. Our proposed Dual-stage Mixed Fine-tuning (DMT) strategy learns specialized abilities first and then learns general abilities with a small amount of specialized data to prevent forgetting, offering a promising solution to learn multiple abilities with different scaling patterns.
Cabbage Sweeter than Cake? Analysing the Potential of Large Language Models for Learning Conceptual Spaces
paper_authors: Usashi Chatterjee, Amit Gajbhiye, Steven Schockaert
for: exploring the potential of Large Language Models (LLMs) for learning conceptual spaces.
methods: a language-model-based approach that constructs conceptual spaces by learning from human judgements.
results: experiments show that LLMs can learn meaningful conceptual-space representations to some extent, and that fine-tuned BERT-family models can match or even outperform the largest GPT-3 model despite being 2 to 3 orders of magnitude smaller.
Abstract
The theory of Conceptual Spaces is an influential cognitive-linguistic framework for representing the meaning of concepts. Conceptual spaces are constructed from a set of quality dimensions, which essentially correspond to primitive perceptual features (e.g. hue or size). These quality dimensions are usually learned from human judgements, which means that applications of conceptual spaces tend to be limited to narrow domains (e.g. modelling colour or taste). Encouraged by recent findings about the ability of Large Language Models (LLMs) to learn perceptually grounded representations, we explore the potential of such models for learning conceptual spaces. Our experiments show that LLMs can indeed be used for learning meaningful representations to some extent. However, we also find that fine-tuned models of the BERT family are able to match or even outperform the largest GPT-3 model, despite being 2 to 3 orders of magnitude smaller.
results: the paper approximates the optimal execution times with a recurrent neural network (RNN), reducing operation costs in practical applications. Implementation details are available in the \url{github.com/ChenPopper/optimal_timing_TSF} repository.
Abstract
Deciding the best future execution time is a critical task in many business activities while evolving time series forecasting, and optimal timing strategy provides such a solution, which is driven by observed data. This solution has plenty of valuable applications to reduce the operation costs. In this paper, we propose a mechanism that combines a probabilistic time series forecasting task and an optimal timing decision task as a first systematic attempt to tackle these practical problems with both solid theoretical foundation and real-world flexibility. Specifically, it generates the future paths of the underlying time series via probabilistic forecasting algorithms, which does not need a sophisticated mathematical dynamic model relying on strong prior knowledge as most other common practices. In order to find the optimal execution time, we formulate the decision task as an optimal stopping problem, and employ a recurrent neural network structure (RNN) to approximate the optimal times. Github repository: \url{github.com/ChenPopper/optimal_timing_TSF}.
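A stripped-down sketch of the two ingredients, sampled forecast paths and a stopping rule evaluated on them, is given below. It replaces the paper's RNN-approximated stopping times with a simple threshold policy tuned by grid search, purely for illustration; the cost paths are synthetic stand-ins for the probabilistic forecasts.

    import numpy as np

    rng = np.random.default_rng(0)
    # Stand-in for probabilistic forecasts: 500 sampled future cost paths, 24 steps.
    paths = 1.0 + 0.1 * np.cumsum(rng.standard_normal((500, 24)), axis=1)

    def expected_cost(threshold, paths):
        # Execute at the first time the cost dips below the threshold;
        # if it never does, execute at the final step.
        costs = []
        for p in paths:
            hits = np.nonzero(p < threshold)[0]
            costs.append(p[hits[0]] if hits.size else p[-1])
        return float(np.mean(costs))

    best = min(np.linspace(0.5, 1.5, 41), key=lambda t: expected_cost(t, paths))
    print(best, expected_cost(best, paths))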
results: performs favorably against existing state-of-the-art methods on the Fashion-IQ and CIRR datasets.
Abstract
Composed image retrieval (CIR) is the task of retrieving specific images by using a query that involves both a reference image and a relative caption. Most existing CIR models adopt the late-fusion strategy to combine visual and language features. Besides, several approaches have also been suggested to generate a pseudo-word token from the reference image, which is further integrated into the relative caption for CIR. However, these pseudo-word-based prompting methods have limitations when target image encompasses complex changes on reference image, e.g., object removal and attribute modification. In this work, we demonstrate that learning an appropriate sentence-level prompt for the relative caption (SPRC) is sufficient for achieving effective composed image retrieval. Instead of relying on pseudo-word-based prompts, we propose to leverage pretrained V-L models, e.g., BLIP-2, to generate sentence-level prompts. By concatenating the learned sentence-level prompt with the relative caption, one can readily use existing text-based image retrieval models to enhance CIR performance. Furthermore, we introduce both image-text contrastive loss and text prompt alignment loss to enforce the learning of suitable sentence-level prompts. Experiments show that our proposed method performs favorably against the state-of-the-art CIR methods on the Fashion-IQ and CIRR datasets. The source code and pretrained model are publicly available at https://github.com/chunmeifeng/SPRC
results: experiments on a new testbed covering 58 different scenarios show that Auto-J outperforms a series of strong competitors, including both open-source and closed-source models, by a large margin.
Abstract
The rapid development of Large Language Models (LLMs) has substantially expanded the range of tasks they can address. In the field of Natural Language Processing (NLP), researchers have shifted their focus from conventional NLP tasks (e.g., sequence tagging and parsing) towards tasks that revolve around aligning with human needs (e.g., brainstorming and email writing). This shift in task distribution imposes new requirements on evaluating these aligned models regarding generality (i.e., assessing performance across diverse scenarios), flexibility (i.e., examining under different protocols), and interpretability (i.e., scrutinizing models with explanations). In this paper, we propose a generative judge with 13B parameters, Auto-J, designed to address these challenges. Our model is trained on user queries and LLM-generated responses under massive real-world scenarios and accommodates diverse evaluation protocols (e.g., pairwise response comparison and single-response evaluation) with well-structured natural language critiques. To demonstrate the efficacy of our approach, we construct a new testbed covering 58 different scenarios. Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models, by a large margin. We also provide detailed analysis and case studies to further reveal the potential of our method and make a variety of resources public at https://github.com/GAIR-NLP/auto-j.
Cost-Sensitive Best Subset Selection for Logistic Regression: A Mixed-Integer Conic Optimization Perspective
results: The analysis reveals key limitations of the methods in the low-data regime and under label noise, provides practical recommendations for suitable methods and dataset designs, and paves the way for future research on meta-learning.
Abstract
A key challenge in machine learning is to design interpretable models that can reduce their inputs to the best subset for making transparent predictions, especially in the clinical domain. In this work, we propose a certifiably optimal feature selection procedure for logistic regression from a mixed-integer conic optimization perspective that can take an auxiliary cost to obtain features into account. Based on an extensive review of the literature, we carefully create a synthetic dataset generator for clinical prognostic model research. This allows us to systematically evaluate different heuristic and optimal cardinality- and budget-constrained feature selection procedures. The analysis shows key limitations of the methods for the low-data regime and when confronted with label noise. Our paper not only provides empirical recommendations for suitable methods and dataset designs, but also paves the way for future research in the area of meta-learning.
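To make the mixed-integer formulation concrete, here is a minimal sketch of cardinality- and budget-constrained logistic regression using big-M constraints in cvxpy. The big-M value, solver choice, and the way feature costs enter are illustrative assumptions, not the paper's exact conic formulation:

```python
import cvxpy as cp
import numpy as np

def cost_sensitive_best_subset(X, y, costs, k, budget, M=10.0):
    """Select at most k features under a feature-acquisition budget.

    X: (n, p) design matrix, y: labels in {-1, +1}, costs: (p,) feature costs.
    z[j] = 1 iff feature j is selected; big-M constraints tie z to the weights.
    """
    n, p = X.shape
    w = cp.Variable(p)
    b = cp.Variable()
    z = cp.Variable(p, boolean=True)
    margins = cp.multiply(y, X @ w + b)
    loss = cp.sum(cp.logistic(-margins))   # logistic negative log-likelihood
    constraints = [
        w <= M * z, w >= -M * z,           # w[j] = 0 unless z[j] = 1
        cp.sum(z) <= k,                    # cardinality constraint
        costs @ z <= budget,               # acquisition-cost budget
    ]
    prob = cp.Problem(cp.Minimize(loss), constraints)
    # Requires a mixed-integer-capable conic solver (e.g., ECOS_BB or MOSEK).
    prob.solve(solver=cp.ECOS_BB)
    return w.value, b.value, z.value
```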
Ensemble-based Hybrid Optimization of Bayesian Neural Networks and Traditional Machine Learning Algorithms
results: The ensemble method stands out as a robust, algorithmically optimized approach, while hyperparameter tuning has only a subdued impact on Expected Improvement (EI).
Abstract
This research introduces a novel methodology for optimizing Bayesian Neural Networks (BNNs) by synergistically integrating them with traditional machine learning algorithms such as Random Forests (RF), Gradient Boosting (GB), and Support Vector Machines (SVM). Feature integration solidifies these results by emphasizing the second-order conditions for optimality, including stationarity and positive definiteness of the Hessian matrix. Conversely, hyperparameter tuning indicates a subdued impact in improving Expected Improvement (EI), represented by EI(x). Overall, the ensemble method stands out as a robust, algorithmically optimized approach.
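Since the abstract centers on Expected Improvement EI(x), it may help to recall its closed form under a Gaussian posterior. The sketch below uses the standard maximization convention; the exploration margin xi is a conventional default, not a value from the paper:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """Closed-form EI(x) for a Gaussian posterior (maximization convention).

    mu, sigma: posterior mean and standard deviation at candidate points.
    f_best: best objective value observed so far; xi: exploration margin.
    """
    sigma = np.maximum(sigma, 1e-12)       # guard against division by zero
    z = (mu - f_best - xi) / sigma
    return (mu - f_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)
```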
Explaining the Complex Task Reasoning of Large Language Models with Template-Content Structure
for: This paper aims to provide an explanation for the exceptional generalization abilities of pre-trained large language models, and to offer a novel framework for understanding their ability to solve complex natural language tasks.
methods: The paper presents a hierarchical “template-content” structure for modeling answer generation in natural language tasks, and demonstrates that pre-trained models can automatically decompose tasks into constituent steps during autoregressive generation through language modeling on a sufficiently large corpus.
results: The paper shows that practical models exhibit different behaviors for “template” and “content”, providing support for the proposed modeling, and offers an explanatory tool for the complex reasoning abilities of large language models from the perspective of modeling autoregressive generation tasks.
Abstract
The continuous evolution of pre-trained large language models with ever-growing parameters and corpus sizes has augmented their capacity to solve complex tasks. This ability, which obviates the necessity for task-specific training or fine-tuning, relies on providing the model with a language description or some task exemplars -- referred to as the prompt -- that guide the desired autoregressive generation. Despite the remarkable success, the underlying mechanisms that facilitate such exceptional generalization abilities remain an open question. In this paper, we present a novel framework that formally conceptualizes answer generation for complex natural language tasks as a hierarchical “template-content” structure. According to our modeling, there exist pre-trained models that can automatically decompose tasks into constituent steps during autoregressive generation, through language modeling on a sufficiently large corpus, thereby solving them. Our framework offers an explanatory tool for the complex reasoning abilities of large language models from the perspective of modeling autoregressive generation tasks. Our experiments show that practical models exhibit different behaviors for “template” and “content”, providing support for our modeling.
Replication of Multi-agent Reinforcement Learning for the “Hide and Seek” Problem
results: Adding a flying mechanism enhances the agents' mobility and expands their range of possible actions and strategies, and the Hider agents develop a chasing strategy in approximately 1.6 million steps instead of 2 million.
Abstract
Reinforcement learning generates policies based on reward functions and hyperparameters. Slight changes in these can significantly affect results. The lack of documentation and reproducibility in reinforcement learning research makes it difficult to replicate once-deduced strategies. While previous research has identified strategies using grounded maneuvers, there is limited work in more complex environments. The agents in this study are simulated similarly to OpenAI's hide-and-seek agents, with the addition of a flying mechanism that enhances their mobility and expands their range of possible actions and strategies. This added functionality helps the Hider agents develop a chasing strategy in approximately 1.6 million steps rather than 2 million, and hiders
Divide and Ensemble: Progressively Learning for the Unknown
paper_authors: Hu Zhang, Xin Shen, Heming Du, Huiqiang Chen, Chen Liu, Hongwei Sheng, Qingzheng Xu, MD Wahiduzzaman Khan, Qingtao Yu, Tianqing Zhu, Scott Chapman, Zi Huang, Xin Yu
for: Classifying wheat nutrient deficiencies with a divide-and-ensemble approach.
methods: Partitions the dataset into groups by collection date, trains a model per group, and uses ensemble-based pseudo-labeling to progressively label the test data.
results: Achieves an average Top-1 test accuracy of 93.6% (94.0% on WW2020 and 93.2% on WR2021) and wins 1st place in the Deep Nutrient Deficiency Challenge.
Abstract
In the wheat nutrient deficiencies classification challenge, we present the DividE and EnseMble (DEEM) method for progressive test data predictions. We find that (1) test images are provided in the challenge; (2) samples are equipped with their collection dates; (3) the samples of different dates show notable discrepancies. Based on the findings, we partition the dataset into discrete groups by the dates and train models on each divided group. We then adopt the pseudo-labeling approach to label the test data and incorporate those with high confidence into the training set. In pseudo-labeling, we leverage an ensemble of models with different architectures to enhance the reliability of predictions. The pseudo-labeling and ensembled model training are iteratively conducted until all test samples are labeled. Finally, the separated models for each group are unified to obtain the model for the whole dataset. Our method achieves an average of 93.6% Top-1 test accuracy (94.0% on WW2020 and 93.2% on WR2021) and wins the 1st place in the Deep Nutrient Deficiency Challenge (https://cvppa2023.github.io/challenges/).
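The pseudo-labeling step described above can be sketched as follows: average the ensembled models' predictions and keep only the samples whose confidence clears a threshold. The predict_proba interface and the 0.95 threshold are assumptions for illustration, not the authors' settings:

```python
import numpy as np

def pseudo_label_round(models, unlabeled_images, threshold=0.95):
    """One round of ensemble pseudo-labeling.

    models: trained classifiers exposing a predict_proba-style interface.
    Returns the confidently labeled subset to merge into the training set.
    """
    probs = np.mean([m.predict_proba(unlabeled_images) for m in models], axis=0)
    confidence = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    keep = confidence >= threshold        # keep only high-confidence samples
    return unlabeled_images[keep], labels[keep], keep
```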
Humanoid Agents: Platform for Simulating Human-like Generative Agents
results: The system enables agents to use these dynamic elements to adapt their daily activities and conversations with other agents, as verified empirically, and is extensible to various settings and to other factors influencing human behavior (e.g., empathy, moral values and cultural background).
Abstract
Just as computational simulations of atoms, molecules and cells have shaped the way we study the sciences, true-to-life simulations of human-like agents can be valuable tools for studying human behavior. We propose Humanoid Agents, a system that guides Generative Agents to behave more like humans by introducing three elements of System 1 processing: Basic needs (e.g. hunger, health and energy), Emotion and Closeness in Relationships. Humanoid Agents are able to use these dynamic elements to adapt their daily activities and conversations with other agents, as supported with empirical experiments. Our system is designed to be extensible to various settings, three of which we demonstrate, as well as to other elements influencing human behavior (e.g. empathy, moral values and cultural background). Our platform also includes a Unity WebGL game interface for visualization and an interactive analytics dashboard to show agent statuses over time. Our platform is available on https://www.humanoidagents.com/ and code is on https://github.com/HumanoidAgents/HumanoidAgents
Ethics of Artificial Intelligence and Robotics in the Architecture, Engineering, and Construction Industry
paper_authors: Ci-Jyun Liang, Thai-Hoa Le, Youngjib Ham, Bharadwaj R. K. Mantha, Marvin H. Cheng, Jacob J. Lin
for: This research paper focuses on the ethical considerations of AI and robotics adoption in the architecture, engineering, and construction (AEC) industry.
methods: The paper systematically reviews existing literature on AI and robotics research in the AEC industry, identifying key ethical issues and research topics.
results: The paper identifies nine key ethical issues, including job loss, data privacy, and liability, and provides thirteen research topics for future study. It also highlights current challenges and knowledge gaps in the field, and provides recommendations for future research directions.
Abstract
Artificial intelligence (AI) and robotics research and implementation emerged in the architecture, engineering, and construction (AEC) industry to positively impact project efficiency and effectiveness concerns such as safety, productivity, and quality. This shift, however, warrants the need for ethical considerations of AI and robotics adoption due to its potential negative impacts on aspects such as job security, safety, and privacy. Nevertheless, this did not receive sufficient attention, particularly within the academic community. This research systematically reviews AI and robotics research through the lens of ethics in the AEC community for the past five years. It identifies nine key ethical issues namely job loss, data privacy, data security, data transparency, decision-making conflict, acceptance and trust, reliability and safety, fear of surveillance, and liability, by summarizing existing literature and filtering it further based on its AEC relevance. Furthermore, thirteen research topics along the process were identified based on existing AEC studies that had direct relevance to the theme of ethics in general and their parallels are further discussed. Finally, the current challenges and knowledge gaps are discussed and seven specific future research directions are recommended. This study not only signifies more stakeholder awareness of this important topic but also provides imminent steps towards safer and more efficient realization.
Causal Reasoning through Two Layers of Cognition for Improving Generalization in Visual Question Answering
for: Improving the generalization of Visual Question Answering (VQA) models so that they can answer questions about images in contexts beyond the training distribution.
methods: Proposes Cognitive pathways VQA (CopVQA), which improves multimodal predictions by emphasizing causal reasoning factors. CopVQA first builds a pool of pathways capturing diverse causal reasoning flows through interpreting and answering stages, decomposes the responsibility of each stage into distinct experts and a cognition-enabled component (CC), and finally prioritizes answer predictions governed by pathways involving both CCs while disregarding answers produced by a single CC.
results: Experiments on real-life and medical data confirm that CopVQA improves VQA performance and generalization across baselines and domains, achieving a new state of the art (SOTA) on PathVQA and accuracy comparable to the current SOTA on VQA-CPv2, VQAv2, and VQA RAD with one-fourth of the model size.
Abstract
Generalization in Visual Question Answering (VQA) requires models to answer questions about images with contexts beyond the training distribution. Existing attempts primarily refine unimodal aspects, overlooking enhancements in multimodal aspects. Besides, diverse interpretations of the input lead to various modes of answer generation, highlighting the role of causal reasoning between interpreting and answering steps in VQA. Through this lens, we propose Cognitive pathways VQA (CopVQA), which improves multimodal predictions by emphasizing causal reasoning factors. CopVQA first operates a pool of pathways that capture diverse causal reasoning flows through interpreting and answering stages. Mirroring human cognition, we decompose the responsibility of each stage into distinct experts and a cognition-enabled component (CC). The two CCs strategically execute one expert for each stage at a time. Finally, we prioritize answer predictions governed by pathways involving both CCs while disregarding answers produced by either CC, thereby emphasizing causal reasoning and supporting generalization. Our experiments on real-life and medical data consistently verify that CopVQA improves VQA performance and generalization across baselines and domains. Notably, CopVQA achieves a new state-of-the-art (SOTA) on the PathVQA dataset and comparable accuracy to the current SOTA on VQA-CPv2, VQAv2, and VQA RAD, with one-fourth of the model size.
CAMEL2: Enhancing weakly supervised learning for histopathology images by incorporating the significance ratio
for: Histopathology image analysis for cancer diagnosis
methods: Weakly supervised learning methods with coarse-grained labels at the image level
results: Comparable performance to fully supervised baselines in both instance- and slide-level classifications, with the help of 5,120x5,120 image-level binary annotations that are easy to annotate.
Abstract
Histopathology image analysis plays a crucial role in cancer diagnosis. However, training a clinically applicable segmentation algorithm requires pathologists to engage in labour-intensive labelling. In contrast, weakly supervised learning methods, which only require coarse-grained labels at the image level, can significantly reduce the labeling efforts. Unfortunately, while these methods perform reasonably well in slide-level prediction, their ability to locate cancerous regions, which is essential for many clinical applications, remains unsatisfactory. Previously, we proposed CAMEL, which achieves comparable results to those of fully supervised baselines in pixel-level segmentation. However, CAMEL requires 1,280x1,280 image-level binary annotations for positive WSIs. Here, we present CAMEL2, by introducing a threshold of the cancerous ratio for positive bags, it allows us to better utilize the information, consequently enabling us to scale up the image-level setting from 1,280x1,280 to 5,120x5,120 while maintaining the accuracy. Our results with various datasets, demonstrate that CAMEL2, with the help of 5,120x5,120 image-level binary annotations, which are easy to annotate, achieves comparable performance to that of a fully supervised baseline in both instance- and slide-level classifications.
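The significance-ratio thresholding for positive bags could, in spirit, look like the sketch below; the 0.1 threshold and the mask representation are placeholders, since the paper's tuned value is not given here:

```python
import numpy as np

def bag_label(instance_mask: np.ndarray, cancer_ratio_threshold: float = 0.1) -> bool:
    """Assign a binary bag label to a 5,120x5,120 region.

    instance_mask: boolean array marking instances annotated as cancerous.
    A bag counts as positive only if the cancerous ratio clears the threshold.
    """
    return instance_mask.mean() >= cancer_ratio_threshold
```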
results: The paper presents a corpus of 340 million tokens in 448 thousand web documents from six regions; the data can be used for language modeling and downstream tasks, offering enormous research potential for Asian Englishes.
Abstract
Language models have been foundations in various scenarios of NLP applications, but they have not been well applied in language variety studies, even for the most popular language like English. This paper represents one of the few initial efforts to utilize NLP technology in the paradigm of World Englishes, specifically in creating a multi-variety corpus for studying Asian Englishes. We present an overview of the CCAE -- Corpus of Chinese-based Asian English, a suite of corpora comprising six Chinese-based Asian English varieties. It is based on 340 million tokens in 448 thousand web documents from six regions. The ontology of the data makes the corpus a helpful resource with enormous research potential for Asian Englishes (especially for Chinese Englishes, for which there has not been a publicly accessible corpus so far) and an ideal source for variety-specific language modeling and downstream tasks, thus setting the stage for NLP-based World Englishes studies. Preliminary experiments on this corpus reveal the practical value of CCAE. Finally, we make CCAE available at https://huggingface.co/datasets/CCAE/CCAE-Corpus.
results: Achieves a regret upper bound of O(polylog T) for optimizing non-linear reward functions, significantly smaller than the classical regret lower bound of Omega(sqrt(T)).
Abstract
Kernelized bandits, also known as Bayesian optimization (BO), has been a prevalent method for optimizing complicated black-box reward functions. Various BO algorithms have been theoretically shown to enjoy upper bounds on their cumulative regret which are sub-linear in the number T of iterations, and a regret lower bound of Omega(sqrt(T)) has been derived which represents the unavoidable regrets for any classical BO algorithm. Recent works on quantum bandits have shown that with the aid of quantum computing, it is possible to achieve tighter regret upper bounds better than their corresponding classical lower bounds. However, these works are restricted to either multi-armed or linear bandits, and are hence not able to solve sophisticated real-world problems with non-linear reward functions. To this end, we introduce the quantum-Gaussian process-upper confidence bound (Q-GP-UCB) algorithm. To the best of our knowledge, our Q-GP-UCB is the first BO algorithm able to achieve a regret upper bound of O(polylog T), which is significantly smaller than its regret lower bound of Omega(sqrt(T)) in the classical setting. Moreover, thanks to our novel analysis of the confidence ellipsoid, our Q-GP-UCB with the linear kernel achieves a smaller regret than the quantum linear UCB algorithm from the previous work. We use simulations, as well as an experiment using a real quantum computer, to verify that the theoretical quantum speedup achieved by our Q-GP-UCB is also potentially relevant in practice.
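For orientation, the classical GP-UCB acquisition step that Q-GP-UCB builds on can be sketched as below; the quantum algorithm changes how the confidence width is constructed, which this classical sketch (with an assumed fixed beta) does not capture:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def gp_ucb_select(gp: GaussianProcessRegressor, candidates: np.ndarray, beta: float = 2.0):
    """Pick the next query point by maximizing the upper confidence bound.

    candidates: (m, d) array of candidate inputs; beta trades off
    exploitation (posterior mean) against exploration (posterior std).
    """
    mu, sigma = gp.predict(candidates, return_std=True)
    ucb = mu + np.sqrt(beta) * sigma
    return candidates[np.argmax(ucb)]
```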
Measuring Acoustics with Collaborative Multiple Agents
for: This paper aims to improve the efficiency and accuracy of measuring environment acoustics using multiple robots.
methods: The paper proposes using two robots to actively move and emit/receive sweep signals to measure the environment’s acoustics, and trains them using a collaborative multi-agent policy to explore the environment while minimizing prediction error.
results: The robots learn to collaborate and move to explore the environment acoustics while minimizing the prediction error, demonstrating the effectiveness of the proposed method.
Abstract
As humans, we hear sound every second of our life. The sound we hear is often affected by the acoustics of the environment surrounding us. For example, a spacious hall leads to more reverberation. Room Impulse Responses (RIR) are commonly used to characterize environment acoustics as a function of the scene geometry, materials, and source/receiver locations. Traditionally, RIRs are measured by setting up a loudspeaker and microphone in the environment for all source/receiver locations, which is time-consuming and inefficient. We propose to let two robots measure the environment's acoustics by actively moving and emitting/receiving sweep signals. We also devise a collaborative multi-agent policy where these two robots are trained to explore the environment's acoustics while being rewarded for wide exploration and accurate prediction. We show that the robots learn to collaborate and move to explore environment acoustics while minimizing the prediction error. To the best of our knowledge, we present the very first problem formulation and solution to the task of collaborative environment acoustics measurements with multiple agents.
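Estimating a room impulse response from an emitted sweep and its recording is commonly done by regularized frequency-domain deconvolution. The sketch below shows that generic estimator; it is an assumption about the measurement pipeline, not the paper's implementation:

```python
import numpy as np

def estimate_rir(recorded: np.ndarray, sweep: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Estimate an RIR by deconvolving the recorded signal with the sweep.

    recorded: microphone signal; sweep: the emitted excitation signal.
    Plain spectral division is noise-sensitive; real pipelines often use a
    regularized inverse sweep instead, which eps crudely approximates here.
    """
    n = len(recorded) + len(sweep) - 1
    R = np.fft.rfft(recorded, n)
    S = np.fft.rfft(sweep, n)
    H = R * np.conj(S) / (np.abs(S) ** 2 + eps)  # regularized deconvolution
    return np.fft.irfft(H, n)
```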
Molecular De Novo Design through Transformer-based Reinforcement Learning
results: Our model performs well across numerous tasks, including generating analogues to a query structure and producing compounds with particular attributes, clearly outperforming the baseline RNN-based methods. The approach can be used for scaffold hopping, library expansion from a single molecule, and generating compounds with high predicted activity.
Abstract
In this work, we introduce a method to fine-tune a Transformer-based generative model for molecular de novo design. Leveraging the superior sequence learning capacity of Transformers over Recurrent Neural Networks (RNNs), our model can generate molecular structures with desired properties effectively. In contrast to the traditional RNN-based models, our proposed method exhibits superior performance in generating compounds predicted to be active against various biological targets, capturing long-term dependencies in the molecular structure sequence. The model's efficacy is demonstrated across numerous tasks, including generating analogues to a query structure and producing compounds with particular attributes, outperforming the baseline RNN-based methods. Our approach can be used for scaffold hopping, library expansion starting from a single molecule, and generating compounds with high predicted activity against biological targets.
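A generic REINFORCE-style policy-gradient step for steering a sequence generator toward a reward is sketched below. The model interface (per-sequence log-likelihoods) and the mean-reward baseline are assumptions of this sketch; the paper's exact objective may differ:

```python
import torch

def reinforce_step(model, optimizer, smiles_tokens, rewards):
    """One policy-gradient update for a molecular sequence generator.

    model(smiles_tokens) is assumed to return per-sequence log-likelihoods;
    rewards come from a property predictor scoring the decoded molecules.
    """
    log_probs = model(smiles_tokens)                    # (B,) log-likelihoods
    rewards = torch.as_tensor(rewards, dtype=log_probs.dtype,
                              device=log_probs.device)
    baseline = rewards.mean()                           # variance reduction
    loss = -((rewards - baseline) * log_probs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```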
Universal Multi-modal Entity Alignment via Iteratively Fusing Modality Similarity Paths
results: Experiments on real-world datasets demonstrate the superiority of PathFusion over existing methods, with 22.4%-28.9% absolute improvement on Hits@1 and 0.194-0.245 absolute improvement on MRR.
Abstract
The objective of Entity Alignment (EA) is to identify equivalent entity pairs from multiple Knowledge Graphs (KGs) and create a more comprehensive and unified KG. The majority of EA methods have primarily focused on the structural modality of KGs, lacking exploration of multi-modal information. A few multi-modal EA methods have made good attempts in this field. Still, they have two shortcomings: (1) inconsistent and inefficient modality modeling that designs complex and distinct models for each modality; (2) ineffective modality fusion due to the heterogeneous nature of modalities in EA. To tackle these challenges, we propose PathFusion, consisting of two main components: (1) MSP, a unified modeling approach that simplifies the alignment process by constructing paths connecting entities and modality nodes to represent multiple modalities; (2) IRF, an iterative fusion method that effectively combines information from different modalities using the path as an information carrier. Experimental results on real-world datasets demonstrate the superiority of PathFusion over state-of-the-art methods, with 22.4%-28.9% absolute improvement on Hits@1, and 0.194-0.245 absolute improvement on MRR.
Generalized Neural Collapse for a Large Number of Classes
results: The paper demonstrates the generalized neural collapse phenomenon in practical deep neural networks, where the minimum one-vs-rest margin of the classifier is maximized, and provides both empirical and theoretical studies to establish its occurrence.
Abstract
Neural collapse provides an elegant mathematical characterization of learned last layer representations (a.k.a. features) and classifier weights in deep classification models. Such results not only provide insights but also motivate new techniques for improving practical deep models. However, most of the existing empirical and theoretical studies in neural collapse focus on the case that the number of classes is small relative to the dimension of the feature space. This paper extends neural collapse to cases where the number of classes is much larger than the dimension of feature space, which broadly occur for language models, retrieval systems, and face recognition applications. We show that the features and classifier exhibit a generalized neural collapse phenomenon, where the minimum one-vs-rest margin is maximized. We provide an empirical study to verify the occurrence of generalized neural collapse in practical deep neural networks. Moreover, we provide a theoretical study to show that generalized neural collapse provably occurs under the unconstrained feature model with a spherical constraint, under certain technical conditions on feature dimension and number of classes.
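The minimum one-vs-rest margin that generalized neural collapse maximizes can be computed directly from last-layer features and classifier weights, as in this small sketch (bias terms omitted for simplicity):

```python
import torch

def min_one_vs_rest_margin(features: torch.Tensor, weights: torch.Tensor,
                           labels: torch.Tensor) -> torch.Tensor:
    """margin_i = w_{y_i}.f_i - max_{j != y_i} w_j.f_i; returns its minimum.

    features: (N, d) last-layer features; weights: (C, d) classifier rows.
    """
    logits = features @ weights.t()                     # (N, C)
    true_logit = logits.gather(1, labels.unsqueeze(1)).squeeze(1)
    # Mask out the true class before taking the runner-up logit.
    logits.scatter_(1, labels.unsqueeze(1), float("-inf"))
    runner_up = logits.max(dim=1).values
    return (true_logit - runner_up).min()
```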
results: Experiments on both synthetic and real-world datasets (including data collected from production systems) show that CIL consistently outperforms strong baselines.
Abstract
Invariance learning methods aim to learn invariant features in the hope that they generalize under distributional shifts. Although many tasks are naturally characterized by continuous domains, current invariance learning techniques generally assume categorically indexed domains. For example, auto-scaling in cloud computing often needs a CPU utilization prediction model that generalizes across different times (e.g., time of a day and date of a year), where `time' is a continuous domain index. In this paper, we start by theoretically showing that existing invariance learning methods can fail for continuous domain problems. Specifically, the naive solution of splitting continuous domains into discrete ones ignores the underlying relationship among domains, and therefore potentially leads to suboptimal performance. To address this challenge, we then propose Continuous Invariance Learning (CIL), which extracts invariant features across continuously indexed domains. CIL is a novel adversarial procedure that measures and controls the conditional independence between the labels and continuous domain indices given the extracted features. Our theoretical analysis demonstrates the superiority of CIL over existing invariance learning methods. Empirical results on both synthetic and real-world datasets (including data collected from production systems) show that CIL consistently outperforms strong baselines among all the tasks.
SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF
results: Experiments show that SteerLM generates more helpful and higher-quality responses than a range of baselines trained with RLHF, while being much easier to train.
Abstract
Model alignment with human preferences is an essential step in making Large Language Models (LLMs) helpful and consistent with human values. It typically consists of supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) stages. However, RLHF faces inherent limitations stemming from a complex training setup and its tendency to align the model with implicit values that end users cannot control at run-time. Moreover, reward models in RLHF stage commonly rely on single-dimensional feedback as opposed to explicit, multifaceted signals that indicate attributes such as helpfulness, humor, and toxicity. To address these limitations, we propose SteerLM, a supervised fine-tuning method that empowers end-users to control responses during inference. SteerLM conditions responses to conform to an explicitly defined multi-dimensional set of attributes, thereby empowering a steerable AI capable of generating helpful and high-quality responses while maintaining customizability. Experiments show that SteerLM trained on open source datasets generates responses that are preferred by human and automatic evaluators to many state-of-the-art baselines trained with RLHF while being much easier to train. Try SteerLM at https://huggingface.co/nvidia/SteerLM-llama2-13B
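Attribute-conditioned SFT amounts to prepending an explicit, multi-dimensional attribute string to each training sample so the model learns to condition on it, and steering at inference means changing that string. The template and attribute names below are hypothetical illustrations, not NVIDIA's actual SteerLM format:

```python
def format_steerlm_style_sample(prompt: str, response: str, attributes: dict) -> str:
    """Hypothetical attribute-conditioned training sample, SteerLM-style.

    attributes: e.g. {"helpfulness": 4, "humor": 9, "toxicity": 0}; the
    tag names, scales, and layout here are invented for illustration.
    """
    attr_str = ",".join(f"{k}:{v}" for k, v in attributes.items())
    return (f"<user>{prompt}</user>"
            f"<attributes>{attr_str}</attributes>"
            f"<assistant>{response}</assistant>")

# At inference, a user could request e.g. {"helpfulness": 9, "humor": 0}
# to steer the response without retraining.
```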
Investigating Continuous Learning in Spiking Neural Networks
results: Preliminary results indicate that the SNN models can retain some information about previous classes, but further research is needed. All models correctly identify the current classes, yet the SNN models assign higher-than-normal output probabilities to the actual previous classes, suggesting they have the potential to overcome catastrophic forgetting, though much work and refinement remain.
Abstract
In this paper, the use of third-generation machine learning, also known as spiking neural network architecture, for continuous learning was investigated and compared to conventional models. The experimentation was divided into three separate phases. The first phase focused on training the conventional models via transfer learning. The second phase trains a model from the Nengo library. Lastly, each conventional model is converted into a spiking neural network and trained. Initial results from phase 1 are in line with known knowledge about continuous learning within current machine learning literature. All models were able to correctly identify the current classes, but they would immediately see a sharp performance drop in previous classes due to catastrophic forgetting. However, the SNN models were able to retain some information about previous classes. Although many of the previous classes were still identified as the current trained classes, the output probabilities showed a higher-than-normal value for the actual class. This indicates that the SNN models do have the potential to overcome catastrophic forgetting, but much work is still needed.
A Critical Look at Classic Test-Time Adaptation Methods in Semantic Segmentation
results: The study finds that these classic TTA methods underperform on semantic segmentation: the batch norm updating strategy brings only slight improvement and can even hurt results, the teacher-student scheme stabilizes training but does not directly improve performance, and segmentation TTA suffers from a severe long-tailed imbalance problem that is substantially more complex than in classification TTA.
Abstract
Test-time adaptation (TTA) aims to adapt a model, initially trained on training data, to potential distribution shifts in the test data. Most existing TTA studies, however, focus on classification tasks, leaving a notable gap in the exploration of TTA for semantic segmentation. This pronounced emphasis on classification might lead numerous newcomers and engineers to mistakenly assume that classic TTA methods designed for classification can be directly applied to segmentation. Nonetheless, this assumption remains unverified, posing an open question. To address this, we conduct a systematic, empirical study to disclose the unique challenges of segmentation TTA, and to determine whether classic TTA strategies can effectively address this task. Our comprehensive results have led to three key observations. First, the classic batch norm updating strategy, commonly used in classification TTA, only brings slight performance improvement, and in some cases it might even adversely affect the results. Even with the application of advanced distribution estimation techniques like batch renormalization, the problem remains unresolved. Second, the teacher-student scheme does enhance training stability for segmentation TTA in the presence of noisy pseudo-labels. However, it cannot directly result in performance improvement compared to the original model without TTA. Third, segmentation TTA suffers a severe long-tailed imbalance problem, which is substantially more complex than that in TTA for classification. This long-tailed challenge significantly affects segmentation TTA performance, even when the accuracy of pseudo-labels is high. In light of these observations, we conclude that TTA for segmentation presents significant challenges, and simply using classic TTA methods cannot address this problem well.
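The "classic batch norm updating strategy" examined in the study can be sketched as follows: discard the source-domain running statistics and let the test batches drive normalization. Details such as the momentum handling are assumptions of this sketch:

```python
import torch.nn as nn

def reset_bn_for_tta(model: nn.Module, momentum=None) -> nn.Module:
    """Classic batch-norm adaptation for test-time adaptation.

    Drops the source-domain running statistics so each test batch's own
    statistics are used; the study finds this brings only slight gains
    (and sometimes hurts) for semantic segmentation.
    """
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            module.reset_running_stats()
            module.momentum = momentum   # None => cumulative moving average
            module.train()               # recompute stats from test batches
    return model
```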
Enhancing Long-form Text Generation Efficacy with Task-adaptive Tokenization
results: On psychological question-answering tasks in both Chinese and English, task-adaptive tokenization brings a significant improvement in generation performance while using up to 60% fewer tokens; preliminary experiments combining the approach with very large language models also show promising results.
Abstract
We propose task-adaptive tokenization as a way to adapt the generation pipeline to the specifics of a downstream task and enhance long-form generation in mental health. Inspired by insights from cognitive science, our task-adaptive tokenizer samples variable segmentations from multiple outcomes, with sampling probabilities optimized based on task-specific data. We introduce a strategy for building a specialized vocabulary and introduce a vocabulary merging protocol that allows for the integration of task-specific tokens into the pre-trained model's tokenization step. Through extensive experiments on psychological question-answering tasks in both Chinese and English, we find that our task-adaptive tokenization approach brings a significant improvement in generation performance while using up to 60% fewer tokens. Preliminary experiments point to promising results when using our tokenization approach with very large language models.
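At the level of tooling, the vocabulary merging protocol resembles adding new tokens to a pretrained tokenizer and resizing the model's embedding matrix, as in this Hugging Face sketch; the token strings and base model are invented examples, not the paper's learned vocabulary:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical task-specific entries; the paper derives its vocabulary
# from task-specific data rather than hand-picking strings like these.
task_tokens = ["cognitive_reframing", "self_compassion"]
num_added = tokenizer.add_tokens(task_tokens)

# New embedding rows are randomly initialized and then fine-tuned
# on the downstream task.
model.resize_token_embeddings(len(tokenizer))
```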